
Lab 5: Working with Profilers

Updated: Oct 3, 2018

The first program we will be profiling fills the first array with a randomly generated number for each element.

The second array is generated by calling a separate function to calculate each element of the array one by one.

The original sample size of 500,000 was too small to measure, so I added two zeroes to make it 50,000,000. The average run time is around 4 seconds, with no other users on the machine and very low system load.
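For reference, here is a minimal sketch of what this first test program looks like. The scale_sample body is taken from the perf annotate listing further down; the driver code around it (the SAMPLES count, the rand() range, the 0.75 volume factor, and the result summary) is my reconstruction rather than the exact lab code.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define SAMPLES 50000000

// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00 (body from the annotate listing below)
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
    return (int16_t) (volume_factor * (float) sample);
}

int main() {
    int16_t* in  = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
    int16_t* out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));

    // First array: a randomly generated number for each element
    for (int x = 0; x < SAMPLES; x++)
        in[x] = rand() % 65536 - 32768;

    // Second array: a separate function call for each element
    for (int x = 0; x < SAMPLES; x++)
        out[x] = scale_sample(in[x], 0.75);

    // Summarize the output so the work cannot be optimized away
    int16_t result = 0;
    for (int x = 0; x < SAMPLES; x++)
        result ^= out[x];
    printf("Result: %d\n", result);

    free(in);
    free(out);
    return 0;
}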


****** Using the gprof profiler *****************


I compiled the program using: gcc -pg -O2 $sourceCode -o $outputName

Note: switching between the optimization levels -O0, -O1, -O2, and -O3 did very little to change the speed of the application.


Using the time command on 3 separate runs, with me as the only user logged in and a very low system load (uptime reported: 23:17:46 up 24 days, 1:36, 1 user, load average: 0.04, 0.07, 0.02), the average run time came out to about 4.5 seconds.


--How to use the GPROF profiler

Compile with the -pg flag.

Execute your program normally; this creates a file named gmon.out.

Execute the following cmd: gprof $executable gmon.out
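Putting those steps together, a complete session looks something like this (the file names are placeholders):

$ gcc -pg -O2 vol1.c -o vol1   # build with profiling instrumentation
$ ./vol1                       # a normal run writes gmon.out to the current directory
$ gprof vol1 gmon.out          # print the flat profile and call graph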

The following reports are produced:


Flat Profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls  ns/call  ns/call  name
 86.21      0.80     0.80                              main
 11.93      0.91     0.11  50000000     2.22     2.22  scale_sample


Call Graph

index % time    self  children    called     name
                                             <spontaneous>
[1]    100.0    0.86    0.08                 main [1]
                0.08    0.00  50000000/50000000     scale_sample [2]
-----------------------------------------------
                0.08    0.00  50000000/50000000     main [1]
[2]      8.6    0.08    0.00  50000000              scale_sample [2]

===============================================

Analysis of GPROF


The first report, the "flat profile", gives a synopsis of the actual time, as well as the percentage of total run time, spent in each function of the program.

The calls column shows how many times the function was called.

ns/call represents the average time spent inside the function per call.
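As a quick sanity check, the ns/call value follows from the other columns: 0.11 seconds of self time divided by 50,000,000 calls works out to 2.2 ns per call, which matches the table.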


The call graph is similar to the flat profile but has two main differences. The first is the children column, which shows how much time a function's callees (its children) consumed on its behalf. The second is that the "called" column is now split with a slash: the number of calls coming from that particular caller over the total number of calls to the function.

===============================================

===============================================

*****************Using perf*****************


This is my first time using perf and it is not like gprof at all. It took me some time to get used to what I was looking at. From what I could gather, perf deals more with the disassembly of the code than anything else.


--HOW TO USE PERF

command: perf record $executable

command: perf report (by default this reads the perf.data file that perf record created; use -i $file to read a different file)


I found a great site online which is like a tutorial on perf, and it is helping me learn how to fully utilize the profiler. Here is the link -> http://sandsoftwaresound.net/perf/perf-tutorial-hot-spots/


So the first thing we need to do is recompile our program according to the guide using the -ggdb option; this flag adds debugging information, which helps perf map its samples back to the source code.
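Keeping the flags used earlier and adding -ggdb, the compile command becomes something like this (the source file name is a placeholder):

$ gcc -ggdb -pg -O2 vol1.c -o vol1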


Now if we want to profile our program we can simply type into the shell: $ perf record $executable

This will give us a report which we can view by executing: $ perf report


It will show us something like this below.


--

 29.03%  vol1  libc-2.27.so  [.] _mcount@@GLIBC_2.18
 18.49%  vol1  vol1          [.] main
  4.19%  vol1  libc-2.27.so  [.] __random_r
  1.94%  vol1  vol1          [.] scale_sample
  0.64%  vol1  vol1          [.] _mcount@plt
  0.62%  vol1  vol1          [.] rand@plt
  0.16%  vol1  libc-2.27.so  [.] rand
  0.09%  vol1  [unknown]     [k] 0xffff000008083514
  0.00%  vol1  libc-2.27.so  [.] _dl_addr
  0.00%  vol1  ld-2.27.so    [.] do_lookup_x
  0.00%  vol1  ld-2.27.so    [.] strchr
  0.00%  vol1  ld-2.27.so    [.] memcpy
  0.00%  vol1  ld-2.27.so    [.] _dl_start
  0.00%  vol1  ld-2.27.so    [.] _start
  0.00%  vol1  [unknown]     [k] 0xffff0000080814dc

--

As you can tell, this gives us a lot more depth into what is going on in the program. We have a mixture of functions that we wrote and functions from the C library and dynamic loader that run without us explicitly calling them. The above is actually a selection screen, so if we highlight our scale_sample function and hit enter it will take us to another selection screen as shown below.


--

Annotate scale_sample

Zoom into vol2 thread

Zoom into vol2 DSO

Browse map details

Run scripts for samples of symbol [scale_sample]

Run scripts for all samples

Switch to another data file in PWD

Exit

------------------


Focusing on just the first option, "Annotate scale_sample", takes us to a separate screen with a look at what is actually going on inside the function, shown as annotated disassembly below.

--------------------

Percent│
       │    Disassembly of section .text:
       │
       │    0000000000400674 <scale_sample>:
       │    scale_sample():
       │    #include <stdint.h>
       │    #include "vol.h"
       │    // Function to scale a sound sample using a volume_factor
       │    // in the range of 0.00 to 1.00.
       │    static inline int16_t scale_sample(int16_t sample, float volume_factor) {
 12.39 │      sub    sp, sp, #0x10
  0.37 │      strh   w0, [sp, #14]
 12.27 │      str    s0, [sp, #8]
       │    return (int16_t) (volume_factor * (float) sample);
 62.09 │      ldrsh  w0, [sp, #14]
       │      scvtf  s1, w0
  1.35 │      ldr    s0, [sp, #8]
       │      fmul   s0, s1, s0
       │      fcvtzs w0, s0
       │      sxth   w0, w0
       │    }
 11.53 │      add    sp, sp, #0x10
       │    ← ret

------------------


So from the above output we can tell exactly which part of the function is taking the most time to execute according to the CPU; here, for example, the ldrsh instruction that reloads the sample from the stack accounts for about 62% of the samples. These instructions are AArch64 assembly, so they do not represent our original source code line for line but rather what is actually executing on the CPU.


------------------------------------

MODIFYING THE PERF RECORD COMMAND below

------------------------------------


The other interesting facet I found with perf is that if I modify the command $ perf record $program and append the "-e" option, I can profile specific hardware, software, hardware cache, or kernel tracepoint events.


So if I enter the command: $ perf record -e cpu-clock,faults ./vol2


Now the profiler will collect data on cpu-clock and page faults ONLY, which means the default event data we collected above won't be gathered in this run. You can also find additional event names to pass to the "-e" option by entering the command: $ perf list

I find this a great option for measuring specific resource consumption; see the example below.
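For instance, to see which event names the machine supports and then record a couple of common hardware events instead (these particular event names are just an illustration; availability varies by CPU):

$ perf list

$ perf record -e cache-misses,branch-misses ./vol2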


------------------------------------

MODIFYING THE PERF REPORT COMMAND below

------------------------------------


Running perf report --stdio prints the report as plain text instead of the interactive browser, showing all the functions associated with the kernel and user space.


# Samples: 13K of event 'cpu-clock:u'
# Event count (approx.): 3282750000
#
# Overhead  Command  Shared Object  Symbol
# ........  .......  .............  .........................
#
    56.61%  vol2     libc-2.27.so   [.] __random
    31.51%  vol2     vol2           [.] main
     5.33%  vol2     libc-2.27.so   [.] __random_r
     5.24%  vol2     vol2           [.] scale_sample
     0.78%  vol2     vol2           [.] rand@plt
     0.30%  vol2     libc-2.27.so   [.] __GI___memset_generic
     0.21%  vol2     libc-2.27.so   [.] rand
     0.01%  vol2     ld-2.27.so     [.] dl_main
     0.01%  vol2     ld-2.27.so     [.] strcmp


---

The above output shows, per function, the share of time occupying the cpu-clock; a separate section is also produced for the page-faults event, which I did not include in this example.
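As a rough cross-check (assuming, as I understand it, that the cpu-clock event count is reported in nanoseconds), 3,282,750,000 works out to about 3.28 seconds of CPU time, which lines up with the run times measured below.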


====================================================================

NEXT WE WILL COMPARE THESE RESULTS TO A DIFFERENT ALGORITHM

====================================================================


-> This is how the old algorithm calculated the output.

We have 2 arrays, in and out.

The "in" array is filled with random numbers; the "out" array is calculated by calling a function that scales each value of the "in" array by 0.75, which represents the volume factor. The call happens inside a for loop, once per element, so the function is called 50 million times for our array size.


-> For the new algorithm

We have 2 arrays, in and out.

The "in" array is filled with random numbers; the "out" array is calculated with a single call to a function that computes every value in the array, rather than bouncing back and forth between "main" and the called function. This variation calculates all the values in one pass instead of one call at a time.



Our newly defined function below looks like:

int16_t* calculate_sample(int16_t* inn, float sample) /* sample is the 0.75 volume factor */
{
    /* needs <stdint.h> and <stdlib.h>; SAMPLES is the 50,000,000 element count */
    int16_t* outt = (int16_t*) calloc(SAMPLES, sizeof(int16_t));

    /* scale the whole input array in a single pass */
    for (int x = 0; x < SAMPLES; x++) {
        outt[x] = inn[x] * sample;
    }

    return outt;
}
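To make the difference concrete, here is roughly how main uses each version; the array names and the 0.75 volume factor follow my earlier reconstruction, not the exact lab code.

/* Old algorithm: one function call per element, 50 million calls total */
for (int x = 0; x < SAMPLES; x++)
    out[x] = scale_sample(in[x], 0.75);

/* New algorithm: a single call computes the entire output array */
int16_t* out2 = calculate_sample(in, 0.75);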



Running the new program after compilation with the time command, with the system load at the time shown by uptime:

[jadach@aarchie b_lab5]$ uptime

11:00:39 up 25 days, 13:19, 2 users, load average: 0.00, 0.00, 0.00


Presents the following output


[jadach@aarchie b_lab5]$ time ./vol1

Result: 416


real 0m3.263s

user 0m3.089s

sys 0m0.170s

[jadach@aarchie b_lab5]$ time ./vol1

Result: 416


real 0m3.243s

user 0m3.119s

sys 0m0.120s

[jadach@aarchie b_lab5]$ time ./vol1

Result: 416


real 0m3.232s

user 0m3.107s

sys 0m0.120s

[jadach@aarchie b_lab5]$ time ./vol1

Result: 416


real 0m3.233s

user 0m3.068s

sys 0m0.160s



--------

So just from pre-calculating our array, the new program executes in about 3.25 seconds on average versus about 4.25 seconds for the original, providing roughly a 22% speed increase.



-------------------------------------

ANALYZING WITH GPROF

--------------------------------------


Flat profile:


Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 57.50      0.66     0.66                             main
 43.34      1.15     0.49        1   494.12   494.12  calculate_sample


Call graph (explanation follows)

granularity: each sample hit covers 2 byte(s) for 0.87% of 1.15 seconds

index % time    self  children    called     name
                                             <spontaneous>
[1]    100.0    0.66    0.49                 main [1]
                0.49    0.00       1/1           calculate_sample [2]
-----------------------------------------------
                0.49    0.00       1/1           main [1]
[2]     43.0    0.49    0.00       1         calculate_sample [2]


The proportion of time spent in the function versus main is a lot higher now compared to our original algorithm, but the application is much faster. I suspect this is because less time is spent bouncing between memory locations for 50 million separate calls.


=====================================================================

CONCLUSION

=====================================================================


We were able to improve the performance of the program by altering one of the algorithms. Pre-calculating the whole array in the new function meant far fewer jumps between memory locations, which reduced the time the CPU spent fetching data from memory.



