The first program we will be profiling fills the first array by generating a random number for each element.
The second array is produced by calling a separate function to calculate each element one by one.
The original sample size of 500,000 was too small to measure, so I added two zeros to make it 50,000,000. The average run time is around 4 seconds, with no other users on the machine and very low system load.
****** Using the gprof profiler *****************
I compiled the program using: gcc -pg -O2 $sourceCode -o $outputName
Note: switching the -O2 to any of -O0/-O1/-O2/-O3 did very little to change the speed of the application.
I timed 3 separate runs with the time command, with me as the only user logged in and a system load (reported by uptime) of:
23:17:46 up 24 days, 1:36, 1 user, load average: 0.04, 0.07, 0.02
The result was an average run time of 4.5 seconds.
Each sample counts as 0.01 seconds.
--How to use the GPROF profiler
Compile the program with the -pg flag.
Execute your program normally; this creates a file named gmon.out in the current directory.
Execute the following cmd: gprof $executable gmon.out
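Putting the whole workflow together (the filenames here are just placeholders for this lab, assuming the first version of the program lives in vol1.c):
--
$ gcc -pg -O2 vol1.c -o vol1           # build with gprof instrumentation
$ ./vol1                               # normal run; writes gmon.out in the current directory
$ gprof ./vol1 gmon.out > report.txt   # generate the flat profile and call graph
--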
Running gprof then prints the reports shown below.
Flat Profile:
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 86.21      0.80     0.80                             main
 11.93      0.91     0.11  50000000    2.22     2.22  scale_sample
Call Graph
index % time    self  children    called     name
                                             <spontaneous>
[1]    100.0    0.86    0.08                 main [1]
                0.08    0.00 50000000/50000000     scale_sample [2]
-----------------------------------------------
                0.08    0.00 50000000/50000000     main [1]
[2]      8.6    0.08    0.00 50000000         scale_sample [2]
===============================================
Analysis of GPROF
The first report, the "flat profile", gives a summary of the time spent in each function of the program, both in seconds and as a percentage of the total run time.
"calls" is how many times the function was called.
"ns/call" is the average time spent inside the function per call.
The call graph is similar to the flat profile, but with two notable differences. The first is the "children" column, which shows how much time was spent in the functions that a given function calls (its children). The second is that the "called" column is now written with a slash: calls received from that particular caller / total calls to the function.
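As a quick sanity check of those columns: scale_sample was called 50,000,000 times at an average of 2.22 ns per call, and 50,000,000 × 2.22 ns ≈ 0.11 s, which matches its "self seconds" value in the flat profile.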
===============================================
===============================================
*****************Using perf*****************
This is my first time using perf and it is not like gprof at all. It took me some time to get used to what I was looking at. From what I can tell, perf works much closer to the disassembled code than gprof does.
--HOW TO USE PERF
command: perf record $executable   (samples the program's run and writes the data to perf.data)
command: perf report   (reads perf.data by default; use -i $file to point it at a different data file)
I found a great site online with a tutorial on perf, and it is helping me learn how to fully utilize the profiler. Here is the link -> http://sandsoftwaresound.net/perf/perf-tutorial-hot-spots/
So the first thing we need to do, according to the guide, is recompile our program with the -ggdb option; this flag adds debugging information so that perf can map samples back to the source code.
Now if we want to profile our program we can simply type into the shell:$ perf record $executable
This writes a data file (perf.data) whose report we can view by executing:$ perf report
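Putting the perf workflow together (again, the filenames are placeholders, assuming vol1.c is the source of the first version):
--
$ gcc -ggdb -pg -O2 vol1.c -o vol1     # -ggdb adds the debug info perf uses for annotation
$ perf record ./vol1                   # samples the run and writes perf.data
$ perf report                          # opens the interactive report read from perf.data
--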
It will show us something like this below.
--
29.03% vol1 libc-2.27.so [.] _mcount@@GLIBC_2.18
18.49% vol1 vol1 [.] main
4.19% vol1 libc-2.27.so [.] __random_r
1.94% vol1 vol1 [.] scale_sample
0.64% vol1 vol1 [.] _mcount@plt
0.62% vol1 vol1 [.] rand@plt
0.16% vol1 libc-2.27.so [.] rand
0.09% vol1 [unknown] [k] 0xffff000008083514
0.00% vol1 libc-2.27.so [.] _dl_addr
0.00% vol1 ld-2.27.so [.] do_lookup_x
0.00% vol1 ld-2.27.so [.] strchr
0.00% vol1 ld-2.27.so [.] memcpy
0.00% vol1 ld-2.27.so [.] _dl_start
0.00% vol1 ld-2.27.so [.] _start
0.00% vol1 [unknown] [k] 0xffff0000080814dc
--
As you can tell, this gives us a lot more depth into what is going on in the program. We see a mixture of functions we wrote ourselves and functions from the C library and dynamic loader that run without us ever calling them directly (the _mcount entries, for example, are the counting hooks the -pg flag compiles in for gprof). The above is actually a selection screen: if we move the cursor to our scale_sample function and hit enter, it takes us to another menu, shown below.
--
Annotate scale_sample ◆
Zoom into vol2 thread ▒
Zoom into vol2 DSO ▒
Browse map details ▒
Run scripts for samples of symbol [scale_sample] ▒
Run scripts for all samples ▒
Switch to another data file in PWD ▒
Exit
------------------
Focusing on just the first option, "Annotate scale_sample", takes us to a separate screen showing what is actually going on inside the function as disassembled instructions, interleaved with the original source lines (which is what -ggdb makes possible), as seen below.
--------------------
Percent│
│
│
│ Disassembly of section .text:
│
│ 0000000000400674 <scale_sample>:
│ scale_sample():
│ #include <stdint.h>
│ #include "vol.h"
│
│ // Function to scale a sound sample using a volume_factor
│ // in the range of 0.00 to 1.00.
│ static inline int16_t scale_sample(int16_t sample, float volume_factor) {
12.39 │ sub sp, sp, #0x10
0.37 │ strh w0, [sp, #14]
12.27 │ str s0, [sp, #8]
│ return (int16_t) (volume_factor * (float) sample);
62.09 │ ldrsh w0, [sp, #14]
│ scvtf s1, w0
1.35 │ ldr s0, [sp, #8]
│ fmul s0, s1, s0
│ fcvtzs w0, s0
│ sxth w0, w0
│ }
11.53 │ add sp, sp, #0x10
│ ← ret
------------------
So from the above output we can tell exactly which instructions inside the function are taking the most time according to the CPU's samples. These are AArch64 assembly instructions, so they do not represent our original source code directly but rather what the CPU is actually executing.
------------------------------------
MODIFYING THE PERF RECORD COMMAND below
------------------------------------
The other interesting facet I found with perf is that if I modify the cmd:$ perf record $program and append the "-e" option, I can profile specific hardware, software, cache, or tracepoint events.
So if I enter the command:$ perf record -e cpu-clock,faults ./vol2
the profiler will collect data on the cpu-clock and page-faults events ONLY, which means the default profiling data we collected above is not gathered in this run. You can find the full list of event names that can be passed to "-e" by entering the cmd:$ perf list
I find this a great option for measuring specific kinds of resource consumption.
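Since the event list is long, it helps that perf list can be narrowed by category (these category keywords come from the perf documentation; the exact set of events shown can vary with the perf version):
--
$ perf list sw       # software events such as cpu-clock, page-faults, context-switches
$ perf list hw       # hardware counters such as cycles, instructions, cache-misses
$ perf list cache    # hardware cache events
--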
------------------------------------
MODIFYING THE PERF REPORT COMMAND below
------------------------------------
Running perf report --stdio prints the report as plain text, covering the sampled functions from both kernel and user space, with one section per recorded event.
# Samples: 13K of event 'cpu-clock:u'
# Event count (approx.): 3282750000
#
# Overhead Command Shared Object Symbol
# ........ ....... ............. .........................
#
56.61% vol2 libc-2.27.so [.] __random
31.51% vol2 vol2 [.] main
5.33% vol2 libc-2.27.so [.] __random_r
5.24% vol2 vol2 [.] scale_sample
0.78% vol2 vol2 [.] rand@plt
0.30% vol2 libc-2.27.so [.] __GI___memset_generic
0.21% vol2 libc-2.27.so [.] rand
0.01% vol2 ld-2.27.so [.] dl_main
0.01% vol2 ld-2.27.so [.] strcmp
---
The output above shows how much of the cpu-clock event's samples each function accounted for; a separate section is produced for the page-faults event, which I have not included in this example.
====================================================================
NEXT WE WILL COMPARE THESE RESULTS TO A DIFFERENT ALGORITHM
====================================================================
-> This is how the old algorithm calculated the output.
We have 2 arrays, in and out.
The "in" array is filled with random numbers, the "out" array is calculated by calling a function which calculates the value of the "in" array by 0.75 which represents the sample size. This is done recursively inside of a for loop, so this function is called 50 million times based on our array size.
-> For the new algorithm
We have 2 arrays, in and out.
The "in" array is filled with random numbers, the "out" array is calculated by calling a function which calculates all the values inside of the array rather than bouncing back and forth between "main" and the function being called. This variation is pre-calculating all the values instead of calculating them on demand.
Our newly defined function below looks like:
int16_t* calculate_sample(int16_t* inn, float sample)
{
    int16_t* outt;

    // Allocate the output array (SAMPLES elements, zero-initialized by calloc).
    outt = (int16_t*) calloc(SAMPLES, sizeof(int16_t));

    // Scale every input sample in a single pass.
    for (int x = 0; x < SAMPLES; x++) {
        outt[x] = inn[x] * sample;
    }
    return outt;
}
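For context, a minimal sketch of how main might use it (assuming the "in" array is already filled with random samples; the variable names here are illustrative):
// One call replaces 50,000,000 individual scale_sample() calls.
int16_t* out = calculate_sample(in, 0.75);
// ... write out the result ...
free(out);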
Running the compiled program under the time command, with a system load of:
[jadach@aarchie b_lab5]$ uptime
11:00:39 up 25 days, 13:19, 2 users, load average: 0.00, 0.00, 0.00
produces the following output:
[jadach@aarchie b_lab5]$ time ./vol1
Result: 416
real 0m3.263s
user 0m3.089s
sys 0m0.170s
[jadach@aarchie b_lab5]$ time ./vol1
Result: 416
real 0m3.243s
user 0m3.119s
sys 0m0.120s
[jadach@aarchie b_lab5]$ time ./vol1
Result: 416
real 0m3.232s
user 0m3.107s
sys 0m0.120s
[jadach@aarchie b_lab5]$ time ./vol1
Result: 416
real 0m3.233s
user 0m3.068s
sys 0m0.160s
--------
So just from pre-calculating our array in this new program, the average run time drops to about 3.25 seconds, versus about 4.25 seconds for the original version, roughly a 24% reduction in run time.
-------------------------------------
ANALYZING WITH GPROF
--------------------------------------
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 57.50      0.66     0.66                             main
 43.34      1.15     0.49        1   494.12   494.12  calculate_sample
Call graph (explanation follows)
granularity: each sample hit covers 2 byte(s) for 0.87% of 1.15 seconds
index % time    self  children    called     name
                                             <spontaneous>
[1]    100.0    0.66    0.49                 main [1]
                0.49    0.00       1/1           calculate_sample [2]
-----------------------------------------------
                0.49    0.00       1/1           main [1]
[2]     43.0    0.49    0.00       1             calculate_sample [2]
The share of time spent inside the function, compared to main, is much higher now than with our original algorithm, yet the application as a whole is much faster. I suspect this is because far less time is spent bouncing back and forth between main and the function: one call instead of 50 million eliminates almost all of the call overhead.
=====================================================================
CONCLUSION
=====================================================================
We were able to improve the performance of the program by altering one of the algorithms. Pre-calculating the whole array inside a single function call means far fewer jumps between main and the helper function, which reduces the call overhead and the time the CPU spends fetching instructions and data from memory.