Performance Lab
Notes
- Due date: Nov 22nd, 2024, 11:55pm PST
- This is an individual assignment – all code must be your own – no code sharing.
It is fine to discuss optimizations with another student, but if you do, please write their name as a collaborator at the top of kernels.c.
- We made a few additions to this lab this year … apologies in advance for any problems.
- Lab download / scoreboard website
- After you untar the lab download, please do not rename the folder; keep it as “target-NUMBER”.
- Lab TA: Jake
Introduction
In this lab, your job is to optimize four simple “kernels” to make them faster and more energy efficient!
More specifically, given suboptimal implementations of the four kernels below, your goal is to use the optimization techniques you learned in class to speed up the computation and improve “energy efficiency”. In this class, we approximate energy efficiency as the reduction in data movement across the levels of the memory hierarchy (i.e., you should try to hit in the L1 cache as much as possible).
You’ll have to use some of the optimization techniques covered in class, especially the cache and memory ones.
Kernel 1: Compute absolute value (warm up)
Compute the elementwise absolute value of a 1D floating point vector.
Hints
- The control flow here is unpredictable, causing performance problems. Can you do this without control flow? (A sketch follows these hints.)
- Don’t worry about vector instructions for this one.
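One branch-free approach is to clear the sign bit instead of testing it; fabsf() from math.h typically compiles to exactly that mask. A minimal sketch (the function name and signature are illustrative, not the lab’s actual interface):

#include <math.h>

/* Branch-free elementwise absolute value: fabsf() just clears the sign
   bit, so there is no data-dependent branch to mispredict. */
void abs_kernel(long n, const float *in, float *out) {
    for (long i = 0; i < n; i++)
        out[i] = fabsf(in[i]);
}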
Kernel 2: 2D Transpose
In 2D transpose, we take a 2D input matrix (Ni by Nj), and swap its rows and columns to create a 2D output matrix (Nj by Ni).
We can define the transpose mathematically as follows:
\[Out_{i,j} = In_{j,i}\]
Hints
- Think about the cache access pattern – it seems like we get bad spatial locality on either the In or Out array.
- Can we use tiling to help? (A sketch follows these hints.)
- Don’t worry about vector instructions for this one.
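For reference, a hedged tiling sketch (the signature is illustrative, and TILE is a tunable parameter; 32 is only a starting guess):

#define TILE 32

/* Tiled 2D transpose: each TILE x TILE block of In and Out fits in cache
   at the same time, so both arrays get spatial locality instead of one. */
void transpose(int ni, int nj, const float in[ni][nj], float out[nj][ni]) {
    for (int ii = 0; ii < ni; ii += TILE)
        for (int jj = 0; jj < nj; jj += TILE)
            for (int i = ii; i < ii + TILE && i < ni; i++)
                for (int j = jj; j < jj + TILE && j < nj; j++)
                    out[j][i] = in[i][j];
}

Choosing TILE so that two TILE x TILE blocks of floats stay comfortably under the 32KiB L1 is a reasonable starting point; measure and adjust.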
Kernel 3: Complex Matrix-Vector Multiplication
Here we are doing a matrix multiplication (similar to the one in class) but it will be complex matrix multiplication.
\[\displaystyle Out_{i} = \displaystyle \sum_{j=0}^{N_{j}-1} In_{j} \times compArr_{i,j}\]
Hints
- The complex values are defined with a struct (see kernels.h). Think about the access pattern created when multiplying the values, due to this struct.
- compArr is a global variable that does not change. Can you think of a way to reformat it into a new data structure to avoid the nasty access pattern? (A sketch follows these hints.)
- Specifically, you can use the init() function to preprocess compArr, and feel free to add global variables.
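One classic fix is to convert the array-of-structs into a struct-of-arrays inside init(). The sketch below is heavily hedged: the struct fields (re, im), the 1000x1000 shape (taken from the sample report later in this handout), and all names are assumptions; check kernels.h for the real definitions.

/* All names and shapes here are assumptions -- the real struct and
   compArr live in kernels.h. */
typedef struct { float re, im; } complex_f;
complex_f compArr[1000][1000];        /* stand-in for the lab's global */

static float compRe[1000][1000];      /* new globals: split real parts   */
static float compIm[1000][1000];      /*              and imaginary parts */

void init(void) {
    /* One pass over the struct array; afterwards the multiply loop can
       read compRe/compIm with unit stride instead of strided struct fields. */
    for (int i = 0; i < 1000; i++)
        for (int j = 0; j < 1000; j++) {
            compRe[i][j] = compArr[i][j].re;
            compIm[i][j] = compArr[i][j].im;
        }
}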
Kernel 4: 3D Stencil
Stencil algorithms “apply” a stencil (sometimes called a filter) to an input array to produce an output array. Stencil methods are quite common in image processing and computer simulations.
This stencil algorithm involves 3 arrays:
- “Input”: This is a large 3D array of input floating point values, with dimension sizes Ni, Nj, Nk. It has some padding in each dimension, based on the size of the Stencil array.
- “Stencil”: This is a smaller array of floating point values, with size S in all dimensions, which gets shifted over the input.
- “Output”: This is the same size as the input array, but without padding.
Below is a visualization of the three arrays involved:
Intuitively, the stencil algorithm shifts the stencil to every position on the input grid. For each position, we perform a dot product of the cells of Stencil and Input, and the result is written to the corresponding position of the Output array.
In this lab, we’ll define the 3D stencil for each cell of the output as follows:
\[\displaystyle Out_{i,j,k} = \displaystyle \sum_{x=0}^{S-1} \sum_{y=0}^{S-1} \sum_{z=0}^{S-1} In_{i+x,\,j+y,\,k+z} \times Stencil_{x,y,z}\]
Below we depict the computation process required for the first three output cells along the innermost (k) dimension.
Note that there is significant data reuse between different computations. The stencil values are reused completely at each iteration. On the second step (for $Out_{0,0,1}$), all but one plane of the input cells used in the first step are reused.
Hints
- Create a distinct version of the code specifically for each problem size (Ni,Nj…). You may assume the problem sizes don’t change. When more of the variables are constants, the compiler can generate more efficient code.
- There is so much reuse that tiling is not so important here. (unless you really want to get extra credit, then maybe…)
- The loop ordering matters a lot, though. We really need to order the loops to get good temporal and spatial locality. (A sketch follows these hints.)
- It will help you play with the loop ordering if you zero-out the Output array in a separate loop.
- GCC is definitely capable of vectorizing this code. Make sure that the statistics indicate that you are executing vector instructions – if not, your inner loop might not be chosen well to get contiguous access.
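To make the hints concrete, here is one hedged sketch for the 128x128x128, S=8 case. The function name, array types, and padded input extent (guessed here as 128+8=136) are all assumptions to verify against kernels.h:

/* Specialized to Ni=Nj=Nk=128, S=8, so every loop bound is a literal
   constant the compiler can unroll and vectorize against. */
void stencil_128_8(const float in[136][136][136],
                   const float st[8][8][8],
                   float out[128][128][128]) {
    /* Zero the output in its own pass (per the hint above), so the
       accumulation loop nest below stays a pure += stream. */
    for (int i = 0; i < 128; i++)
        for (int j = 0; j < 128; j++)
            for (int k = 0; k < 128; k++)
                out[i][j][k] = 0.0f;

    /* k stays innermost: out[i][j][k] and in[i+x][j+y][k+z] are then both
       unit-stride, which is what GCC needs to emit vector instructions. */
    for (int i = 0; i < 128; i++)
        for (int j = 0; j < 128; j++)
            for (int x = 0; x < 8; x++)
                for (int y = 0; y < 8; y++)
                    for (int z = 0; z < 8; z++)
                        for (int k = 0; k < 128; k++)
                            out[i][j][k] += in[i+x][j+y][k+z] * st[x][y][z];
}

This is only one plausible ordering; trust the report’s vector-instruction and cache columns after each change rather than the sketch.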
Scoring your Performance / Energy
For performance, we measure execution time. For energy efficiency, we approximate it with “memory energy” for simplicity. Specifically, we run your program with performance counters, then multiply the number of accesses at each level (L1/L2/L3 and DRAM) by a somewhat arbitrary energy constant. The memory energy is defined as:
Mem Energy = #L1_Accesses * 0.05nJ + #L2_Accesses * 0.1nJ + #L3_Accesses * 0.5nJ + #DRAM_Accesses * 10.0nJ
To get your final score, we will run your code with several different sets of matrix sizes (in the report below, test case 1 is transpose, test case 2 is matrix-vector, and test cases 3-5 are stencil). To combine your performance and energy scores into a single metric, we first take the geometric mean of your per-test performance improvements and, separately, of your energy improvements. Then we compute the “energy-delay improvement”, which multiplies the two geometric means together:
\[energy\_delay\_improvement = geomean(performance\_improvements) \times geomean(energy\_improvements)\]
We will use the energy-delay improvement to calculate your score according to the following curve (note the opportunity for extra credit):
- Energy Delay Improvement = 10 -> Grade = 60
- Energy Delay Improvement = 20 -> Grade = 80
- Energy Delay Improvement = 30 -> Grade = 100
- Energy Delay Improvement = 110 -> Grade = 115
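The curve appears to be piecewise-linear between the listed points (an inference from the sample report later in this handout, not a stated rule). For example, the sample report’s energy-delay improvement of 22.78 falls between 20 and 30:
\[grade \approx 80 + \frac{22.78 - 20}{30 - 20} \times (100 - 80) = 85.6\]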
Rules
- All of your code should be in kernels.c. Any changes to the Makefile will be ignored. Do not rename any files.
- You may not write any assembly instructions.
- You may not use malloc or any other allocation functions.
- No external libraries may be used in kernels.c. (you can write your own functions of course). This also means you shouldn’t fork around.
- Please no hacking the server! We are trusting you enough to run your code, so please be nice. It is not an achievement to hack someone when they let you run your own code. : p
Downloading and running the Lab
You can use any lnxsrv machine to edit your files (e.g., keep using lnxsrv06 with VSCode), but you should run the code on lnxsrv07, since the cache profiling only works on that machine. In fact, having some people edit on lnxsrv06 helps spread the load between lnxsrv06 and lnxsrv07.
Step-1: Go to the lab website in your browser, click the “Request Lab” button, and fill in the requested information to get a tarball. Then secure-copy (upload) the tarball from your local machine to the SEAS machines, using the scp command on the command line or a file-transfer tool such as CyberDuck. In the terminal you can do the following:
scp <yourlocalfile> <yourusername>@lnxsrv07.seas.ucla.edu:~/<yourdir>
Step-2: Log onto the lnxsrv07 server either using a terminal or your favorite text editor (VScode, etc), then go into the directory where the tarball was uploaded. Then untar the tarball. Note: YOURNUMBER should be replaced with the actual number assigned to your tarball.
ssh <yourusername>@lnxsrv07.seas.ucla.edu
tar -xzf target-YOURNUMBER.tar.gz
cd target-YOURNUMBER
Step-3: Now you can start optimizing the code in kernels.c. Once you’re done optimizing, check the correctness and performance of your code by “making” it and running it:
make
./perflab
Make sure to run plain “make”, not “make perflab” or “make ./perflab”, because those will not compile everything (there’s another binary called “plain” that needs to get compiled as well).
You can also run a specific test input with the -i flag; for example, this runs test 2:
./perflab -i 2
The lab will generate a report for you that indicates your performance. If you run “locally” like above, you will run on lnxsrv07, which will only give you an approximate grade, since it is not the machine the grading server runs on.
Step-4: In order to be graded, you need to submit your work to the grading server. To do so, simply run:
make submit
The server will give you a similar set of performance statistics to what you get running locally. The submission will also be reflected on the scoreboard. We recommend testing your code for correctness on lnxsrv07, then submitting it for an accurate performance analysis. The grading server runs only one submission at a time, so there can be some delay before your code runs. You can see the approximate wait time of the submission queue in two ways: on the scoreboard, or by typing:
make ping
in your target directory. We recommend pinging the server to see the current workload before submitting; if the server is heavily loaded, please try to submit later.
Note:
- If you want a comment shown on the scoreboard, write it in the comment.txt file; please keep it below 25 characters. Overly long strings will be rejected by the server.
- We will count your best submission for your grade. You can submit multiple times, and we will take your best one. The score on the scoreboard is the score you will get (unless we find you didn’t follow the rules).
- The cache profiling code only works on lnxsrv07, as it is specific to the processor on that machine. Thus, we recommend using lnxsrv07.seas.ucla.edu for correctness and for testing the performance of your code before submitting. You can use any lnxsrv machine to log into for VSCode; load balancing is helpful, thank you!
- Make sure you save your progress often – things can go wrong at any time, and you always want a backup of previous versions of your code. Check out version control tools such as Git; a minimal example follows.
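For example, a minimal local Git workflow inside your target directory (standard Git commands; no remote server required):

git init
git add kernels.c comment.txt
git commit -m "working baseline"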
Interpreting Output
After you optimize your code a bit, you might see the following output:
T# Kernel | Ni Nj Nk S | Time(ms) CPE #Ins PE #VecInsPE | L1 KPE L2 KPE L3 KPE MEM KPE MemEn(j) | Speedup Energyup
0 Abs | 4096 - - - | 8.961 0.44 1.163 0.099 | 206.9 0.1 0.0 0.000 0.000400 | 1.0 1.0
1 Transp. | 10000 10000 - - | 257.702 5.15 4.144 0.000 | 2696.7 77.6 78.4 15.794 0.033974 | 2.3 2.6
2 MatVec. | 1000 1000 - - | 134.690 33.67 27.019 0.740 | 4065.2 1381.1 476.5 0.000 0.004637 | 3.2 3.0
3 Stencil | 128 128 128 8 | 169.998 0.32 0.631 0.244 | 313.0 5.3 3.3 0.027 0.019454 | 25.0 17.5
4 Stencil | 64 64 64 20 | 385.533 0.37 1.058 0.249 | 385.5 1.1 0.3 0.001 0.040971 | 20.8 31.0
5 Stencil | 4 4 1048576 2 | 309.817 4.62 1.774 0.245 | 506.1 332.4 245.5 14.562 0.043878 | 2.7 3.2
Geomean Speedup: 4.65
Geomean Energyup: 4.90
Energy-Delay Improvement: 22.78
Grade: 85.6
The first set of columns shows the test number and kernel name. The next set shows the problem parameters.
Then the next parameters are:
- Time(ms) – Total execution time
- CPE – Cycles per Element
- #InsPE – Instructions per element
- #VecInsPE – SIMD/vector instructions per element
- L1/L2/L3/MEM KPE – Number of L1/L2/L3/DRAM accesses per thousand elements
- MemEn(j) – “Memory Energy” in “Joules” (lol)
- Speedup – Speedup of your program compared to the original code run on our grading server
- Energyup – Energy Improvement of your program compared to the original code run on our grading server
After that come the geomean speedup and energy improvement, the energy-delay improvement, and the grade estimate, as discussed earlier.
Hints and Advice
Things you should do for sure, and like right away:
- Inline all the functions! This reduces function call overhead, and enables the compiler to reason about the program region.
- Create a special version of each kernel that is specific to the matrix dimensions in each test case. This has two benefits: the compiler can create simpler code because it knows what the matrix dimensions are, and you can also optimize each kernel differently. E.g., in the compute_stencil function, check the input arguments to see if they match one of the test cases, then call a version specific to that test case (see the dispatch sketch after this list).
- Try different loop orderings. Think about which ordering gives you the best spatial locality. Spatial locality is good for cache performance and the compiler will naturally vectorize loops with good locality.
- You can also try adding the “restrict” keyword in the first (outer) dimension of your array – this tells the compiler that the corresponding memory arrays don’t overlap, and could make the compiler’s job easier. I’ve seen this help a little on a couple of the stencil cases.
- Syntax:
func(float a[restrict N][N][N] ...)
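A hedged sketch of that dispatch idea (the parameter list of compute_stencil and the helper names are assumptions, not the lab’s real interface; check kernels.h):

/* Hypothetical general version -- in real code the full loop nest goes here. */
static void stencil_generic(int ni, int nj, int nk, int s,
                            const float *in, const float *st, float *out) {
    (void)ni; (void)nj; (void)nk; (void)s; (void)in; (void)st; (void)out;
}

/* Hypothetical specialized version. In real code you would paste the loop
   nest here with 128 and 8 as literal constants; delegating is shown only
   to keep the sketch short. */
static void stencil_128_8(const float *in, const float *st, float *out) {
    stencil_generic(128, 128, 128, 8, in, st, out);
}

void compute_stencil(int ni, int nj, int nk, int s,
                     const float *in, const float *st, float *out) {
    if (ni == 128 && nj == 128 && nk == 128 && s == 8)
        stencil_128_8(in, st, out);   /* dimensions become compile-time constants */
    else
        stencil_generic(ni, nj, nk, s, in, st, out);
}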
General Advice:
- Every time you try something, check the stats (#instructions, cache accesses, etc.). See if what you did makes sense in the context of those stats – that will help you build up an understanding of what is working and why. E.g., “I did loop tiling, and the number of L2 cache accesses went down because there’s better locality in L1” (or maybe it didn’t go down, meaning you didn’t tile correctly, or the tile size is wrong, etc.).
- You’re going to see a lot of performance variation on multiple different runs. This is unavoidable – we are running on a real machine where OS-level events can affect your performance, as well as contention for resources from other running programs.
- Sometimes, simpler is better for the compiler.
- Some optimizations can be done effectively by the compiler, so you should only do them yourself if the compiler is not doing a good job. If you try something and it doesn’t help, roll back and keep the simpler version! (PLEASE don’t show us your ugly code, that will make us very sad.)
- E.g. we showed you how to unroll, use separate accumulators etc… but the compiler is like, realllly good at these most of the time. Sometimes, unrolling can make things worse for the compiler because the code gets more complex.
- Overall, don’t forget to simplify when your optimizations don’t work. As Donald Knuth said, “premature optimization is the root of all evil.”
- Sometimes a recommended optimization (i.e. in the hints section) may slightly lower the performance for a particular case. This doesn’t mean it’s a bad optimization; it just means there was some interaction with your program and the compiler, or maybe there was some performance variability between runs. Keep going and implement the rest of the suggested optimizations, and you will get a good score. Also, you may need to play with the parameters of the optimization for it to work well (e.g. tile size).
- When lnxsrv07 is highly loaded, running locally will not produce accurate results – this will likely be the case near the deadline of this lab. However, our grading server is not running on lnxsrv07, and it will report accurate results (with some variation of course) – so you can use this to measure your performance if you like. However, there will be a wait queue, because only one person can run at a time.
- Start early, so you have enough time to finish and submit to the server; as the deadline approaches, there will likely be much longer wait times for submission.
Things you could do but probably shouldn’t do unless you’re a stubbornly curious person and you have time to burn:
- Manually unroll the code and/or use manual accumulators because you think you know better than the compiler. (But maybe sometimes you accidentally get lucky and do help the compiler.)
- Try to use tiling to improve temporal locality as well (for the stencil, I mean). Intuitively, tiling can prevent you from going down a really long dimension without getting any reuse in your caches. What should your tile size be? Try to calculate the size of all the data you access within some inner loops, and make sure this is less than the cache size you are targeting. For example, with 4-byte floats, if your inner loops sweep a $T_j \times T_k$ tile of the output plus roughly the corresponding $(T_j+S) \times (T_k+S)$ padded input cells per plane, that footprint should stay well under the 32KiB L1.
- The compiler will try to vectorize your code, and it’s kind of interesting to see why it did or didn’t vectorize. You can dump the (ugly) vectorization report by passing the flag -fopt-info-vec-all to GCC when compiling kernels.c. There is a comment in the Makefile that points out where to add the flag. Just look for the line number of your loop to see if it got vectorized or not. Sometimes it might vectorize an outer loop.
- Go look at the generated code by opening up the assembly of kernels.o (it’s in the build/ folder) using objdump or gdb (example below). You might find that certain program transformations help the compiler generate better code. However, looking at the assembly for these programs at high optimization levels may drive you to madness, so really proceed with caution.
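For example (objdump is part of standard binutils, so this should work as-is on the lnxsrv machines):

objdump -d build/kernels.o | less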
CPU Microarchitecture Parameters
The machine we’re using is an Intel(R) Xeon(R) CPU E5-2640 v2.
It shouldn’t be necessary to know the specific CPU properties to get a good grade; however, it could be useful under some advanced optimizations, and also just for understanding your results in general.
Compute
This machine supports 256-bit wide vector instructions called “AVX”. This means a single AVX instruction can operate on 8 floats at a time (8 * 32 bits = 256 bits).
Cache
The cache block size is 64 bytes. Here’s the size and associativity of the three levels of cache:
- L1: 32KiB & 8-way associativity
- L2: 256KiB & 8-way associativity
- L3: 20MiB & 16-way associativity