Performance Lab

Introduction

In this lab, your job is to optimize four simple “kernels” to make them faster and more energy efficient!

More specifically, given suboptimal implementations of the four kernels below, your goal is to use the optimization techniques you learned in class to speed up the calculations and improve “energy efficiency”. In this class, we approximate energy efficiency as the reduction in data movement across the levels of the memory hierarchy (i.e., you should try to hit in the L1 cache as much as possible).

You’ll have to use some of the optimization techniques covered in class, especially the cache and memory optimizations.

Kernel 1: Compute absolute value (warm up)

Compute the elementwise absolute value of a 1D floating point vector.
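
For reference, a straightforward (unoptimized) version might look like the sketch below. The function name and signature here are illustrative; the real ones are in kernels.c.

/* Illustrative sketch only -- see kernels.c for the real signature.
 * Computes out[i] = |in[i]| for a 1D float vector. */
void abs_kernel(const float *in, float *out, int n) {
  for (int i = 0; i < n; i++)
    out[i] = (in[i] < 0.0f) ? -in[i] : in[i];
}

A branch-free formulation tends to vectorize better, which is worth keeping in mind given the AVX support described at the end of this handout.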

Hints

Kernel 2: 2D Transpose

In 2D transpose, we take a 2D input matrix (Ni by Nj), and swap its rows and columns to create a 2D output matrix (Nj by Ni).

We can define the transpose mathematically as follows:

\[Out_{i,j} = In_{j,i}\]
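
To make the definition concrete, below is a hedged sketch of a cache-blocked transpose in plain C. The block size BLK, the row-major layout, and the signature are assumptions; adapt them to the arrays actually declared in kernels.c.

#define BLK 32  /* assumed tile size; tune so a tile fits in L1 */

/* In is Ni x Nj, Out is Nj x Ni, both assumed row-major. */
void transpose(const float *in, float *out, int Ni, int Nj) {
  for (int ii = 0; ii < Ni; ii += BLK)
    for (int jj = 0; jj < Nj; jj += BLK)
      for (int i = ii; i < ii + BLK && i < Ni; i++)
        for (int j = jj; j < jj + BLK && j < Nj; j++)
          out[j * Ni + i] = in[i * Nj + j]; /* Out[j][i] = In[i][j] */
}

Tiling lets each BLK-by-BLK block of In and the matching block of Out stay resident in the L1 cache while they are being touched, which is exactly the kind of data-movement reduction the energy score rewards.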

Hints

Kernel 3: Complex Matrix-Vector Multiplication

Here we are doing a matrix-vector multiplication (similar to the one from class), except the matrix and vector contain complex numbers.

\[\displaystyle Out_{i} = \displaystyle \sum_{j=0}^{N_{j}} In_{j} \times compArr_{i,j}\]
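
To illustrate the arithmetic, here is a hedged sketch that assumes the complex numbers are stored as interleaved (real, imaginary) float pairs; the actual representation and signature are defined in kernels.c.

/* Sketch: Out[i] = sum_j In[j] * compArr[i][j], with complex floats
 * stored as interleaved (real, imaginary) pairs (an assumed layout). */
void cmatvec(const float *compArr, const float *in, float *out,
             int Ni, int Nj) {
  for (int i = 0; i < Ni; i++) {
    float re = 0.0f, im = 0.0f;
    for (int j = 0; j < Nj; j++) {
      float ar = compArr[2 * (i * Nj + j)];     /* Re(compArr[i][j]) */
      float ai = compArr[2 * (i * Nj + j) + 1]; /* Im(compArr[i][j]) */
      float br = in[2 * j];                     /* Re(In[j]) */
      float bi = in[2 * j + 1];                 /* Im(In[j]) */
      re += ar * br - ai * bi; /* real part of the product */
      im += ar * bi + ai * br; /* imaginary part of the product */
    }
    out[2 * i] = re;
    out[2 * i + 1] = im;
  }
}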

Hints

Kernel 4: 3D Stencil

Stencil algorithms “apply” a stencil (sometimes called a filter) to an input array to produce an output array. Stencil methods are quite common in image processing and computer simulations.

This stencil algorithm involves three arrays: the Input, the Stencil, and the Output.

Below is a visualization of the three arrays involved:

Intuitively, the stencil algorithm shifts the stencil to every position on the input grid. At each position, we perform a dot product of the cells of Stencil and Input, and the result is written to the corresponding position of the Output array.

In this lab, we’ll define the 3D stencil for each cell of the output as follows:

\[\displaystyle Out_{i,j,k} = \displaystyle \sum_{x=0}^{S} \sum_{y=0}^{S} \sum_{z=0}^{S} In_{i + x,j +y, k + z} \times Stencil_{x,y,z}\]

Below we depict the computation process required for the first three pixels along the x dimension.

Note that there is significant data reuse between different computations. The stencil values are reused completely at every position. On the second step (for $Output_{0,0,1}$), half of the cells used in the first step are reused.
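
For concreteness, here is a hedged sketch of the naive computation, assuming row-major layouts and an output of size (Ni-S+1) x (Nj-S+1) x (Nk-S+1); check kernels.c for the real bounds and signature.

/* Naive 3D stencil sketch; all layouts and bounds are assumptions. */
void stencil3d(const float *in, const float *stencil, float *out,
               int Ni, int Nj, int Nk, int S) {
  int Oi = Ni - S + 1, Oj = Nj - S + 1, Ok = Nk - S + 1;
  for (int i = 0; i < Oi; i++)
    for (int j = 0; j < Oj; j++)
      for (int k = 0; k < Ok; k++) {
        float acc = 0.0f;
        for (int x = 0; x < S; x++)
          for (int y = 0; y < S; y++)
            for (int z = 0; z < S; z++)
              acc += in[(i + x) * Nj * Nk + (j + y) * Nk + (k + z)]
                   * stencil[x * S * S + y * S + z];
        out[i * Oj * Ok + j * Ok + k] = acc;
      }
}

The reuse noted above is what loop tiling can exploit: neighboring output cells share most of their input window, and the stencil array itself is small enough to stay resident in L1 across all positions.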

Hints

Scoring your Performance / Energy

For performance, we measure execution time. For energy efficiency, we approximate it with “memory energy” for simplicity. Specifically, we run your program with performance counters, and then multiply each access to L1/L2/L3 and DRAM by a somewhat arbitrary per-access energy constant. The memory energy is defined as:

Mem Energy = #L1_Accesses * 0.05nJ + #L2_Accesses * 0.1nJ + #L3_Accesses * 0.5nJ + #DRAM_Accesses * 10.0nJ
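
In code form, the formula is just a weighted sum of the counter values (a sketch for illustration; the grader’s actual implementation is not part of this handout):

/* Memory energy in nanojoules, using the per-access costs above. */
double mem_energy_nj(long l1, long l2, long l3, long dram) {
  return 0.05 * l1 + 0.1 * l2 + 0.5 * l3 + 10.0 * dram;
}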

To get your final score, we will run your code on the test cases shown in the sample output under “Interpreting Output” below (test 0 is the absolute-value warmup, test 1 is transpose, test 2 is matrix-vector, and tests 3-5 are stencil). To combine your performance and energy scores into a single metric, we first take the geometric mean of your performance improvements and the geometric mean of your energy improvements. Then we compute the “energy delay improvement”, which multiplies the two geometric means together:

\[energy\_delay\_improvement = geomean(performance\_improvements) \times geomean(energy\_improvements)\]
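
For example, in the sample output under “Interpreting Output” below, a geomean speedup of 4.65 and a geomean energy improvement of 4.90 combine to 4.65 * 4.90 ≈ 22.78. Here is a sketch of the combination (for illustration only; geomean and energy_delay are not part of the lab code):

#include <math.h>

/* Geometric mean of n positive improvement scores. */
double geomean(const double *v, int n) {
  double log_sum = 0.0;
  for (int i = 0; i < n; i++)
    log_sum += log(v[i]);
  return exp(log_sum / n);
}

/* energy_delay_improvement = geomean(speedups) * geomean(energyups) */
double energy_delay(const double *speedup, const double *energyup, int n) {
  return geomean(speedup, n) * geomean(energyup, n);
}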

We will use the energy-delay improvement to calculate your score according to the following curve (note the opportunity for extra credit):

Rules

  1. All of your code should be in kernels.c. Any changes to the Makefile will be ignored. Do not rename any files.
  2. You may not write any assembly instructions.
  3. You may not use malloc or other allocation functions.
  4. No external libraries may be used in kernels.c (you can write your own functions, of course). This also means you shouldn’t fork around.
  5. Please no hacking the server! We are trusting you enough to run your code, so please be nice. It is not an achievement to hack someone when they let you run your own code. : p

Downloading and running the Lab

You can use any lnxsrv machine to edit your files (e.g., keep using lnxsrv06 with VS Code). But for running the code, you should use lnxsrv07 for this lab, as the cache profiling only works on that machine. Having some/most people edit on lnxsrv06 would actually be helpful, because it spreads the load between lnxsrv06 and lnxsrv07.

Step-1: Go to the lab website in your browser. Click the “Request Lab” button on the webpage and enter the requested information to get a tarball. Then secure copy (upload) the tarball from your local machine to the SEAS machines, using the scp command on the command line or another file-transfer tool such as CyberDuck. In the terminal you can do the following:

scp <yourlocalfile> <yourusername>@lnxsrv07.seas.ucla.edu:~/<yourdir>

Step-2: Log onto the lnxsrv07 server using either a terminal or your favorite text editor (VS Code, etc.), then go into the directory where the tarball was uploaded and untar it. Note: YOURNUMBER should be replaced with the actual number assigned to your tarball.

ssh <yourusername>@lnxsrv07.seas.ucla.edu
tar -xzf target-YOURNUMBER.tar.gz 
cd target-YOURNUMBER

Step-3: Now you can start optimizing the code in kernels.c. Once you’re done optimizing, you can check the correctness and performance of your code by “making” it and running the following commands:

make 
./perflab

Make sure to run plain “make” rather than “make perflab” or “make ./perflab”, because those will not compile everything (there’s another binary called “plain” that needs to get compiled as well).

You can also run a specific test input with the -i flag; for example, this runs test 2:

./perflab -i 2

The lab will generate a report that indicates your performance. If you run “locally” like above, you will run on lnxsrv07, which will only give you an approximate grade, since it is not the machine the grading server runs on.

Step-4: In order to be graded, you need to submit your work to the grading server. To do so, simply run:

make submit

The server will give you a version of the performance statistics similar to what you get running locally. This new submission will also be reflected on the scoreboard. We recommend testing your code for correctness on lnxsrv07, and then submitting your code for grading to get an accurate performance analysis. The grading server only runs one submission at a time on the machine, so there can be some delay before we run your code. The approximate wait time for the submission queue can be seen in two ways: (1) on the scoreboard, or (2) by typing:

make ping

in your target directory. We recommend pinging the server to check the current workload before submitting; if the server is heavily loaded, please try to submit later.

Note:

  1. If you want a comment to be shown on the scoreboard, write it in the comment.txt file, and please keep it below 25 characters. If you attempt to insert a super long string, the server will reject it.
  2. We will count your best submission toward your grade. You can submit multiple times, and we will take the best one. The score on the scoreboard is the score you will get (unless we find you didn’t follow the rules).
  3. The cache profiling code only works on lnxsrv07, as it is specific to the processor on that machine.
    Thus, we recommend using lnxsrv07.seas.ucla.edu for correctness and for testing the performance of your code before submitting.
    You can use any lnxsrv machine to log into for VSCODE. Load balancing is helpful, thank you!
  4. Make sure you save your progress often; things can go wrong at any time, and you always want a backup of previous versions of your code. Check out version-control tools, such as Git.

Interpreting Output

After you optimize your code a bit, you might see the following output:

T#  Kernel |      Ni    Nj      Nk  S |   Time(ms)    CPE  #Ins PE  #VecInsPE |  L1 KPE  L2 KPE  L3 KPE MEM KPE MemEn(j) | Speedup Energyup
 0     Abs |    4096     -       -  - |      8.961   0.44    1.163      0.099 |   206.9     0.1     0.0   0.000 0.000400 |     1.0      1.0
 1 Transp. |   10000 10000       -  - |    257.702   5.15    4.144      0.000 |  2696.7    77.6    78.4  15.794 0.033974 |     2.3      2.6
 2 MatVec. |    1000  1000       -  - |    134.690  33.67   27.019      0.740 |  4065.2  1381.1   476.5   0.000 0.004637 |     3.2      3.0
 3 Stencil |     128   128     128  8 |    169.998   0.32    0.631      0.244 |   313.0     5.3     3.3   0.027 0.019454 |    25.0     17.5
 4 Stencil |      64    64      64 20 |    385.533   0.37    1.058      0.249 |   385.5     1.1     0.3   0.001 0.040971 |    20.8     31.0
 5 Stencil |       4     4 1048576  2 |    309.817   4.62    1.774      0.245 |   506.1   332.4   245.5  14.562 0.043878 |     2.7      3.2
Geomean Speedup: 4.65
Geomean Energyup: 4.90
Energy-Delay Improvement: 22.78
Grade: 85.6

The first set of columns shows the test number and kernel name, and the next set shows the problem parameters (Ni, Nj, Nk, S). The remaining columns report the execution time, cycles per element (CPE), instructions and vector instructions per element, accesses to each cache level and to DRAM (the KPE columns), the resulting memory energy, and finally your speedup and energy improvement relative to the baseline implementation.

After the table come the geomean performance and energy improvements, the energy-delay improvement, and the grade estimate, as discussed earlier.

Hints and Advice

Things you should do for sure, and like right away:

General Advice:

Things you could do but probably shouldn’t do unless you’re a stubbornly curious person and you have time to burn:

CPU Microarchitecture Parameters

The machine we’re using is an Intel(R) Xeon(R) CPU E5-2640 v2.

It shouldn’t be necessary to know the specific CPU properties to get a good grade; however, it could be useful under some advanced optimizations, and also just for understanding your results in general.

Compute

This machine supports 256-bit wide vector instructions called “AVX”. A single AVX instruction operates on 8 floats at once (8 * 32 bits = 256 bits), making it possible to perform 8 float multiply-accumulates per cycle.

Cache

The cache block size is 64 bytes. Here’s the size and associativity of the three levels of cache (per the published specs for this processor):

L1 data cache: 32 KB per core, 8-way set associative
L2 cache: 256 KB per core, 8-way set associative
L3 cache: 20 MB shared, 20-way set associative