The goal of the assignment is to get some experience working with data-parallel hardware and programming models, as well as the basic parallelization/locality aspects of deep neural network computations.
Due: April 28th, EOD
The recommended group size is 2 (pair programming can work well), but you can work in groups of 1-4. You’ll use Canvas’s groups feature to submit your assignment.
You should have gotten an email for setting up an account on tetracosa.cs.ucla.edu. Please don’t use more than 50GB, or we will run out of space!
After setting up, you may want to look at some samples from NVIDIA; these can be set up by running:
git clone https://github.com/nvidia/cuda-samples
cd cuda-samples
git checkout cd3bc1fa8e949ca016c6396c47124fdcfd75fb4b
make -j24
Also, make sure “/usr/local/cuda/bin” is in your PATH environment variable.
For a simple starting code, you can look at:
vim Samples/0_Simple/vectorAdd
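If you haven’t written CUDA before, the core of that sample boils down to roughly the following (a sketch, not the sample’s exact code):

// One thread per element; grid/block sizes chosen so that all n elements are covered.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];
}
// launched as, e.g.: vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

The host side allocates device buffers with cudaMalloc, copies data over with cudaMemcpy, launches the kernel, and copies the result back.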
You can run “deviceQuery” (in the CUDA samples under 1_Utilities) to find various GPU parameters, which can help with optimizing or analyzing performance.
./Samples/1_Utilities/deviceQuery/deviceQuery
This is a good practice tutorial to start with, and it also has links to a number of other useful tutorials.
The official programming guide is useful for in-depth explanations of all the features.
Implement and evaluate a CUDA version of a “Conv2D” convolution kernel (a 2D convolution extended in the channel dimension) and a “classifier” kernel (i.e. a fully connected layer / matrix-vector multiply). Use these parameters from VGG:
Param Definitions:
You may try either batch size = 1 or batch size = 16. My experience is that it is easier to get high performance with batch size 16 (better weight reuse), but harder to be competitive with CUDNN.
A possible starting point is the set of kernels in the fp-diannao repo, which are simple CPU implementations based on the DianNao data layouts (but with fp datatypes). You definitely don’t have to use this code! Also, note that it does direct convolution, which the CUDNN library basically never chooses.
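For reference, here is a very naive (completely unoptimized) sketch of the two kernels, with one thread per output element. The parameter names (Ni, Nn, Nx, Ny, Kx, Ky) follow the DianNao-style convention; the data layouts used here are assumptions, not the fp-diannao interface, so adjust the indexing to whatever layout you pick:

// Direct Conv2D: input [Ni][Ny+Ky-1][Nx+Kx-1], weights [Nn][Ni][Ky][Kx],
// output [Nn][Ny][Nx] (assumed layouts, batch size 1, no stride, pre-padded input).
__global__ void conv2d_naive(const float* in, const float* wt, float* out,
                             int Nx, int Ny, int Ni, int Nn, int Kx, int Ky) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;   // output column
  int y = blockIdx.y * blockDim.y + threadIdx.y;   // output row
  int n = blockIdx.z;                              // output channel
  if (x >= Nx || y >= Ny || n >= Nn) return;
  int inW = Nx + Kx - 1, inH = Ny + Ky - 1;
  float acc = 0.0f;
  for (int i = 0; i < Ni; ++i)                     // input channels
    for (int ky = 0; ky < Ky; ++ky)
      for (int kx = 0; kx < Kx; ++kx)
        acc += in[(i * inH + (y + ky)) * inW + (x + kx)] *
               wt[((n * Ni + i) * Ky + ky) * Kx + kx];
  out[(n * Ny + y) * Nx + x] = acc;
}

// Classifier: y = W * x, with W assumed [Nn][Ni] row-major, one thread per output neuron.
__global__ void classifier_naive(const float* W, const float* x, float* y,
                                 int Ni, int Nn) {
  int n = blockIdx.x * blockDim.x + threadIdx.x;
  if (n >= Nn) return;
  float acc = 0.0f;
  for (int i = 0; i < Ni; ++i)
    acc += W[(size_t)n * Ni + i] * x[i];
  y[n] = acc;
}

Most of the assignment is then about improving on something like this: choosing block shapes, exploiting shared memory and register reuse, and (if you batch) adding the batch dimension to the parallelization.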
In your report, please address the following:
What was your basic parallelization strategy? Are there limitations in scaling this strategy, or consequences for throughput or latency?
What is the execution time of each kernel? (and throughput if you use batching)
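A simple way to measure kernel time is with CUDA events; a sketch follows, where the kernel name and launch configuration are placeholders for your own:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
conv2d_naive<<<grid, block>>>(d_in, d_wt, d_out, Nx, Ny, Ni, Nn, Kx, Ky);  // your kernel here
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %.3f ms\n", ms);

With batching, throughput is simply batch size divided by elapsed time. nvprof also reports per-kernel times in its summary if you prefer not to instrument the code.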
What do you suspect is the limiting factor for the performance of each kernel (compute, DRAM, scratchpad)? Where does the implementation fall on the roofline model for your particular GPU? Please graph the results.
For this, you can/should measure the relevant bandwidths using nvprof, e.g.:
nvprof -m dram_read_throughput ./vectorAdd
A good description of other b/w measurements is here: https://stackoverflow.com/questions/37732735/nvprof-option-for-bandwidth
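For a roofline point you need both an achieved-FLOP/s estimate and the achieved DRAM bandwidth; a couple of other metrics that nvprof exposes are sketched below (metric names can vary slightly with your CUDA version, and ./yourBinary is a placeholder):

nvprof -m dram_read_throughput,dram_write_throughput ./yourBinary
nvprof -m flop_count_sp ./yourBinary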
How does the implementation compare with CUDNN? (use same batch size in CUDNN) Please graph the results.
You can use CUDNN directly. With only minimal back-and-forth, I was able to get ChatGPT to give me working code that timed a kernel (ask for the newer graph API).
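If you would rather not rely on generated code, here is a rough sketch of timing a forward convolution with the older (pre-graph) CUDNN API; error checking is omitted, the device pointers and the hard-coded algorithm choice are assumptions, and you should substitute your own layer dimensions:

#include <cuda_runtime.h>
#include <cudnn.h>
#include <cstdio>

// N,C,H,W = input dims; K = output channels; R,S = filter height/width.
// d_x, d_w, d_y are device buffers assumed to be allocated and filled already.
void time_cudnn_conv(int N, int C, int H, int W, int K, int R, int S,
                     float* d_x, float* d_w, float* d_y) {
  cudnnHandle_t h;                   cudnnCreate(&h);
  cudnnTensorDescriptor_t xD, yD;    cudnnCreateTensorDescriptor(&xD);
                                     cudnnCreateTensorDescriptor(&yD);
  cudnnFilterDescriptor_t wD;        cudnnCreateFilterDescriptor(&wD);
  cudnnConvolutionDescriptor_t cD;   cudnnCreateConvolutionDescriptor(&cD);

  cudnnSetTensor4dDescriptor(xD, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, N, C, H, W);
  cudnnSetFilter4dDescriptor(wD, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, K, C, R, S);
  cudnnSetConvolution2dDescriptor(cD, 0, 0, 1, 1, 1, 1,       // no padding, stride 1
                                  CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

  int n, c, ho, wo;
  cudnnGetConvolution2dForwardOutputDim(cD, xD, wD, &n, &c, &ho, &wo);
  cudnnSetTensor4dDescriptor(yD, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, ho, wo);

  // Fixed algorithm choice for simplicity; you may want to try others.
  cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;
  size_t wsBytes = 0;
  cudnnGetConvolutionForwardWorkspaceSize(h, xD, wD, cD, yD, algo, &wsBytes);
  void* d_ws = nullptr;  cudaMalloc(&d_ws, wsBytes);

  float alpha = 1.0f, beta = 0.0f;
  cudaEvent_t t0, t1;  cudaEventCreate(&t0);  cudaEventCreate(&t1);
  cudaEventRecord(t0);
  cudnnConvolutionForward(h, &alpha, xD, d_x, wD, d_w, cD, algo,
                          d_ws, wsBytes, &beta, yD, d_y);
  cudaEventRecord(t1);  cudaEventSynchronize(t1);
  float ms;  cudaEventElapsedTime(&ms, t0, t1);
  printf("CUDNN conv: %.3f ms\n", ms);
  cudaFree(d_ws);
}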
One way to do this without messing with CUDNN is to use DeepBench from Baidu. You should be able to modify “code/kernels/conv_problems.h” or “gemm_problems.h” for whatever parameters you are using for batch size. Then compile and run the tests in the “code/nvidia” folder. (all_reduce won’t compile, but this shouldn’t be a problem)
I’ve gathered some results on the V100 machine for interesting kernel sizes, and the results are here. You may use these if you like.
What optimizations did you find most useful/least useful?
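As one concrete example of the kind of optimization worth trying and reporting on: stage the current output channel’s filter weights in shared memory so every thread in the block reuses them from on-chip storage. The fragment below assumes the same names and layout as the naive conv kernel sketched earlier, and W_TILE is a hypothetical compile-time bound of at least Ni*Ky*Kx floats:

__shared__ float wTile[W_TILE];
int t   = threadIdx.y * blockDim.x + threadIdx.x;
int nth = blockDim.x * blockDim.y;
for (int e = t; e < Ni * Ky * Kx; e += nth)        // cooperative load of this channel's weights
  wTile[e] = wt[(size_t)n * Ni * Ky * Kx + e];
__syncthreads();
// ...the inner loops then read wTile[(i * Ky + ky) * Kx + kx] instead of wt[...]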
Turn in the CUDA kernels as attachments in the submission. Don’t zip everything together, though; make sure the source and the PDF are turned in separately.