Due: April 28th, EOD
Group size: 1–4 (recommended: 2)
The goal of this assignment is to gain hands-on experience with GPU programming and to understand the compute and memory characteristics of two important deep learning kernels: convolution and attention.
You will implement CUDA kernels for both, profile their performance, and analyze where they fall on the roofline model.
You should have received an email with account access to the course GPU machine.
Make sure /usr/local/cuda/bin is in your PATH. You are also welcome to use any
other GPU you have access to — a personal machine, Google Colab, or a cloud instance
all work fine.
For an introduction to CUDA, this is a good starting point:
https://developer.nvidia.com/blog/even-easier-introduction-cuda/
The official programming guide covers GPU features in depth:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/
To query your GPU’s parameters (peak FLOPS, memory bandwidth, SM count, etc.), build and run the NVIDIA CUDA samples:
git clone https://github.com/nvidia/cuda-samples
cd cuda-samples
git checkout cd3bc1fa8e949ca016c6396c47124fdcfd75fb4b
make -j24
./Samples/1_Utilities/deviceQuery/deviceQuery
Implement CUDA kernels for the following two workloads. For each, start with a naive implementation that gets correct results, then iteratively apply optimizations and measure the impact of each. Example optimizations to consider:
__shfl_sync for reductions within a warp without
going through shared memoryYou don’t need to try all of these — pick what seems promising for your kernel and explain what you tried and why. You are not graded on how fast your kernel is, but squeezing out higher utilization is its own reward.
Implement a CUDA kernel for direct 2D convolution with the following VGG configurations:
| Layer | Ny × Nx | Ky × Kx | Ni | Nn | Stride |
|---|---|---|---|---|---|
| Conv1 | 224×224 | 3×3 | 64 | 64 | 1 |
| Conv2 | 14×14 | 3×3 | 512 | 512 | 1 |
Pick either batch size 1 or batch size 16 and use it consistently. Batch 16 tends to be easier to get high utilization from (better weight reuse), but batch 1 is simpler and can be more representative of inference scenarios.
The data layout used in the reference implementation is:
[Ky][Kx][Nn][Ni][B][NYPAD][NXPAD][Ni] (zero-padded)[B][NYSCL][NXSCL][Nn]Implement CUDA kernels for single-head attention with head dimension D=64. This represents one head of a typical multi-head attention layer. Implement both the prefill case (S queries attending causally to S keys) and the decode case (a single new query attending to a KV cache of size C):
| Case | Sizes |
|---|---|
| Prefill | S = 4096, 65536 |
| Decode | C = 4096, 65536 |
The reference implementation provides standard_prefill, flash_prefill, and
standard_decode. You may implement either standard or flash attention for your GPU
kernel. Note that for the larger context size (S=65536), standard attention may not
fit in GPU memory — the provided flash attention implementation may serve as a useful
reference in that case.
For both kernels, you are free to change the data layout, use any CUDA features (shared memory, tensor cores, warp-level intrinsics), or restructure the computation.
A CPU reference implementation for both kernels is provided at github.com/PolyArch/cs259-miniproj-ref. Clone it, build, and run to see expected outputs:
git clone git@github.com:PolyArch/cs259-miniproj-ref.git
cd cs259-miniproj-ref
make
./conv
./attention
ncu is NVIDIA’s GPU kernel profiler (the modern replacement for the deprecated
nvprof). It collects hardware performance counters for a single kernel launch and
produces detailed breakdowns of compute throughput, memory throughput, occupancy,
and bottlenecks.
# Profile all kernels in your binary
ncu ./my_program
# Profile only a specific kernel by name
ncu --kernel-name my_kernel_name ./my_program
# Full default report (good starting point -- covers compute, memory, occupancy)
# Also generates a roofline chart when opened in ncu-ui
ncu --set full ./my_program
For more targeted queries, you can pass individual metrics with --metrics.
Some useful ones:
| Metric | What it measures |
|---|---|
dram__bytes_read.sum |
Total bytes read from DRAM (HBM) for the kernel |
dram__bytes_write.sum |
Total bytes written to DRAM (HBM) for the kernel |
l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum |
Bytes requested from L1 for global loads (before the cache) |
lts__t_bytes_equiv_l1sectormiss_pipe_lsu_mem_global_op_ld.sum |
Bytes that missed L1 and went to L2 |
sm__throughput.avg.pct_of_peak_sustained_elapsed |
Overall SM utilization as % of peak |
gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed |
DRAM bandwidth utilization as % of peak |
sm__warps_active.avg.pct_of_peak_sustained_active |
Occupancy: active warps as % of the SM maximum |
smsp__sass_thread_inst_executed_op_ffma_pred_on.sum |
FP32 fused multiply-add instructions executed (1 FMA = 2 FLOPs) |
Example:
ncu --metrics dram__bytes_read.sum,dram__bytes_write.sum,smsp__sass_thread_inst_executed_op_ffma_pred_on.sum \
./my_program
# Save a report to a file
ncu --set full -o my_report ./my_program
# Copy the report to your local machine and open in the Nsight Compute GUI
# Download: https://developer.nvidia.com/tools-overview/nsight-compute/get-started
ncu-ui my_report.ncu-rep
A full list of available metrics and what they measure is in the Nsight Compute Kernel Profiling Guide.
The roofline model plots achieved performance (GFLOPS) against arithmetic intensity
(FLOPs/byte), with a diagonal memory-bandwidth ceiling and a horizontal compute
ceiling. A kernel below the diagonal is memory-bound; above it, compute-bound.
For a primer, see this NVIDIA blog post
or the Nsight Compute roofline documentation
(note: ncu --set full generates a roofline chart automatically when opened in ncu-ui).
The two numbers most useful for placing your kernel on the roofline are:
dram__bytes_read.sum and
dram__bytes_write.sum divided by kernel durationsmsp__sass_thread_inst_executed_op_ffma_pred_on.sum
(counts FP32 FMAs, each counting as 2 FLOPs) divided by kernel durationThe ratio of FLOPs to DRAM bytes gives you the arithmetic intensity. Compare it to the ridge point of your GPU’s roofline (peak FLOPS / peak memory bandwidth) to determine which resource is the bottleneck.
Present data for all 6 workloads (Conv1, Conv2, prefill S=4096, prefill S=65536, decode C=4096, decode C=65536). Answer the questions for each kernel type you implemented — discussing how behavior differs across the configurations within it.
What is your parallelization strategy? What problem dimensions did you map to blocks and threads? Are there limitations to this approach — for example, does it scale well with batch size, context length, or channel count?
Compute the algorithmic FLOP count for each configuration. Show your derivation.
For convolution, count multiply-add operations over the kernel and channel dimensions. For attention, account for the QK dot products and the weighted sum over values — note that prefill involves a triangular (causal) access pattern.
Report your results in GFLOPs.
What is the measured execution time for each configuration? What is the achieved GFLOPS (using your algorithmic FLOP count from Q2)?
Plot your results on the roofline model for your GPU. For each kernel configuration:
ncu (dram__bytes_read.sum,
dram__bytes_write.sum) and compute the achieved arithmetic intensity from thatHow do the theoretical and measured arithmetic intensities compare? What does the difference tell you, and what does your roofline placement say about the potential for further optimization?
What optimizations did you try? Which had the most impact, and which had little or no effect? If an optimization did not help, explain why.
Turn in via Canvas using your group submission.