Due: May 15th, 11:59pm
Teams up to 4 allowed. (you can switch teams if needed)
Performance models are extremely useful to understand and optimize real
systems, as well as to explore novel designs.
Relatively simple performance modeling modeling is possible for the
dense versions of DNN Layers layers you studied. There’s no complex data structures,
and it’s easy to reason about parallelism and reuse by breaking a large problem into
smaller versions of that problem.
The goal of the assignment is to get familiar with modeling techniques, by using models to predict performance of accelerator hardware.
This is what you’ll do in this project at a high level:
In the next sections, I’ll describe what I consider to be a default project, and a later section will describe alternatives or tweaks that are acceptable.
To get experience with modeling, we need to choose some interesting-enough accelerator, and we need to make sure we can validate the model (at least to some degree). The default project is to model your own convolution kernel implementation, running on tetracosa on the TitanV GPU.
The interestingness of this project is somewhat dependent on the optimization-level
of your code … i.e. if your code is super slow, you might be artifically bottlenecked by problems that
could be solved relatively easily in software.
I think this is not actually a real problem:
my take is by modeling the performance of your implementation, you’ll see why your code is slow,
and you’ll want to go and change it – optimizing code is fun! Feel free to tweak and
codesign your code with your model.
Models come in many forms, from mechanistic/analytical models to regression based models. By default, I strongly suggest building mechanistic model. A mechanistic model uses parameters and internal variables that have meaning corresponding to software and hardware behavior. For example, you might have a parameter that is the L2 cache bandwidth, and this is used in an equation to compute the operational-intensity on the L2 cache.
The advantage for us is that a mechanistic model can tell you something about the underlying
system, which is the purpose of this assignment.
For example, by tuning the memory bandwidth parameter, you may observe whether and to what
degree the “peak” memory bandwidth can be obtained on a particular workload.
Inputs: The input interface of the model should be quite simple. At a minimum, your model should take as inputs:
To be clear, what I mean by tile parameters is the parameters that determine
For example, you might have tile parameters that determine how to split the height/width/input-channels/output-channels to GPU cores. Those same parameters might be used to determine what data is stored in scratchpad, etc.
Outputs: As for the output, please compute:
Parameters:I also urge you to use parameters for your model that have meaning: The model should take as input some hardware specific inputs, including whatever you deem to be important. E.g.:
Here’s a good resource for finding out about our GPU. (we have a TitanV, but the V100 is quite similar, besides the DRAM bandwidth)
If you created different versions of your code in mini-proj 1, you are only required to focus on one of them. I’d pick your best / most well-rounded version.
** Please Note: It is acceptable that your code does not have to work on all problem sizes, tile sizes, or combinations of tile/problem size. Just make sure your code works on an interesting range of problem sizes and tile sizes, so you can run experiments. **
The form of this model is up to you. However, one approach is to perform a bottleneck analysis: i.e. calculate the compute bound time (execution time given a perfect memory system), and the memory bound time (execution time given perfect buffer and compute bandwidth). Then simply take the maximum of the above execution times. Note that you’ll need to do this hierarchically for every level of the memory system: L1/SPAD, L2, DRAM, thinking about the tile sizes that are mapped to each level. To do this, you’ll need to reason about what data fits in the buffer, given the tiling that you performed (either implicitly in cache, or explicitly in scratchpad).
One starting point is a “roofline” model, which is a highly optimistic model that you essentially already created in mini-proj 1. A “roofline model” just means: assume the best possible performance given the ideal-operational-intensity. Another way to say that is: compute DRAM-bound time (assuming perfect reuse) and compute-bound time (assuming perfect computation), and take the minimum.
From the basic roofline model, I suggest adding models for other levels of memory hierarchy. After that, consider capturing other sorts of microarchitectural behavior, possibly but not definitely including, and definitely not limited to:
All of that said, don’t worry about having a complex model. Your goal is to create a model with just enough complexity to have good accuracy.
You will validate your model against the execution times from real hardware as the problem size and different tiling factors change. Your report should demonstrate robustness of your model for at least a few different tiling dimensions – thus, try different combinations for tile height, width, input channels, output channels.
You may use the validation metric of your choice, e.g. mean absolute percentage error.
Your baseline for this work is the very simple “roofline” model discussed above. Your goal is to get a better performance prediction than that – hopefully much better!
For the report, please include the following, and make sure to use easy-to-read graphs in your analysis:
Explanation: Explain the model you created (e.g. some psuedo code or equations), including rationale and any required tuning. Explain any challenges faced, and how you addressed them (or would address them if they are unsolved). If your model contains arbitrary constants, please explain why they exist or what they might mean.
Validation: For validation, explain the set of kernel parameters you used to validate your model.
Then present an analysis that shows the error across some range of kernels.
You must at least show error as problem sizes change in one graph, and and as tile sizes change in another
graph. Compare the accuracy of our model to a roofline model.
Explain what parameter settings the model performs well or poorly on.
Architecture Insight: Based on your model, which hardware parameters should future GPUs change if they want to perform better on your implementation of convolution? To answer this question, perform a sensitivity analysis on some parameters.
In a separate attachment on canvas, please also include the model code in whatever language you choose. Please do not zip the PDF and source together, as this makes it hard to read on Canvas.
If you really insist on modeling matrix multiplication instead of convolution, I will allow it. However, matrix multiply is much much simpler, so you need to validate your model for a pretty wide range of matrix multiplication sizes, including very wide and very thin matrices. Follow-up with me if that doesn’t make sense.
If you are ambitious enough to want to model an accelerator other than the GPU, you can also do that.
But ideally you’d have some way to validate how accurate you are … maybe see me if you want to go this route.
I hope so. I suggest performing analysis similar to the DianNao lecture.
Even though this of course will work well, I frankly don’t think you will learn much this way. I only suggest this kind of approach if you can train a model such that the parameters have a hardware meaning.
No, but its more fun if it is, at least somewhat. A good model will be kindof close for a wider range of parameters, rather than extremely accurate on one or two parameter settings.