Course Project

The final component of this course is a project; there is no final exam. The topic should ideally be selected and proposed by week 5.

Quick Facts

Project Content

The goal of the project is to give you an opportunity to get hands-on experience with architecture experimentation and maybe even research.

Propose a research idea and evaluate it using any means you like. You are free to combine with ongoing research from your own studies, or with another course, provided the scope of the project implementation submitted for this course is sufficient. Please see me if this is unclear.

Report guidelines:

The final project report should document the work you performed and your findings. Strive to make the presentation of your report the same quality as the papers you have read during the quarter, even if it ends up being much shorter. The paper should stand alone as well – the concepts should be understandable by your classmates without having to read additional papers (i.e., include the relationship to existing work). Please format the paper nicely, and organize it with good structure. Finally, please include a statement of work which describes how each student contributed to the project.

As you may have noticed, papers typically follow one of a few canonical structures. The one below is a reasonable approach.

Project Ideas

A few broad project ideas are below. These are meant as examples; projects that don’t fit into any of these categories are acceptable.

Accelerator for Machine Learning Kernel:

This project is straightforward, but very broad: design/evaluate a hardware accelerator for an ML algorithm of interest. There are different facets of this to think about to create an interesting problem:

The evaluation criteria could include how large, fast, and power hungry the design is as compared to some relevant baseline. You might choose a baseline to be an existing CPU/GPU code. For evaluation, you could use a custom simulator, or an analytical model. Feel free to use an FPGA if you already have some experience.
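If you go the analytical-model route, a roofline-style bound is a reasonable first cut: estimate whether a kernel on your accelerator is compute-bound or memory-bound. The sketch below is generic; the peak-FLOP/s and bandwidth numbers are made-up placeholders, not figures for any real chip.

```python
# Hypothetical roofline-style analytical model for a first-cut
# accelerator evaluation. Peak throughput and bandwidth values below
# are illustrative placeholders only.

def roofline_time(flops, dram_bytes, peak_flops, peak_bw):
    """Lower-bound execution time: the kernel is limited either by
    compute throughput or by memory bandwidth, whichever is slower."""
    compute_time = flops / peak_flops
    memory_time = dram_bytes / peak_bw
    return max(compute_time, memory_time)

# Example: a 1024^3 matrix multiply (2*N^3 FLOPs) that streams all
# three matrices from DRAM exactly once (4-byte floats).
N = 1024
flops = 2 * N ** 3
dram_bytes = 3 * N * N * 4

t = roofline_time(flops, dram_bytes, peak_flops=1e12, peak_bw=100e9)
print(f"estimated time: {t * 1e3:.3f} ms")
```

A model like this is crude, but it is often enough to argue whether a proposed design point is even worth simulating in detail.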

Optimization Framework

Timeloop/Accelergy and MAESTRO are tools that model an accelerator's performance and energy as a function of its dataflow. These modeling tools can be enhanced in a number of ways:
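To get a feel for what these tools compute, here is a back-of-the-envelope version of one piece of the problem: counting DRAM traffic for a tiled matrix multiply as a function of tile size. This is a deliberately simplified stand-in for the detailed mapping analysis Timeloop performs; the tiling scheme (output tile held on chip across the reduction loop) is one assumed dataflow among many.

```python
# Simplified dataflow model: words moved between DRAM and an on-chip
# buffer for an NxN matmul tiled into TxT blocks, assuming each output
# (C) tile stays on chip across the reduction (k) loop while A and B
# tiles are re-fetched for every tile-multiply.

def dram_words_tiled(N, T):
    assert N % T == 0
    ab_traffic = 2 * (N // T) ** 3 * T * T   # A and B tiles, re-fetched
    c_traffic = 2 * N * N                    # each C tile read + written once
    return ab_traffic + c_traffic

# Larger tiles buy more reuse: for N = 1024, going from T = 1 (no
# on-chip reuse) to T = 64 cuts traffic by roughly 60x.
print(dram_words_tiled(1024, 1))
print(dram_words_tiled(1024, 64))
```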

CPU Tensor Cores (or other CPU-core enhancements):

Implement a tensor core for a CPU. Tensor cores have been added to GPUs to help improve their performance on high-compute-density operations. This basically boils down to a few new instructions in the ISA for doing matrix multiply tiles. One could add these to a CPU instruction set, and evaluate with gem5.
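To make the instruction semantics concrete, here is a functional (not timing) sketch of what a matrix-tile instruction might compute: D = A @ B + C over small fixed-size tiles, roughly analogous to a GPU tensor-core MMA op. The 4x4 tile size and register layout are illustrative assumptions, not a specific ISA proposal.

```python
# Functional model of a hypothetical CPU tile-MMA instruction:
# D = A @ B + C on TILE x TILE operands. Tile size is an assumption.

TILE = 4

def mma_tile(A, B, C):
    """D[i][j] = sum_k A[i][k] * B[k][j] + C[i][j] over one tile."""
    D = [[0.0] * TILE for _ in range(TILE)]
    for i in range(TILE):
        for j in range(TILE):
            acc = C[i][j]
            for k in range(TILE):
                acc += A[i][k] * B[k][j]
            D[i][j] = acc
    return D
```

In a gem5-based project, this would become the execute semantics of a new instruction in the ISA description, and the interesting work is deciding how the tile operands map onto architectural registers and what latency/throughput to model.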

More broadly, perhaps there are other challenges that limit CPU performance on, e.g., CNNs. Another project could be to run some CNN layers on a CPU simulator, identify the bottlenecks, and show how to improve the microarchitecture (or add other instructions).

Novel GPU Parallelization:

Generally, kernels on a GPU are parallelized one at a time (i.e., one layer at a time across all CUDA cores). Sometimes this is inefficient because there isn't enough work per kernel. What if two or more kernels ran on the GPU at the same time, each processing its own element of a batch? This could lead to more efficient execution for a variety of networks.
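A quick way to see the potential win is a toy occupancy model: with wave-based scheduling, a small kernel leaves cores idle in its last wave, and co-scheduling a second kernel can fill those slots. The core and block counts below are arbitrary illustrative numbers.

```python
import math

# Toy occupancy model: fraction of core-timeslots doing useful work
# when `work_units` equal-sized blocks are scheduled in waves across
# `num_cores` cores. Numbers below are illustrative, not measured.

def utilization(work_units, num_cores):
    waves = math.ceil(work_units / num_cores)
    return work_units / (waves * num_cores)

# One small kernel alone: 40 blocks on 32 cores -> 2 waves, 62.5% busy.
print(utilization(40, 32))
# Two such kernels co-scheduled: 80 blocks -> 3 waves, ~83% busy.
print(utilization(80, 32))
```

A real study would of course have to account for cache and memory-bandwidth interference between the co-running kernels, which this model ignores.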

This could be done for an accelerator as well!

One could also look into parallelizing other ML algorithms on GPUs like gradient-boosting tree training, or various algorithms for recommender systems like alternating least squares.

Machine learning models for microarchitecture policy optimization:

There are many microarchitectural policies within programmable architectures (CPUs/GPUs) that are (1) complicated to design and analyze, (2) brittle across hardware versions, requiring redesign, and (3) only optimal for particular workloads. A machine learning-based approach can potentially mitigate all of the above; the competition infrastructure listed under Evaluation below suggests some example areas (branch prediction, prefetching, value prediction, memory scheduling).

Evaluation: For any computer-architecture study, you can always use a generic simulator with generic workloads; for example, you could use gem5.

The other option is to use the infrastructure from prior microarchitecture competitions.

  1. Championship Branch Prediction

  2. Data Prefetching Championship

  3. Championship Value Prediction

  4. Memory Scheduling Championship
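For the first of these areas, a perceptron branch predictor (in the spirit of Jimenez and Lin's well-known design) is the classic example of an ML-driven microarchitectural policy. Below is a minimal functional sketch; the table size, history length, and training threshold are illustrative choices, and a competition entry would need a hardware-faithful version.

```python
# Minimal sketch of a perceptron branch predictor. Sizes and the
# training threshold are illustrative assumptions.

HIST_LEN = 8
THRESHOLD = 16    # train while |output| is below this (theta)
TABLE_SIZE = 64

weights = [[0] * (HIST_LEN + 1) for _ in range(TABLE_SIZE)]  # +1 = bias
history = [1] * HIST_LEN            # global history: +1 taken, -1 not

def predict(pc):
    w = weights[pc % TABLE_SIZE]
    y = w[0] + sum(wi * hi for wi, hi in zip(w[1:], history))
    return y >= 0, y

def update(pc, taken):
    pred, y = predict(pc)
    t = 1 if taken else -1
    w = weights[pc % TABLE_SIZE]
    if pred != taken or abs(y) <= THRESHOLD:   # perceptron training rule
        w[0] += t
        for i in range(HIST_LEN):
            w[i + 1] += t * history[i]
    history.pop(0)
    history.append(t)
```

A project could start from a sketch like this, then explore richer models (or the hardware cost of implementing them) against the championship traces.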

Universal Approximation Accelerators:

Interestingly, some ML algorithms are quite efficient to compute because of how simple/regular they are. What's bizarre, though, is that if some error can be tolerated, they can sometimes even be more efficient than an existing exact algorithm!

The NPU paper developed an accelerator, integrated with a CPU, that computes approximate versions of general functions using neural networks. They showed that fully-connected NN-approximated code, running on their accelerator, is actually faster than the precise computation for certain workloads.

However, there is still an open question on what the right ML model for automated approximate computing should be. The project here could explore different types of ML models (and model hardware implementations?) which could trade-off accuracy for performance.
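To make the core idea tangible, here is a toy version of neural function approximation: train a tiny fully-connected network to mimic a "hot" function and measure the residual error. The target function (f(x) = x^2), network size, and learning rate are all illustrative choices, not taken from the NPU paper; a project would substitute real approximable code regions and compare against their exact cost.

```python
import math
import random

# Toy neural function approximation: fit f(x) = x*x on [0, 1] with a
# one-hidden-layer tanh network trained by plain SGD. All
# hyperparameters here are arbitrary illustrative choices.

random.seed(0)
H = 8
w1 = [random.uniform(-1, 1) for _ in range(H)]
b1 = [0.0] * H
w2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0

def forward(x):
    h = [math.tanh(w1[j] * x + b1[j]) for j in range(H)]
    return sum(w2[j] * h[j] for j in range(H)) + b2, h

xs = [i / 31 for i in range(32)]
lr = 0.05
for epoch in range(3000):
    for x in xs:
        y, h = forward(x)
        err = y - x * x                       # dLoss/dy for 0.5*(y-t)^2
        for j in range(H):
            g = err * w2[j] * (1 - h[j] ** 2)  # backprop through tanh
            w2[j] -= lr * err * h[j]
            w1[j] -= lr * g * x
            b1[j] -= lr * g
        b2 -= lr * err

mse = sum((forward(x)[0] - x * x) ** 2 for x in xs) / len(xs)
print(f"approximation MSE: {mse:.5f}")
```

The interesting project questions start where this sketch stops: which model families give the best accuracy-per-joule, and what their hardware implementations would look like.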