Homework 1 -- Microbench Analysis

Overview

Groups: You may work in groups of up to two.

Due: Wed, Feb 1st, EOD

The goals of this homework are:

  1. Get familiar with the basics of gem5 usage and statistics
  2. Practice thinking/analysis across the stack (application/ISA/microarchitecture)

Step 1: Download/Compile the microbenchmarks

This homework uses a few simple microbenchmarks, which you can download below.

git clone https://github.com/PolyArch/cs251a-microbench.git

Just follow the steps in the repo to compile the benchmarks.
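
If the repo follows the usual pattern, building boils down to something like the lines below; check the repo's README for the exact steps (the assumption here is that a plain make builds all five benchmarks):

cd cs251a-microbench
make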

Step 2: Experiments

In this step, you will change various aspects of the CPU model or program, and observe the simulated performance and microarchitectural behavior on the five given microbenchmarks (mm, spmv, lfsr, merge, sieve).

Configuration

Gem5 is … interesting in that there are several ways to create a configuration:

  1. Create a configuration from scratch, according to the gem5 tutorial
  2. Use the “default” script, se.py. Here you would need to search through the parameters to find the ones you want to change.
  3. Use the “standard library”

Option 1 is the most low-level and arguably gets you the most intimately familiar with configuring things, though it’s not super friendly to use. Option 2, se.py, is what I normally use: it is sufficient for most things and is “good enough”. Option 3 is the latest and greatest, supposedly being both simple and flexible. However, I haven’t gotten too familiar with it yet – if you use it, let me know how it goes or if there are any problems.
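
For reference, here is a stripped-down sketch of what option 1 looks like, adapted from the “learning gem5” tutorial. Class names, port names, and the workload setup vary between gem5 versions (this is written against a recent X86 build), and the benchmark path is a placeholder, so treat it as a template rather than something to copy verbatim:

import m5
from m5.objects import *

# Bare-bones SE-mode system: one CPU, no caches, one DDR3 channel.
system = System()
system.clk_domain = SrcClockDomain(clock='1GHz', voltage_domain=VoltageDomain())
system.mem_mode = 'timing'
system.mem_ranges = [AddrRange('512MB')]

# A simple CPU for illustration; the homework's CPU models would plug in here,
# with caches added as in the tutorial's two-level example.
system.cpu = X86TimingSimpleCPU()
system.membus = SystemXBar()
system.cpu.icache_port = system.membus.cpu_side_ports
system.cpu.dcache_port = system.membus.cpu_side_ports

# x86 needs its interrupt controller wired into the memory system.
system.cpu.createInterruptController()
system.cpu.interrupts[0].pio = system.membus.mem_side_ports
system.cpu.interrupts[0].int_requestor = system.membus.cpu_side_ports
system.cpu.interrupts[0].int_responder = system.membus.mem_side_ports
system.system_port = system.membus.cpu_side_ports

system.mem_ctrl = MemCtrl()
system.mem_ctrl.dram = DDR3_1600_8x8()
system.mem_ctrl.dram.range = system.mem_ranges[0]
system.mem_ctrl.port = system.membus.mem_side_ports

# Point the system at one of the compiled microbenchmarks (placeholder path).
binary = 'cs251a-microbench/mm'
system.workload = SEWorkload.init_compatible(binary)
process = Process()
process.cmd = [binary]
system.cpu.workload = process
system.cpu.createThreads()

root = Root(full_system=False, system=system)
m5.instantiate()
exit_event = m5.simulate()
print('Exiting @ tick %i because %s' % (m5.curTick(), exit_event.getCause()))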

Statistics

Each run of the simulator produces a statistics file as output – save the statistics file generated by each run. Warning: by default, gem5 writes its output to the same folder (m5out) every time, so make sure to move your output files before each subsequent run (or create a separate output directory for each experiment).
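
One convenient alternative to moving m5out by hand is to give each run its own output directory with the -d/--outdir option of the gem5 binary itself (the option goes before the config script). The paths below are just examples, and the binary name depends on how you built gem5:

build/X86/gem5.opt --outdir=results/mm_baseline configs/example/se.py --cmd=./mm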

Experiment Definition

Perform the following experiments. Please assume the following as the default configuration:

  • CPU: DerivO3CPU (the OOO core)
  • Compiler: gcc -O3 -ffast-math -ftree-vectorize (this enables vectorization)
  • Frequency: 1 GHz
  • Memory: DDR3_1600_8x8
  • 2-level cache hierarchy (64KB L1, 2MB L2)
  • Issue width: 8
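
If you go the se.py route, the default configuration above corresponds roughly to the invocation below. The flag names come from gem5’s common options and may differ slightly between versions; the 64KB L1 is assumed to mean 64kB each for L1I and L1D, the benchmark path is a placeholder, and issue width has no command-line flag (you edit it inside the script, as described in experiment 2):

build/X86/gem5.opt configs/example/se.py \
  --cpu-type=DerivO3CPU --caches --l2cache \
  --l1i_size=64kB --l1d_size=64kB --l2_size=2MB \
  --cpu-clock=1GHz --mem-type=DDR3_1600_8x8 \
  --cmd=./mm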

  1. CPU model: Change the CPU model between DerivO3CPU and X86MinorCPU (OOO and inorder cores).

  2. Issue Width: Try issueWidth=2 and issueWidth=8.
    • E.g. in se.py you would change “system.cpu[i].issueWidth”
    • It’s probably a good idea to make all of the stage widths match (see the sketch after this list):
      • For the OOO core, set the fetch/decode/rename/dispatch/issue/wb/commit widths to the issue width
      • For the inorder core, at least change these to the issue width: decodeInputWidth, executeInputWidth, executeIssueLimit, executeCommitLimit
      • Optionally, you may also want to scale up some other pieces of the inorder core according to the issue width, such as: fetch1FetchLimit, fetch2InputBufferSize, decodeInputBufferSize, executeMemoryIssueLimit, executeMemoryCommitLimit, executeInputBufferSize, executeMaxAccessesInMemory, executeLSQRequestsQueueSize, executeLSQTransfersQueueSize, executeLSQStoreBufferSize
  3. Clock frequency: Vary the CPU clock from 1 GHz to 4 GHz.
    • Note that changing the frequency doesn’t affect gem5’s pipeline depth or any other microarchitectural aspect; it also won’t affect the DRAM memory clock.
  4. L2 Cache Capacity: Try the following configurations: No L2 Cache, 256KB L2, 2MB L2, 16MB L2

  5. Compiler aggressiveness: Vary the compiler optimization level between -O1 and “-O3 -ffast-math -ftree-vectorize”
    • You’ll of course have to recompile to do this. You can change the OPT parameter in the Makefile (or override it on the command line, as shown after this list).
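
For experiment 2, the width change is a small edit inside se.py after the CPUs have been created. Below is a sketch of one way to do it; the parameter names are the DerivO3CPU ones, and it assumes se.py stores its CPUs in the system.cpu list (the MinorCPU parameters from the list above would be set the same way):

# Widen (or narrow) every stage of each O3 CPU to match the issue width.
# For X86MinorCPU, set decodeInputWidth, executeInputWidth,
# executeIssueLimit, and executeCommitLimit instead.
width = 8   # or 2 for the narrow configuration
for cpu in system.cpu:
    cpu.fetchWidth = width
    cpu.decodeWidth = width
    cpu.renameWidth = width
    cpu.dispatchWidth = width
    cpu.issueWidth = width
    cpu.wbWidth = width
    cpu.commitWidth = width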
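
For experiment 5, assuming the microbenchmark Makefile reads its flags from the OPT variable mentioned above (and doesn’t force a value), you can also override it from the command line instead of editing the file:

make clean
make OPT="-O1"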

Notes:

Step 3: Analysis

After collecting all of the data from the previous step, analyze the statistics and graph an overall performance comparison across your experiments. Then answer the following questions, and include any other stats/graphs if they are helpful.

  1. What metric (and mean) should you use to compare the performance between different system configurations? Why?

  2. What’s more important, high issue width or OOO execution? Does it depend on the application?

  3. Did any benchmarks benefit from removing the L2 cache? Why or why not?

  4. For each microbenchmark:

    a. Describe/characterize its i) memory regularity, ii) control regularity, and iii) locality.

    b. Explain which microarchitectural parameter it is most sensitive to (in other words, what you think the “bottleneck” is), and justify your answer using reasoning or statistics.

  5. Pick a microbenchmark; propose one application enhancement, one ISA enhancement, and one microarchitecture enhancement that you think would be very effective for that microbenchmark.

Handin Instructions

Submit two files to the Canvas assignment:

Note that you will need to sign up for a group on Canvas to turn in the homework. Please sign up by going to the People page, then Groups, then searching for “hw1”. Find an empty group and add yourself (and optionally your partner) to it.

How we will grade this:

40 points for completing the experiments, and 10 points per question.