Homework 1 -- Microbench Analysis

Overview

Groups: You may work in groups of up to two.

Due: Wed, Feb 1st, EOD

The goals of this homework are:

  1. Get familiar with the basics of gem5 usage and statistics
  2. Practice thinking/analysis across the stack (application/ISA/microarchitecture)

Step 1: Download/Compile the microbenchmarks

This homework uses a few simple microbenchmarks, which you can download below.

git clone https://github.com/PolyArch/cs251a-microbench.git

Just follow the steps in the repo to compile the benchmarks.
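
If the repo follows the usual pattern, building boils down to something like the lines below; check the repo's README for the exact steps (the assumption here is that a plain make builds all five benchmarks):

cd cs251a-microbench
make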

Step 2: Experiments

In this step, you will change various aspects of the CPU model or program, and observe the simulated performance and microarchitectural behavior on the five given microbenchmarks (mm, spmv, lfsr, merge, sieve).

Configuration

Gem5 is … interesting in that there are several ways to create a configuration:

  1. Create a configuration from scratch, according to the gem5 tutorial
  2. Use the “default” script, se.py. Here you would need to search through the parameters to find the ones you want to change.
  3. Use the “standard library”

Option 1 is the most low-level and arguably gets you the most intimately familiar with configuring things, though it’s not super friendly to use. Option 2, se.py, is what I normally use: it is sufficient for most things and is “good enough”. Option 3 is the latest and greatest, supposedly being both simple and flexible. However, I haven’t gotten too familiar with it yet – if you use it, let me know how it goes or if there are any problems.
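
For reference, here is a stripped-down sketch of what option 1 looks like, adapted from the “learning gem5” tutorial. Class names, port names, and the workload setup vary between gem5 versions (this is written against a recent X86 build), and the benchmark path is a placeholder, so treat it as a template rather than something to copy verbatim:

import m5
from m5.objects import *

# Bare-bones SE-mode system: one CPU, no caches, one DDR3 channel.
system = System()
system.clk_domain = SrcClockDomain(clock='1GHz', voltage_domain=VoltageDomain())
system.mem_mode = 'timing'
system.mem_ranges = [AddrRange('512MB')]

# A simple CPU for illustration; the homework's CPU models would plug in here,
# with caches added as in the tutorial's two-level example.
system.cpu = X86TimingSimpleCPU()
system.membus = SystemXBar()
system.cpu.icache_port = system.membus.cpu_side_ports
system.cpu.dcache_port = system.membus.cpu_side_ports

# x86 needs its interrupt controller wired into the memory system.
system.cpu.createInterruptController()
system.cpu.interrupts[0].pio = system.membus.mem_side_ports
system.cpu.interrupts[0].int_requestor = system.membus.cpu_side_ports
system.cpu.interrupts[0].int_responder = system.membus.mem_side_ports
system.system_port = system.membus.cpu_side_ports

system.mem_ctrl = MemCtrl()
system.mem_ctrl.dram = DDR3_1600_8x8()
system.mem_ctrl.dram.range = system.mem_ranges[0]
system.mem_ctrl.port = system.membus.mem_side_ports

# Point the system at one of the compiled microbenchmarks (placeholder path).
binary = 'cs251a-microbench/mm'
system.workload = SEWorkload.init_compatible(binary)
process = Process()
process.cmd = [binary]
system.cpu.workload = process
system.cpu.createThreads()

root = Root(full_system=False, system=system)
m5.instantiate()
exit_event = m5.simulate()
print('Exiting @ tick %i because %s' % (m5.curTick(), exit_event.getCause()))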

Statistics

Each run of the simulator produces a statistics file as output – save the statistics file generated by each run. Warning: by default, gem5 writes its output to the same folder (m5out) every time, so make sure to move your output files before each subsequent run (or create a separate output directory for each experiment).
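
One convenient alternative to moving m5out by hand is to give each run its own output directory with the -d/--outdir option of the gem5 binary itself (the option goes before the config script). The paths below are just examples, and the binary name depends on how you built gem5:

build/X86/gem5.opt --outdir=results/mm_baseline configs/example/se.py --cmd=./mm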

Experiment Definition

Perform the following experiments. Please assume the following as the default configuration:

  • CPU: DerivO3CPU (the OOO core)
  • Compiler: gcc -O3 -ffast-math -ftree-vectorize (this enables vectorization)
  • Frequency: 1 GHz
  • Memory: DDR3_1600_8x8
  • 2-level cache hierarchy (64KB L1, 2MB L2)
  • Issue width: 8
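
If you go the se.py route, the default configuration above corresponds roughly to the invocation below. The flag names come from gem5’s common options and may differ slightly between versions; the 64KB L1 is assumed to mean 64kB each for L1I and L1D, the benchmark path is a placeholder, and issue width has no command-line flag (you edit it inside the script, as described in experiment 2):

build/X86/gem5.opt configs/example/se.py \
  --cpu-type=DerivO3CPU --caches --l2cache \
  --l1i_size=64kB --l1d_size=64kB --l2_size=2MB \
  --cpu-clock=1GHz --mem-type=DDR3_1600_8x8 \
  --cmd=./mm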

  1. CPU model: Change the CPU model between DerivO3CPU and X86MinorCPU (OOO and inorder cores).

  2. Issue Width: Try issueWidth=2 and issueWidth=8.
    • E.g. in se.py you would change “system.cpu[i].issueWidth”
    • It’s probably a good idea to make all of the stage widths match (see the sketch after this list):
      • For the OOO core, set the fetch/decode/rename/dispatch/issue/wb/commit widths to the issue width
      • For the inorder core, at least change these to the issue width: decodeInputWidth, executeInputWidth, executeIssueLimit, executeCommitLimit
      • Optionally, you may also want to scale up some other pieces of the inorder core according to the issue width, such as: fetch1FetchLimit, fetch2InputBufferSize, decodeInputBufferSize, executeMemoryIssueLimit, executeMemoryCommitLimit, executeInputBufferSize, executeMaxAccessesInMemory, executeLSQRequestsQueueSize, executeLSQTransfersQueueSize, executeLSQStoreBufferSize
  3. Clock frequency: Vary the CPU clock from 1 GHz to 4 GHz.
    • Note that changing the frequency doesn’t affect gem5’s pipeline depth or any other microarchitectural aspect; it also won’t affect the DRAM memory clock.
  4. L2 Cache Capacity: Try the following configurations: No L2 Cache, 256KB L2, 2MB L2, 16MB L2

  5. Compiler aggressiveness: Vary the compiler optimization level between -O1 and “-O3 -ffast-math -ftree-vectorize”
    • You’ll of course have to recompile to do this. You can change the OPT parameter in the Makefile (or override it on the command line, as shown after this list).
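
For experiment 2, the width change is a small edit inside se.py after the CPUs have been created. Below is a sketch of one way to do it; the parameter names are the DerivO3CPU ones, and it assumes se.py stores its CPUs in the system.cpu list (the MinorCPU parameters from the list above would be set the same way):

# Widen (or narrow) every stage of each O3 CPU to match the issue width.
# For X86MinorCPU, set decodeInputWidth, executeInputWidth,
# executeIssueLimit, and executeCommitLimit instead.
width = 8   # or 2 for the narrow configuration
for cpu in system.cpu:
    cpu.fetchWidth = width
    cpu.decodeWidth = width
    cpu.renameWidth = width
    cpu.dispatchWidth = width
    cpu.issueWidth = width
    cpu.wbWidth = width
    cpu.commitWidth = width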
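
For experiment 5, assuming the microbenchmark Makefile reads its flags from the OPT variable mentioned above (and doesn’t force a value), you can also override it from the command line instead of editing the file:

make clean
make OPT="-O1"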

Notes:

Step 3: Analysis

After collecting all of the data from the previous step, analyze the statistics and graph an overall performance comparison across your experiments. Then answer the following questions, and include any other stats/graphs if they are helpful.

  1. What metric (and mean) should you use to compare the performance between different system configurations? Why?

  2. What’s more important, high issue width or OOO execution? Does it depend on the application?

  3. Did any benchmarks benefit from removing the L2 cache? Why or why not?

  4. For each microbenchmark:

    a. Describe/characterize its i) memory regularity, ii) control regularity, and iii) locality.

    b. Explain which microarchitectural parameter it is most sensitive to (in other words, what you think the “bottleneck” is), and justify your answer using reasoning or statistics.

  5. Pick a microbenchmark; propose one application enhancement, one ISA enhancement, and one microarchitecture enhancement that you think would be very effective for that microbenchmark.

Handin Instructions

Submit two files to the Canvas assignment:

Note that you will need to sign up for a group on Canvas to turn in the homework. Please sign up by going to the People page, then Groups, then searching for “hw1”. Find an empty group and add yourself (and optionally your partner) to it.

How we will grade this:

40 points for completing the experiments, and 10 points per question.