Homework 3 -- LRU Approximations
Overview
Due: EOD Mar. 3rd, 2026
The goals of this homework are:
- Get experience hacking on part of gem5.
- Practice considering trade-offs in cache design.
- Work through simulation with more complex workloads (useful for projects!).
- As usual, you may work with a group.
Setup
In this lab, we’ll investigate different replacement policies, some of which use fewer bits of information per set than full LRU.
Step 1: LRU Approximations
Check out prior LRU approximations and see how they work:
- Tree PLRU: gem5 already implements a pseudo-LRU replacement policy based on trees (tree_plru_rp.hh).
- NRU (comparison baseline): Replace a block that has not been used recently, using 1 bit per block to remember whether the block has been used in the recent past. See this paper for more details.
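To build intuition for how tree-based pseudo-LRU works before reading tree_plru_rp.hh, here is a small standalone model of the idea (this is a sketch of the concept, not gem5's actual code or class names): an 8-way set keeps only 7 tree bits, each bit pointing toward the "less recently used" subtree.

```python
# Tree-PLRU sketch for one set: ways-1 bits arranged as a binary tree.
# Convention used here: bit == 0 means the LRU side is the left subtree,
# bit == 1 means it is the right subtree.
class TreePLRUSet:
    def __init__(self, ways=8):
        assert ways & (ways - 1) == 0, "ways must be a power of two"
        self.ways = ways
        self.bits = [0] * (ways - 1)  # heap-ordered internal tree nodes

    def touch(self, way):
        """On an access, flip each bit on the root-to-leaf path so it
        points *away* from the accessed way."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:                # accessed way is in left subtree
                self.bits[node] = 1      # so the LRU side is now the right
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0      # LRU side is now the left
                node, lo = 2 * node + 2, mid

    def victim(self):
        """Follow the bits from the root to find the pseudo-LRU way."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo
```

Note the storage cost: 7 bits per 8-way set, versus the bits needed to maintain a full recency ordering for true LRU.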
Your required implementation for this step is NMRU: replace a random block that is not the most recently used. This tutorial will tell you how to accomplish this in gem5.
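The NMRU state per set is tiny: you only need to remember which way is most recently used, then pick a victim uniformly at random among the others. A minimal standalone model of that selection logic (the class and method names here are illustrative, not gem5's API) might look like:

```python
import random

# NMRU sketch: remember only the MRU way per set; evict a uniformly
# random way that is not the MRU one.
class NMRUSet:
    def __init__(self, ways=8, seed=None):
        self.ways = ways
        self.mru = 0
        self.rng = random.Random(seed)

    def touch(self, way):
        self.mru = way  # only the most recently used way is tracked

    def victim(self):
        # Draw from ways-1 candidates, then skip over the MRU index
        # so the result is uniform over all non-MRU ways.
        way = self.rng.randrange(self.ways - 1)
        return way if way < self.mru else way + 1
```

In gem5 you would express this inside a replacement-policy subclass, using the existing random policy (random_rp) and the MRU-tracking policies as starting points.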
Step 2: Beyond LRU
LRU is not the best policy for all workloads. One example is when the working set is slightly larger than cache capacity: imagine a simple case where you're streaming through a vector many times. LRU always keeps the most recently used data, so by the time you wrap around to the beginning of the vector again, its first elements have already been evicted.
Look through the available cache replacement policies in gem5 (gem5/src/mem/cache/replacement_policies/), and select one that is thrash-resistant. We'll call that policy THRASH in the rest of this homework.
Step 3: Benchmarking
You will mainly benchmark with SPEC 2017. Follow the directions here to run SPEC 2017. Don’t wait until the last minute to figure this part out; it is probably the hardest part. Sometimes it’s hard to get every workload from SPEC working. If you can get at least 5 workloads running, that is enough.
Second, create a benchmark that you know will have thrashing behavior at some cache level (you can pick). Include this in your benchmark set to check whether THRASH actually delivers on thrash resistance.
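One simple recipe for such a microbenchmark is to stream repeatedly through a buffer slightly larger than the cache level you target, so each pass evicts the data the next pass needs. The sizes below are illustrative (~80KB against a 64KB L1D), and in practice you would likely write this in C so interpreter overhead doesn't obscure the access pattern:

```python
import array

# Thrashing microbenchmark sketch: the working set (~80KB of doubles)
# slightly exceeds a 64KB L1D, and we stream through it sequentially
# many times. Under LRU, every pass re-misses the whole buffer.
L1D_BYTES = 64 * 1024
buf = array.array('d', range((L1D_BYTES + 16 * 1024) // 8))

total = 0.0
for _ in range(100):      # many passes over the same working set
    for x in buf:         # sequential streaming access
        total += x
print(total)
```

Accumulating into `total` keeps the loads from being optimized away in a compiled version; confirm the thrashing with the L1D miss-rate counters in stats.txt.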
Step 4: Architectural Exploration
Suppose you are designing the L1 data (L1D) and L2 caches of a new processor based on gem5's out-of-order O3CPU. You're trying to decide whether it is worth implementing a more complex replacement policy, higher associativity, or both.
Run simulations of the policies discussed above (NMRU, NRU, PLRU, LRU, THRASH) on the SPEC workloads. Use a 64KB L1D and a 2MB L2, with the same replacement policy in both, and 8-way set associativity by default. Use these additional parameters: the L1I uses LRU, the frequency is 2GHz, and the core is 8-issue OoO.
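In a gem5 configuration script, the cache parameters above can be sketched roughly as follows. Parameter names follow gem5's classic cache model (src/mem/cache/Cache.py in recent releases; check your version), and the latency/MSHR values here are placeholders, not requirements of the assignment:

```python
# Hedged sketch of the Step 4 cache configuration using gem5's classic
# Cache SimObject. Latencies and MSHR counts are illustrative only.
from m5.objects import Cache, LRURP, TreePLRURP

class L1DCache(Cache):
    size = '64kB'
    assoc = 8
    tag_latency = 2            # placeholder latencies
    data_latency = 2
    response_latency = 2
    mshrs = 16
    tgts_per_mshr = 20
    replacement_policy = LRURP()   # swap in TreePLRURP(), your NMRU, etc.

class L2Cache(Cache):
    size = '2MB'
    assoc = 8
    tag_latency = 10
    data_latency = 10
    response_latency = 10
    mshrs = 32
    tgts_per_mshr = 12
    replacement_policy = LRURP()   # keep identical to the L1D policy
```

Swap the `replacement_policy` assignment (and, for the associativity question, `assoc`) between runs, keeping everything else fixed so the comparisons are fair.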
Now answer the following questions:
- Q1. How well do the three pseudo-LRU policies approximate LRU? Do they ever do better by accident?
- Q2. Is the thrash-resistant policy ever useful? Can you explain when and why? (Remember your microbenchmark.)
- Q3. What matters more: higher associativity or a more sophisticated policy?
- Q4. What’s your overall recommendation? Performance is most important, and storage cost is a secondary consideration.
Your report should describe the configurations you simulated and use charts/graphs to back up your answers and analysis. Feel free to choose the experiments as appropriate, or use microbenchmarks if you like.
What to Hand In
Please turn in to Bruin Learn separately:
- A PDF of the report.
- A zip file containing the config.ini and stats.txt for the gem5 simulations.
How We Will Grade This
50 points for completing the implementation/experiments, and 50 points for the analysis.
Citation
This homework is based on gem5 101 homework 5, but we use cooler workloads now. Thanks Zhengrong Wang!