Homework 2 -- M4 Accelerator (Magical Matrix Multiplication Machine)
Due Date: 11:59pm Tuesday, Feb. 17, 2026
In this assignment, you will implement “M4”, a small tile accelerator, in gem5 by defining new x86 instructions, decoding them into macro‑ops, and implementing micro‑ops that do timing memory accesses. Validate correctness with a provided test, then build your own benchmark.
Goals:
- Read gem5’s x86 decoder and add new instruction encodings.
- Understand macro‑ops vs micro‑ops in gem5.
- Implement timing‑accurate micro‑ops in C++.
- Enforce dependence ordering for accelerator state.
M4 Accelerator Concept
The M4 (Magical Matrix Multiplication Machine) is a simple tile accelerator that performs 4×4 matrix multiply-accumulate operations.
Hardware State
The accelerator maintains three pieces of internal state:
- A queue: A FIFO that holds 4×4 input tiles for the left operand
- B queue: A FIFO that holds 4×4 input tiles for the right operand
- Out: A 4×4 accumulator matrix for the result
This state exists outside the CPU’s register file, which means gem5’s normal speculation recovery mechanisms don’t know about it. You must handle this carefully (see Dependence Handling).
Instructions
You will implement four instructions to interact with this state:
- LOADA: Load a 4×4 tile from memory and push it onto the A queue.
- LOADB: Load a 4×4 tile from memory and push it onto the B queue. If both queues now have data, pop one tile from each and compute Out += A * B.
- LOADOUT: Load a 4×4 tile from memory into Out (useful for resuming a partial computation).
- STOREOUT: Store Out to memory and reset it to zeros.
Operation
When both the A and B queues are non-empty, the hardware automatically pops one tile from each and performs a matrix multiply-accumulate:
\[\text{Out}[i][j] \mathrel{+}= \sum_{k=0}^{3} A[i][k] \cdot B[k][j] \quad \text{for } i,j \in \{0,1,2,3\}\]
This check happens after each LOADA or LOADB. If Out hasn’t been initialized (no prior LOADOUT), it starts as zeros.
Queue Semantics
- Pairing: When both queues are non-empty, one tile is popped from each and computation occurs.
- Example: LOADA, LOADA, LOADB → the first A pairs with B (compute happens); the second A remains queued.
- Queue depth: Unlimited for this assignment (use std::deque). In a real design, queue depth would be bounded—well-structured programs should keep the number of unmatched tiles small.
- STOREOUT requirement: Both queues must be empty (all tiles paired) at the time of STOREOUT.
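The pairing rule above can be modeled in a few lines of C++. This is a software sketch of the queue semantics only (not gem5 code); the M4Model and Tile names are illustrative:

```cpp
#include <array>
#include <cassert>
#include <deque>

using Tile = std::array<float, 16>; // one 4x4 tile, row-major

struct M4Model {
    std::deque<Tile> aQueue, bQueue;
    Tile out{}; // zero-initialized accumulator

    // Called after every LOADA/LOADB: pair tiles while both queues have data.
    void tryCompute() {
        while (!aQueue.empty() && !bQueue.empty()) {
            Tile a = aQueue.front(); aQueue.pop_front();
            Tile b = bQueue.front(); bQueue.pop_front();
            for (int i = 0; i < 4; ++i)
                for (int j = 0; j < 4; ++j)
                    for (int k = 0; k < 4; ++k)
                        out[i * 4 + j] += a[i * 4 + k] * b[k * 4 + j];
        }
    }
    void loada(const Tile &t) { aQueue.push_back(t); tryCompute(); }
    void loadb(const Tile &t) { bQueue.push_back(t); tryCompute(); }
};
```

For the LOADA, LOADA, LOADB sequence from the example, this model pairs the first A with B and leaves one A tile queued.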
Memory Layout
Tiles must be stored in tile-major format so each 4×4 block is contiguous in memory:
float A[MBlocks][KBlocks][4][4]; // A[m][k] is a contiguous 64-byte tile
Within each tile, elements are in row-major order.
ISA Specification
Constraints
- Tile size = 4×4 FP32 (64 bytes)
- Tile memory must be 64-byte aligned
- Out is zero-initialized on first use (unless LOADOUT was called)
Example Usage
float A[MBlocks][KBlocks][4][4];
float B[KBlocks][NBlocks][4][4];
float C[MBlocks][NBlocks][4][4];
...
LOADA A[0][0] // enqueue first A tile
LOADB B[0][0] // enqueue first B tile → triggers Out += A[0][0] * B[0][0]
LOADA A[0][1] // enqueue second A tile
LOADB B[1][0] // enqueue second B tile → triggers Out += A[0][1] * B[1][0]
STOREOUT C[0][0] // write result, clear Out
You have freedom to do what you want
Everything after this point is notes on how you can implement the microarchitecture of this ISA, hopefully correctly, in gem5. You may ignore it and go off on your own if you like, implementing things your own way. Either way, you should describe your microarchitecture in the report.
With great freedom comes great fun. : )
Where Things Live in gem5
- Decoder: src/arch/x86/isa/decoder/x87.isa
- Macro‑ops: src/arch/x86/isa/insts/x87/m4.py
- Micro‑op declarations: src/arch/x86/isa/microops/m4.isa
- Micro‑op C++: src/arch/x86/insts/m4_microop.hh/.cc
Useful Includes in m4_microop.cc
Students often miss these headers when implementing the C++ micro‑ops. The following includes are typically required:
#include "arch/x86/memhelpers.hh" // initiateMemRead / initiateMemWrite helpers
#include "arch/x86/regs/int.hh" // int_reg::MicroBegin for dep reg selection
#include "base/logging.hh" // panic(), DPRINTF (if you add debug)
#include "cpu/thread_context.hh" // ThreadContext accessors
#include "enums/MemoryMode.hh" // enums::timing vs enums::atomic check
#include "sim/system.hh" // system->getMemoryMode()
Step 1: Decoder (x87 escape space)
Open src/arch/x86/isa/decoder/x87.isa. All x87 opcodes are 0xD8–0xDF.
Understanding the x87.isa Decoder File
The x87.isa file defines how x86 floating-point instructions are decoded. Let’s walk through how to read it.
Structure Overview
The file is organized as a nested decode tree. Each level decodes a different field of the instruction:
format WarnUnimpl {
0x1B: decode OPCODE_OP_BOTTOM3 default Inst::UD2() {
0x0: decode MODRM_REG {
...
}
0x1: decode MODRM_REG {
...
}
// etc.
}
}
Key fields being decoded:
- OPCODE_OP_BOTTOM3: The bottom 3 bits of the opcode byte. Since x87 opcodes are 0xD8–0xDF, this value ranges from 0–7: 0x0 → opcode 0xD8, 0x1 → opcode 0xD9, 0x2 → opcode 0xDA, etc.
- MODRM_REG: The 3-bit reg field from the ModRM byte (bits 5:3). This is often used as an opcode extension for x87 instructions.
- MODRM_MOD: The 2-bit mod field from the ModRM byte (bits 7:6). When mod=3, the instruction operates on registers; otherwise it operates on memory.
- MODRM_RM: The 3-bit rm field from the ModRM byte (bits 2:0). When mod=3, this selects which register to use.
Reading an Entry
Let’s decode this entry:
0x2: decode MODRM_REG {
...
0x4: decode MODRM_MOD {
0x3: Inst::LOADA(Rq);
default: fisub();
}
...
}
This means:
- 0x2 → We’re in opcode 0xDA (since 0xDA & 0x7 = 2)
- MODRM_REG = 0x4 → The reg field of ModRM is 4
- MODRM_MOD = 0x3 → The mod field is 3 (register mode)
- Inst::LOADA(Rq) → Decode as our LOADA instruction. Rq means the operand is a 64-bit integer register selected by ModRM.rm (plus the rex.b extension for r8–r15).
When mod != 3, it falls through to fisub() (the original x87 instruction we’re “borrowing” encoding space from).
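The field extraction the decoder performs can be sanity-checked with a few lines of plain C++ (a standalone sketch mirroring the bit fields, not gem5 code):

```cpp
#include <cassert>
#include <cstdint>

struct ModRMFields { uint8_t mod, reg, rm; };

// The bottom 3 bits of the opcode byte select the 0xD8-0xDF sub-table.
inline uint8_t opcodeBottom3(uint8_t opcode) { return opcode & 0x7; }

// Split a ModRM byte into mod (7:6), reg (5:3), and rm (2:0).
inline ModRMFields decodeModRM(uint8_t byte) {
    return { uint8_t(byte >> 6), uint8_t((byte >> 3) & 0x7),
             uint8_t(byte & 0x7) };
}
```

Running the LOADA encoding bytes 0xDA, 0xE0 through this gives exactly the table entry above: sub-table 0x2, reg 4, mod 3, rm 0 (RAX).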
Decoder Mapping for M4 Instructions
Add these entries to x87.isa:
| Instruction | Opcode | reg | mod | Decoder Entry |
|---|---|---|---|---|
| LOADA | 0xDA | 4 | 3 | Inst::LOADA(Rq) |
| LOADOUT | 0xDA | 7 | 3 | Inst::LOADOUT(Rq) |
| LOADB | 0xDD | 1 | 3 | Inst::LOADB(Rq) |
| STOREOUT | 0xDD | 6 | 3 | Inst::STOREOUT(Rq) |
For example, to add LOADA, find the 0x2: block (opcode 0xDA), then find or add 0x4: decode MODRM_MOD, and add the 0x3: case:
0x4: decode MODRM_MOD {
0x3: Inst::LOADA(Rq);
default: fisub();
}
Step 2: Macro‑ops
Macro-ops define the high-level instruction that the decoder produces. Each macro-op expands into one or more micro-ops.
Create src/arch/x86/isa/insts/x87/m4.py:
microcode = """
def macroop LOADA_R
{
.adjust_env oszIn64Override
m4loada
};
"""
TODO: Add LOADOUT_R, LOADB_R, and STOREOUT_R following the same pattern.
What this does:
- Each macroop defines an instruction variant. The _R suffix indicates it takes a register operand.
- .adjust_env oszIn64Override ensures proper 64-bit operand handling.
- The body contains a micro-op name (m4loada) that we’ll define next.
For reference, look at data_transfer_and_conversion/load_or_store_floating_point.py which defines similar load/store macroops like FLD_R and FST_R.
Next, add "m4" to the categories list in src/arch/x86/isa/insts/x87/__init__.py.
Step 3: Micro‑op declarations
Micro-ops are the lowest level of instruction in gem5. They map to C++ classes that implement the actual behavior.
Create src/arch/x86/isa/microops/m4.isa. Here’s the first micro-op as an example:
let {{
# Define the m4loada micro-op
# This class tells gem5 how to instantiate our C++ micro-op
class M4LoadA(X86Microop):
# Constructor - no arguments needed since the register comes from ModRM
def __init__(self):
pass
# This method returns C++ code that creates the micro-op object
def getAllocator(self, microFlags):
# Arguments to the C++ constructor:
# machInst: the machine instruction (contains ModRM, etc.)
# macrocodeBlock: name of the containing macroop
# microFlags: flags like IsLastMicroop, IsFirstMicroop
# 0: memory request flags (none needed)
return "new X86ISA::M4LoadAMicroop(machInst, macrocodeBlock, %s, 0)" % \
self.microFlagsText(microFlags)
# Register this class under the name "m4loada"
microopClasses["m4loada"] = M4LoadA
# TODO: Define M4LoadOut, M4LoadB, and M4StoreOut following the same pattern.
# Each maps a micro-op name (e.g., "m4loadb") to a C++ class (e.g., M4LoadBMicroop)
}};
Include this file from src/arch/x86/isa/microops/microops.isa by adding:
##include "m4.isa"
In src/arch/x86/isa/includes.isa, add this include for the C++ declarations:
#include "arch/x86/insts/m4_microop.hh"
Step 4: Micro‑op implementation (C++)
Create src/arch/x86/insts/m4_microop.hh and src/arch/x86/insts/m4_microop.cc.
Also add the new source file to the x86 build list:
src/arch/x86/SConscript
Source('insts/m4_microop.cc', tags=['x86 isa'])
Accelerator State
You’ll need to track the accelerator state as execution progresses. At minimum you need the following; below is an example (not strictly required):
- State for A and B tiles that have been loaded (e.g. as queues, as shown below – but feel free to use your own approach)
- The running accumulator (Out)
struct M4State {
alignas(64) float Out[4][4]; // The output accumulator
bool outValid = false; // Has Out been initialized?
std::deque<std::array<float,16>> aQueue; // Queue of loaded A tiles
std::deque<std::array<float,16>> bQueue; // Queue of loaded B tiles
};
For this single-threaded assignment, you can simply use a global variable:
M4State state;
Note: To support multi-threaded simulation, you could extend this by using a std::unordered_map<ContextID, M4State> to give each thread its own accelerator state.
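A minimal sketch of that per-context extension might look like the following. This is an assumption-laden illustration (gem5's ContextID is stood in by int, and stateFor is a hypothetical helper a micro-op could call with xc->contextId()):

```cpp
#include <array>
#include <cassert>
#include <deque>
#include <unordered_map>

// Same state struct as above.
struct M4State {
    alignas(64) float Out[4][4] = {};         // The output accumulator
    bool outValid = false;                     // Has Out been initialized?
    std::deque<std::array<float, 16>> aQueue;  // Queue of loaded A tiles
    std::deque<std::array<float, 16>> bQueue;  // Queue of loaded B tiles
};

// One accelerator state per hardware context.
std::unordered_map<int, M4State> m4States;

// Fetch-or-create the state for a context. operator[] default-constructs
// a fresh M4State the first time a context is seen.
inline M4State &stateFor(int ctxId) { return m4States[ctxId]; }
```

Each context then sees its own queues and accumulator, so two simulated threads cannot corrupt each other's partial results.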
MicroOp Classes
Defining M4LoadAMicroop
Start by defining the class in the correct namespace in both the header and
.cc file. Here’s a complete (minimal) class definition you can drop into
m4_microop.hh and then fill in the methods in m4_microop.cc:
namespace gem5 {
namespace X86ISA {
class M4LoadAMicroop : public X86MicroopBase
{
private:
static constexpr int NumSrcRegs = 1;
static constexpr int NumDestRegs = 0;
RegId m4SrcRegIdx[NumSrcRegs];
RegId m4DestRegIdx[NumDestRegs];
Request::FlagsType memFlags;
public:
M4LoadAMicroop(ExtMachInst machInst, const char *inst_mnem,
uint64_t setFlags, Request::FlagsType mem_flags);
Fault execute(ExecContext *xc, trace::InstRecord *traceData) const override;
Fault initiateAcc(ExecContext *xc,
trace::InstRecord *traceData) const override;
Fault completeAcc(PacketPtr pkt, ExecContext *xc,
trace::InstRecord *traceData) const override;
};
} // namespace X86ISA
} // namespace gem5
The generated decoder/micro‑op glue refers to X86ISA::M4LoadAMicroop, etc.,
so the namespace must match.
What is memFlags?
memFlags is a bitmask (type Request::FlagsType) carried with memory requests.
In the micro‑op declaration above, we pass 0 (no special flags). You can just
forward that to the memory helpers.
Constructor and Flags
Every micro-op constructor must set up the instruction’s register dependencies and flags. Let’s build this up step by step.
Basic Constructor (without dependency handling)
First, here’s a minimal constructor that just reads the pointer register:
M4LoadAMicroop::M4LoadAMicroop(ExtMachInst machInst,
const char *inst_mnem, uint64_t setFlags,
Request::FlagsType mem_flags) :
X86MicroopBase(machInst, "m4loada", inst_mnem, setFlags, MemReadOp),
memFlags(mem_flags)
{
// Extract the register number from ModRM.rm field
// The rex.b bit extends it to 4 bits for r8-r15 access
const uint8_t rm = machInst.modRM.rm | (machInst.rex.b << 3);
const RegId ptr_reg(intRegClass, rm); // The register holding our pointer
// Tell the base class where our register arrays are
setRegIdxArrays(
reinterpret_cast<RegIdArrayPtr>(&M4LoadAMicroop::m4SrcRegIdx),
reinterpret_cast<RegIdArrayPtr>(&M4LoadAMicroop::m4DestRegIdx));
// We only read one register: the pointer
_numSrcRegs = 1;
_numDestRegs = 0;
m4SrcRegIdx[0] = ptr_reg;
// This is a load operation
flags[IsLoad] = 1;
// Force serialization - simple but slow
flags[IsSerializeBefore] = 1;
flags[IsSerializeAfter] = 1;
}
This will work correctly but has poor performance because every M4 instruction must wait for all prior instructions to complete.
Adding Dependency Handling (Better Performance)
Instead of full serialization, we can use a hidden dependency register that chains only M4 ops together, allowing other instructions to proceed in parallel:
M4LoadAMicroop::M4LoadAMicroop(ExtMachInst machInst,
const char *inst_mnem, uint64_t setFlags,
Request::FlagsType mem_flags) :
X86MicroopBase(machInst, "m4loada", inst_mnem, setFlags, MemReadOp),
memFlags(mem_flags)
{
const uint8_t rm = machInst.modRM.rm | (machInst.rex.b << 3);
const RegId ptr_reg(intRegClass, rm);
// Hidden micro-register for dependency tracking between M4 ops
constexpr int DepRegIdx = int_reg::MicroBegin + 2;
const RegId dep_reg(intRegClass, DepRegIdx);
setRegIdxArrays(
reinterpret_cast<RegIdArrayPtr>(&M4LoadAMicroop::m4SrcRegIdx),
reinterpret_cast<RegIdArrayPtr>(&M4LoadAMicroop::m4DestRegIdx));
// Now we read TWO registers and write ONE
_numSrcRegs = 2;
_numDestRegs = 1;
_numTypedDestRegs[IntRegClass] = 1;
m4SrcRegIdx[0] = dep_reg; // Read dependency register (wait for previous M4 op)
m4SrcRegIdx[1] = ptr_reg; // Read pointer register
m4DestRegIdx[0] = dep_reg; // Write dependency register (next M4 op waits for us)
flags[IsLoad] = 1;
flags[IsNonSpeculative] = 1; // Don't execute speculatively
flags[IsReadBarrier] = 1; // Order with respect to earlier memory ops
flags[IsWriteBarrier] = 1;
}
Key additions for dependency handling:
- dep_reg: A hidden micro-register that all M4 ops read and write, creating a RAW chain.
- IsNonSpeculative: Prevents execution on mispredicted paths.
- IsReadBarrier / IsWriteBarrier: Prevents memory reordering around this instruction.
Implementing the Execution Behavior
Each micro-op class derives from X86MicroopBase. You need to implement:
- execute(...): Fallback path for atomic/simple CPU models. Does the complete operation.
- initiateAcc(...): Starts a timing memory access (sends request to memory).
- completeAcc(...): Called when memory responds; processes the data.
Understanding ExecContext
ExecContext is the interface that micro-ops use to interact with the CPU during execution. It provides methods to:
- Read and write register values
- Perform memory operations
- Access the thread context for simulation state
Register access:
// Read a register operand (returns RegVal, which is uint64_t)
// The index corresponds to your src-reg array order in the constructor
RegVal addr = xc->getRegOperand(this, 1); // e.g., m4SrcRegIdx[1] = ptr_reg
// Write a register operand
// The index corresponds to your dest-reg array order
xc->setRegOperand(this, 0, RegVal(0)); // e.g., m4DestRegIdx[0] = dep_reg
Memory operations:
std::vector<bool> byte_enable(64, true); // 64 bytes for a 4x4 FP32 tile
// Reading memory:
std::array<uint8_t, 64> buf;
Fault fault = xc->readMem(addr, buf.data(), 64, memFlags, byte_enable); // atomic
Fault fault = initiateMemRead(xc, traceData, addr, 64, memFlags); // timing
// Writing memory (same call works for both atomic and timing):
Fault fault = xc->writeMem(data_ptr, 64, addr, memFlags, nullptr, byte_enable);
Timing Memory Accesses
When you mark an instruction with IsLoad or IsStore:
- O3 CPU (timing mode): The instruction gets an LSQ (Load-Store Queue) entry. Instead of calling execute(), the O3 pipeline routes it through the LSQ, which calls:
  - initiateAcc(): Submits the memory request to the cache/memory system.
  - completeAcc(): Called asynchronously when the memory response arrives.
- Simple CPUs (atomic mode): The execute() method is called directly, and memory operations happen synchronously.
Note: For reads, use readMem() in execute() and initiateMemRead() in initiateAcc(). For writes, writeMem() handles both modes — it performs an atomic write in execute() and initiates a timing write in initiateAcc().
You need to implement both paths:
Fault
M4LoadAMicroop::execute(ExecContext *xc, trace::InstRecord *traceData) const
{
// Atomic path (SimpleCPU) - do everything here
const Addr addr = xc->getRegOperand(this, 1);
std::array<uint8_t, 64> buf;
std::vector<bool> byte_enable(64, true);
Fault fault = xc->readMem(addr, buf.data(), 64, memFlags, byte_enable);
if (fault != NoFault)
return fault;
// TODO: Copy buf into accelerator state, trigger computation if ready
// Write dep_reg to create RAW dependency edge (value doesn't matter)
xc->setRegOperand(this, 0, xc->getRegOperand(this, 0));
return NoFault;
}
Fault
M4LoadAMicroop::initiateAcc(ExecContext *xc, trace::InstRecord *traceData) const
{
// Timing path - just start the memory request
const Addr addr = xc->getRegOperand(this, 1);
return initiateMemRead(xc, traceData, addr, 64, memFlags);
}
Fault
M4LoadAMicroop::completeAcc(PacketPtr pkt, ExecContext *xc,
trace::InstRecord *traceData) const
{
// Timing path - memory response arrived
const uint8_t *data = pkt->getConstPtr<uint8_t>();
// TODO: Copy data into accelerator state, trigger computation if ready
// Write dep_reg to create RAW dependency edge (value doesn't matter)
xc->setRegOperand(this, 0, xc->getRegOperand(this, 0));
return NoFault;
}
Dependence Handling (Critical)
The accelerator state (A/B queues, Out accumulator) is not in architectural registers. gem5 doesn’t automatically track dependencies on this state, so you must enforce ordering yourself.
The key correctness requirement: when you execute STOREOUT, the value stored must be the result of all preceding LOADA/LOADB pairs accumulated into Out.
Why This Is Tricky
By marking our instructions with IsLoad/IsStore, we get LSQ (Load-Store Queue) entries and go through the O3 CPU’s memory pipeline. This is necessary to use the memory system, but it also means our instructions participate in the CPU’s speculation machinery:
- The LSQ can speculatively execute loads before earlier stores complete
- The memory dependence predictor may reorder our operations
- Branch mispredictions could cause our instructions to execute on wrong paths
Our accelerator state lives outside the CPU’s register file, so the normal squash/recovery mechanisms don’t know how to undo changes to it. The approaches below are workarounds to disable speculation for M4 instructions while still using the memory system.
Note: A cleaner architectural approach would be to give the accelerator its own memory port, completely bypassing the CPU’s LSQ. But that’s more complex to implement and beyond the scope of this assignment… unless you’d rather go that way, which is fine as well.
Approach 1: Serialize All Operations
The simplest approach is to prevent any reordering by marking instructions as serializing:
flags[IsSerializeBefore] = 1; // Wait for all prior instructions to complete
flags[IsSerializeAfter] = 1; // Block all subsequent instructions until done
This is correct but gives poor performance since no operations can overlap.
Approach 2: Non-Speculative + Barriers + Dependency Register (Recommended)
A better approach combines three mechanisms:
- IsNonSpeculative: Prevents the instruction from executing on a mispredicted path. This ensures we never modify accelerator state speculatively.
- Read/Write Barriers: IsReadBarrier and IsWriteBarrier prevent the memory dependence unit from reordering this instruction past other memory operations.
- Hidden Dependency Register: Make each M4 micro-op read and write the same micro-register. This creates a RAW (read-after-write) dependency chain that forces in-order execution:
LOADA:    reads dep_reg, writes dep_reg
LOADB:    reads dep_reg, writes dep_reg → must wait for LOADA
STOREOUT: reads dep_reg, writes dep_reg → must wait for LOADB
Approach 3: Queueing with Careful Ordering
The queueing mechanism naturally handles A/B pairing - computation happens when both are available. But you still need to ensure:
- Tiles are enqueued in program order
- STOREOUT waits for all preceding loads to complete
The flags from Approach 2 achieve this.
Inline Assembly Background
Now that we’ve implemented the M4 instructions in gem5, we need a way to use them in test programs. Since standard assemblers (like GCC’s as) don’t know our custom opcodes, we encode instructions as raw bytes:
asm volatile (".byte 0xDA, 0xE0" : : : "memory"); // LOADA with RAX
How the Bytes Are Computed
For LOADA using RAX as the pointer register:
- Opcode: 0xDA
- ModRM byte: We need mod=3, reg=4, rm=0 (RAX)
- ModRM = (mod << 6) | (reg << 3) | rm = (3 << 6) | (4 << 3) | 0 = 0xE0
Bits: 7 6 | 5 4 3 | 2 1 0
mod | reg | rm
1 1| 1 0 0| 0 0 0 = 0xE0
So the instruction bytes are: 0xDA, 0xE0
The rm values for common registers: RAX=0, RCX=1, RDX=2, RBX=3, RSP=4, RBP=5, RSI=6, RDI=7. Use the formula above to compute the ModRM byte for any register/instruction combination.
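The formula can be wrapped in a tiny constexpr helper to generate (and double-check) the encoding for any register/instruction pair. This is a sketch for verification on the host; the helper and constant names are illustrative:

```cpp
#include <cassert>
#include <cstdint>

// ModRM = (mod << 6) | (reg << 3) | rm
constexpr uint8_t modrm(uint8_t mod, uint8_t reg, uint8_t rm) {
    return uint8_t((mod << 6) | (reg << 3) | rm);
}

// rm values for the registers used in the wrappers below.
constexpr uint8_t RAX = 0, RDX = 2, RSI = 6;

// M4 encodings from the decoder table: mod=3 always; reg selects the op.
constexpr uint8_t loada_modrm    = modrm(3, 4, RAX); // follows opcode 0xDA
constexpr uint8_t loadb_modrm    = modrm(3, 1, RSI); // follows opcode 0xDD
constexpr uint8_t storeout_modrm = modrm(3, 6, RDX); // follows opcode 0xDD
```

Evaluating these reproduces the bytes used in the inline-assembly wrappers: 0xE0, 0xCE, and 0xF2.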
Inline Assembly Syntax
GCC inline assembly has this format:
asm volatile ("assembly" : outputs : inputs : clobbers);
- volatile: Prevents the compiler from optimizing away or reordering this asm block.
- assembly: The actual assembly code or raw bytes.
- outputs: Variables written by the asm (we have none for these instructions).
- inputs: Variables read by the asm. We use "r"(ptr) to put ptr in any register.
- clobbers: Registers or memory modified. We use "memory" to tell the compiler this asm accesses memory.
Example using register constraints:
static inline void m4_loada(const float *ptr)
{
asm volatile (".byte 0xDA, 0xE0" : : "a"(ptr) : "memory"); // "a" = RAX
}
static inline void m4_loadb(const float *ptr)
{
asm volatile (".byte 0xDD, 0xCE" : : "S"(ptr) : "memory"); // "S" = RSI
}
static inline void m4_storeout(float *ptr)
{
asm volatile (".byte 0xDD, 0xF2" : : "d"(ptr) : "memory"); // "d" = RDX
}
GCC register constraints: "a" = RAX, "b" = RBX, "c" = RCX, "d" = RDX, "S" = RSI, "D" = RDI.
Note: We only clobber "memory" because the compiler doesn’t know about our internal dependency register—that’s handled entirely at the gem5 level.
Methodology
Provided Test (m4_test.c)
Here’s a test to verify basic functionality. Save as m4_test.c:
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/**
* Reference implementation of 4x4 matrix multiply-accumulate.
* Computes: Out += A * B
* All matrices are 4x4 stored in row-major order (16 floats each).
*/
static void gemm_ref_4x4(const float *A, const float *B, float *Out)
{
for (int i = 0; i < 4; ++i) {
for (int j = 0; j < 4; ++j) {
float acc = Out[i * 4 + j];
for (int k = 0; k < 4; ++k)
acc += A[i * 4 + k] * B[k * 4 + j];
Out[i * 4 + j] = acc;
}
}
}
/**
* LOADA: Load a 4x4 tile into the A queue.
* Encoding: opcode=0xDA, ModRM=0xE0 (reg=4, mod=3, rm=0 for RAX)
*/
static inline void m4_loada(const float *ptr)
{
asm volatile (".byte 0xDA, 0xE0" : : "a"(ptr) : "memory");
}
/**
* LOADB: Load a 4x4 tile into the B queue.
* When both A and B queues have tiles, hardware computes Out += A * B.
* Encoding: opcode=0xDD, ModRM=0xCE (reg=1, mod=3, rm=6 for RSI)
*/
static inline void m4_loadb(const float *ptr)
{
asm volatile (".byte 0xDD, 0xCE" : : "S"(ptr) : "memory");
}
/**
* STOREOUT: Store the Out accumulator to memory and clear it.
* Encoding: opcode=0xDD, ModRM=0xF2 (reg=6, mod=3, rm=2 for RDX)
*/
static inline void m4_storeout(float *ptr)
{
asm volatile (".byte 0xDD, 0xF2" : : "d"(ptr) : "memory");
}
int main(void)
{
/* Allocate 64-byte aligned memory for tiles (required by M4).
* Note: aligned_alloc requires size to be a multiple of alignment. */
const size_t bytes = 16 * sizeof(float); // 4x4 * 4 bytes = 64 bytes
float *A = (float *)aligned_alloc(64, bytes);
float *B = (float *)aligned_alloc(64, bytes);
float *Out = (float *)aligned_alloc(64, bytes);
float *OutRef = (float *)aligned_alloc(64, bytes);
if (!A || !B || !Out || !OutRef) {
fprintf(stderr, "aligned_alloc failed\n");
return 1;
}
/* Initialize test data */
for (int i = 0; i < 16; ++i) {
A[i] = (float)(i % 5) * 0.5f; // Values: 0, 0.5, 1, 1.5, 2, 0, ...
B[i] = (float)(i % 7) * 1.25f; // Values: 0, 1.25, 2.5, ...
Out[i] = 0.0f; // Start with zeros
OutRef[i] = 0.0f;
}
/* Execute M4 instructions:
* 1. LOADA enqueues tile A
* 2. LOADB enqueues tile B, triggers computation: Out += A * B
* 3. STOREOUT writes Out to memory and clears it
*/
m4_loada(A);
m4_loadb(B);
m4_storeout(Out);
/* Compute reference result */
gemm_ref_4x4(A, B, OutRef);
/* Compare results */
int errors = 0;
for (int i = 0; i < 16; ++i) {
float diff = Out[i] - OutRef[i];
if (diff < -1e-3f || diff > 1e-3f) {
errors++;
if (errors < 5) {
fprintf(stderr, "mismatch at [%d]: got %f, expected %f\n",
i, Out[i], OutRef[i]);
}
}
}
printf("errors=%d\n", errors);
free(A);
free(B);
free(Out);
free(OutRef);
return errors ? 1 : 0;
}
Build and Run
# Compile the test
gcc -O3 -o m4_test m4_test.c
# Run in gem5 with O3CPU
./build/X86/gem5.opt -re --outdir m5out/m4_test \
configs/deprecated/example/se.py --cpu-type=O3CPU --caches \
--cmd=./m4_test
If your implementation is correct, you should see errors=0 in the output.
Evaluation
Benchmark (you write this)
Write a tiled GEMM benchmark (don’t forget to use tile‑major layout where each 4x4 block is contiguous in memory).
Metrics: Use ops/cycle. For GEMM, ops = 2 × M × N × K (counting each multiply and add separately).
For a 4×4 tile operation: ops = 2 × 4 × 4 × 4 = 128 ops per GEMM.
The M4 accelerator performs one complete 4×4 GEMM per operation. If the accelerator could execute one GEMM every cycle with no memory latency, the theoretical peak would be 128 ops/cycle. In practice, you’ll see much lower numbers due to:
- Memory latency for loading tiles
- Serialization from dependence handling
- Pipeline stalls from non-speculative execution
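The metric arithmetic above is easy to fold into your benchmark harness. A small sketch (the stat name system.cpu.numCycles is the usual O3 cycle counter, but verify the exact name in your stats.txt):

```cpp
#include <cassert>
#include <cstdint>

// GEMM operation count: one multiply + one add per inner-product term.
constexpr uint64_t gemmOps(uint64_t M, uint64_t N, uint64_t K) {
    return 2 * M * N * K;
}

// ops/cycle for a kernel region, given a cycle count read from stats.txt
// (e.g. system.cpu.numCycles for the ROI between reset and dump).
constexpr double opsPerCycle(uint64_t ops, uint64_t cycles) {
    return double(ops) / double(cycles);
}
```

For example, a single 4×4 tile GEMM is gemmOps(4, 4, 4) = 128 ops, and a full 64×64×64 GEMM is 524,288 ops; divide by the ROI cycle count from stats.txt to get the reported metric.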
Please report ops/cycle for both the baseline vectorized code compiled with -O3 and your M4 version. Graphs are highly appreciated; consider trying several different matrix sizes so we can see the trend. Please explain what you think your bottleneck is, and try to back it up with statistics, analysis, and reasoning.
Baseline Build Flags (vectorized)
For a fair SW baseline, compile with -O3 and ensure the compiler emits vector x86 ops. Use a conservative GCC flag set that gem5 supports:
gcc -O3 -ftree-vectorize -msse4.2 -mfpmath=sse -fopt-info-vec -o m4_bench_sw ...
Verify vectorization with -fopt-info-vec output and/or by inspecting the
binary:
objdump -d -M intel m4_bench_sw | grep -E "xmm|ymm|vadd|vmul"
Measuring Kernel‑Only Time (ROI)
Use gem5’s stats reset/dump around the kernel:
- Insert m5_reset_stats(0, 0) right before the compute loop.
- Insert m5_dump_stats(0, 0) right after the compute loop.
This makes stats.txt contain only the kernel region.
To call these functions you must link against libm5.a and include gem5/m5ops.h.
Minimal steps: 1) Build the m5 library:
cd util/m5
scons build/x86/out/m5
2) Include the header in your benchmark:
#include <gem5/m5ops.h>
3) Link against the library when compiling:
gcc -O3 -I include m4_bench.c util/m5/build/x86/out/libm5.a -o m4_bench
(There’s also m5_dump_reset_stats(0, 0), which dumps and resets in one call.)
What to Hand In
1) Report PDF (microarchitecture decisions, gem5 changes, correctness, performance)
2) Patch file (git diff is fine)
3) Benchmark source code (tar.gz)
Tips
- If TimingSimpleCPU works but O3 fails → you’re missing dependence handling or non-speculative flags.
- If results are wrong, check alignment and tile layout first.
- Start simple: get one LOADA→LOADB→STOREOUT sequence working before writing the full benchmark.
Common Pitfalls
- Forgetting IsNonSpeculative: Without this, instructions may execute on mispredicted paths and corrupt accelerator state.
- Missing dependency register: Without the RAW chain, O3 may reorder your M4 ops, causing STOREOUT to execute before loads complete.
- Queue ordering bugs: If using queues, make sure tiles are enqueued in program order. A LOADB completing before its corresponding LOADA can cause mismatched A/B pairs.
- Not handling both CPU paths: Remember to implement both execute() (for atomic) and initiateAcc()/completeAcc() (for timing).
Debugging Tips (gem5)
1) Add debug prints in your micro-ops
gem5 uses DPRINTF(Category, ...) for debug logging. Add a custom category for your M4 ops.
Example (in m4_microop.cc):
#include "debug/M4Accel.hh" // Generated from DebugFlag declaration
// In your completeAcc:
DPRINTF(M4Accel, "LOADA complete: addr=%#x, aQueue size=%d\n",
pkt->getAddr(), spad.aQueue.size());
To declare the debug flag, add to src/arch/x86/insts/SConscript:
DebugFlag('M4Accel')
2) Enable debug output when running gem5
Run gem5 with --debug-flags:
./build/X86/gem5.opt --debug-flags=M4Accel,Exec --debug-file=debug.txt \
configs/deprecated/example/se.py --cpu-type=O3CPU --caches --cmd=./m4_test
This writes debug output to debug.txt. The Exec flag shows each instruction as it executes.
3) Best practices
- Start with a small test (single 4x4) and print only a few events while debugging.
- Print addresses, queue sizes, and Out sums—not entire tiles.
- Use one debug flag so you can turn it on/off quickly.
- Turn off debug output for performance runs (it’s slow!).