Homework 2 -- M4 Accelerator (Magical Matrix Multiplication Machine)
Due Date: 11:59pm Tuesday, Feb. 17, 2026
In this assignment, you will implement “M4”, a small tile accelerator, in gem5 by defining new x86 instructions, decoding them into macro‑ops, and implementing micro‑ops that do timing memory accesses. Validate correctness with a provided test, then build your own benchmark.
Goals:
- Read gem5’s x86 decoder and add new instruction encodings.
- Understand macro‑ops vs micro‑ops in gem5.
- Implement timing‑accurate micro‑ops in C++.
- Enforce dependence ordering for accelerator state.
M4 Accelerator Concept
The M4 (Magical Matrix Multiplication Machine) is a simple tile accelerator that performs 4×4 matrix multiply-accumulate operations.
Hardware State
The accelerator maintains three pieces of internal state:
- A queue: A FIFO that holds 4×4 input tiles for the left operand
- B queue: A FIFO that holds 4×4 input tiles for the right operand
- Out: A 4×4 accumulator matrix for the result
This state exists outside the CPU’s register file, which means gem5’s normal speculation recovery mechanisms don’t know about it. You must handle this carefully (see Dependence Handling).
Instructions
You will implement four instructions to interact with this state:
- LOADA: Load a 4×4 tile from memory and push it onto the A queue.
- LOADB: Load a 4×4 tile from memory and push it onto the B queue. If both queues now have data, pop one tile from each and compute Out += A * B.
- LOADOUT: Load a 4×4 tile from memory into Out (useful for resuming a partial computation).
- STOREOUT: Store Out to memory and reset it to zeros.
Operation
When both the A and B queues are non-empty, the hardware automatically pops one tile from each and performs a matrix multiply-accumulate:
\[\text{Out}[i][j] \mathrel{+}= \sum_{k=0}^{3} A[i][k] \cdot B[k][j] \quad \text{for } i,j \in \{0,1,2,3\}\]
This check happens after each LOADA or LOADB. If Out hasn’t been initialized (no prior LOADOUT), it starts as zeros.
Queue Semantics
- Pairing: When both queues are non-empty, one tile is popped from each and computation occurs.
- Example: LOADA, LOADA, LOADB → the first A pairs with B (compute happens); the second A remains queued.
- Queue depth: Unlimited for this assignment (use std::deque). In a real design, queue depth would be bounded—well-structured programs should keep the number of unmatched tiles small.
- STOREOUT requirement: Both queues must be empty (all tiles paired) at the time of STOREOUT.
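The pairing rule above can be modeled in a few lines of C++. This is a software sketch of the queue semantics only (not gem5 code); the M4Model and Tile names are illustrative:

```cpp
#include <array>
#include <cassert>
#include <deque>

using Tile = std::array<float, 16>; // one 4x4 tile, row-major

struct M4Model {
    std::deque<Tile> aQueue, bQueue;
    Tile out{}; // zero-initialized accumulator

    // Called after every LOADA/LOADB: pair tiles while both queues have data.
    void tryCompute() {
        while (!aQueue.empty() && !bQueue.empty()) {
            Tile a = aQueue.front(); aQueue.pop_front();
            Tile b = bQueue.front(); bQueue.pop_front();
            for (int i = 0; i < 4; ++i)
                for (int j = 0; j < 4; ++j)
                    for (int k = 0; k < 4; ++k)
                        out[i * 4 + j] += a[i * 4 + k] * b[k * 4 + j];
        }
    }
    void loada(const Tile &t) { aQueue.push_back(t); tryCompute(); }
    void loadb(const Tile &t) { bQueue.push_back(t); tryCompute(); }
};
```

For the LOADA, LOADA, LOADB sequence from the example, this model pairs the first A with B and leaves one A tile queued.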
Memory Layout
Tiles must be stored in tile-major format so each 4×4 block is contiguous in memory:
float A[MBlocks][KBlocks][4][4]; // A[m][k] is a contiguous 64-byte tile
Within each tile, elements are in row-major order.
ISA Specification
Constraints
- Tile size = 4×4 FP32 (64 bytes)
- Tile memory must be 64-byte aligned
- Out is zero-initialized on first use (unless LOADOUT was called)
Example Usage
float A[MBlocks][KBlocks][4][4];
float B[KBlocks][NBlocks][4][4];
float C[MBlocks][NBlocks][4][4];
...
LOADA A[0][0] // enqueue first A tile
LOADB B[0][0] // enqueue first B tile → triggers Out += A[0][0] * B[0][0]
LOADA A[0][1] // enqueue second A tile
LOADB B[1][0] // enqueue second B tile → triggers Out += A[0][1] * B[1][0]
STOREOUT C[0][0] // write result, clear Out
You have freedom to do what you want
Everything after this point is notes on how you can implement the microarchitecture of this ISA, hopefully correctly, in gem5. You may ignore it and go off on your own if you like, implementing things your own way. Either way, you should describe your microarchitecture in the report.
With great freedom comes great fun. : )
Where Things Live in gem5
- Decoder: src/arch/x86/isa/decoder/x87.isa
- Macro‑ops: src/arch/x86/isa/insts/x87/m4.py
- Micro‑op declarations: src/arch/x86/isa/microops/m4.isa
- Micro‑op C++: src/arch/x86/insts/m4_microop.hh/.cc
Useful Includes in m4_microop.cc
Students often miss these headers when implementing the C++ micro‑ops. The following includes are typically required:
#include "arch/x86/memhelpers.hh" // initiateMemRead / initiateMemWrite helpers
#include "arch/x86/regs/int.hh" // int_reg::MicroBegin for dep reg selection
#include "base/logging.hh" // panic(), DPRINTF (if you add debug)
#include "cpu/thread_context.hh" // ThreadContext accessors
#include "enums/MemoryMode.hh" // enums::timing vs enums::atomic check
#include "sim/system.hh" // system->getMemoryMode()
Step 1: Decoder (x87 escape space)
Open src/arch/x86/isa/decoder/x87.isa. All x87 opcodes are 0xD8–0xDF.
Understanding the x87.isa Decoder File
The x87.isa file defines how x86 floating-point instructions are decoded. Let’s walk through how to read it.
Structure Overview
The file is organized as a nested decode tree. Each level decodes a different field of the instruction:
format WarnUnimpl {
0x1B: decode OPCODE_OP_BOTTOM3 default Inst::UD2() {
0x0: decode MODRM_REG {
...
}
0x1: decode MODRM_REG {
...
}
// etc.
}
}
Key fields being decoded:
- OPCODE_OP_BOTTOM3: The bottom 3 bits of the opcode byte. Since x87 opcodes are 0xD8–0xDF, this value ranges from 0–7: 0x0 → opcode 0xD8, 0x1 → opcode 0xD9, 0x2 → opcode 0xDA, etc.
- MODRM_REG: The 3-bit reg field from the ModRM byte (bits 5:3). This is often used as an opcode extension for x87 instructions.
- MODRM_MOD: The 2-bit mod field from the ModRM byte (bits 7:6). When mod=3, the instruction operates on registers; otherwise it operates on memory.
- MODRM_RM: The 3-bit rm field from the ModRM byte (bits 2:0). When mod=3, this selects which register to use.
Reading an Entry
Let’s decode this entry:
0x2: decode MODRM_REG {
...
0x4: decode MODRM_MOD {
0x3: Inst::LOADA(Rq);
default: fisub();
}
...
}
This means:
- 0x2 → We’re in opcode 0xDA (since 0xDA & 0x7 = 2)
- MODRM_REG = 0x4 → The reg field of ModRM is 4
- MODRM_MOD = 0x3 → The mod field is 3 (register mode)
- Inst::LOADA(Rq) → Decode as our LOADA instruction. Rq means the operand is a 64-bit integer register selected by ModRM.rm (plus the rex.b extension for r8–r15).
When mod != 3, it falls through to fisub() (the original x87 instruction we’re “borrowing” encoding space from).
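The field extraction the decoder performs can be sanity-checked with a few lines of plain C++ (a standalone sketch mirroring the bit fields, not gem5 code):

```cpp
#include <cassert>
#include <cstdint>

struct ModRMFields { uint8_t mod, reg, rm; };

// The bottom 3 bits of the opcode byte select the 0xD8-0xDF sub-table.
inline uint8_t opcodeBottom3(uint8_t opcode) { return opcode & 0x7; }

// Split a ModRM byte into mod (7:6), reg (5:3), and rm (2:0).
inline ModRMFields decodeModRM(uint8_t byte) {
    return { uint8_t(byte >> 6), uint8_t((byte >> 3) & 0x7),
             uint8_t(byte & 0x7) };
}
```

Running the LOADA encoding bytes 0xDA, 0xE0 through this gives exactly the table entry above: sub-table 0x2, reg 4, mod 3, rm 0 (RAX).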
Decoder Mapping for M4 Instructions
Add these entries to x87.isa:
| Instruction | Opcode | reg | mod | Decoder Entry |
|---|---|---|---|---|
| LOADA | 0xDA | 4 | 3 | Inst::LOADA(Rq) |
| LOADOUT | 0xDA | 7 | 3 | Inst::LOADOUT(Rq) |
| LOADB | 0xDD | 1 | 3 | Inst::LOADB(Rq) |
| STOREOUT | 0xDD | 6 | 3 | Inst::STOREOUT(Rq) |
For example, to add LOADA, find the 0x2: block (opcode 0xDA), then find or add 0x4: decode MODRM_MOD, and add the 0x3: case:
0x4: decode MODRM_MOD {
0x3: Inst::LOADA(Rq);
default: fisub();
}
Step 2: Macro‑ops
Macro-ops define the high-level instruction that the decoder produces. Each macro-op expands into one or more micro-ops.
Create src/arch/x86/isa/insts/x87/m4.py:
microcode = """
def macroop LOADA_R
{
.adjust_env oszIn64Override
m4loada
};
"""
TODO: Add LOADOUT_R, LOADB_R, and STOREOUT_R following the same pattern.
What this does:
- Each macroop defines an instruction variant. The _R suffix indicates it takes a register operand.
- .adjust_env oszIn64Override ensures proper 64-bit operand handling.
- The body contains a micro-op name (m4loada) that we’ll define next.
For reference, look at data_transfer_and_conversion/load_or_store_floating_point.py which defines similar load/store macroops like FLD_R and FST_R.
Next, add "m4" to the categories list in src/arch/x86/isa/insts/x87/__init__.py.
Step 3: Micro‑op declarations
Micro-ops are the lowest level of instruction in gem5. They map to C++ classes that implement the actual behavior.
Create src/arch/x86/isa/microops/m4.isa. Here’s the first micro-op as an example:
let {{
# Define the m4loada micro-op
# This class tells gem5 how to instantiate our C++ micro-op
class M4LoadA(X86Microop):
# Constructor - no arguments needed since the register comes from ModRM
def __init__(self):
pass
# This method returns C++ code that creates the micro-op object
def getAllocator(self, microFlags):
# Arguments to the C++ constructor:
# machInst: the machine instruction (contains ModRM, etc.)
# macrocodeBlock: name of the containing macroop
# microFlags: flags like IsLastMicroop, IsFirstMicroop
# 0: memory request flags (none needed)
return "new X86ISA::M4LoadAMicroop(machInst, macrocodeBlock, %s, 0)" % \
self.microFlagsText(microFlags)
# Register this class under the name "m4loada"
microopClasses["m4loada"] = M4LoadA
# TODO: Define M4LoadOut, M4LoadB, and M4StoreOut following the same pattern.
# Each maps a micro-op name (e.g., "m4loadb") to a C++ class (e.g., M4LoadBMicroop)
}};
Include this file from src/arch/x86/isa/microops/microops.isa by adding:
##include "m4.isa"
In src/arch/x86/isa/includes.isa, add this include for the C++ declarations:
#include "arch/x86/insts/m4_microop.hh"
Step 4: Micro‑op implementation (C++)
Create src/arch/x86/insts/m4_microop.hh and src/arch/x86/insts/m4_microop.cc.
Also add the new source file to the x86 build list:
src/arch/x86/SConscript
Source('insts/m4_microop.cc', tags=['x86 isa'])
Accelerator State
You’ll need to track the accelerator state as execution progresses. At minimum you need the following; below is an example (not strictly required):
- State for A and B tiles that have been loaded (e.g. as queues, as shown below – but feel free to use your own approach)
- The running accumulator (Out)
struct M4State {
alignas(64) float Out[4][4]; // The output accumulator
bool outValid = false; // Has Out been initialized?
std::deque<std::array<float,16>> aQueue; // Queue of loaded A tiles
std::deque<std::array<float,16>> bQueue; // Queue of loaded B tiles
};
For this single-threaded assignment, you can simply use a global variable:
M4State state;
Note: To support multi-threaded simulation, you could extend this by using a std::unordered_map<ContextID, M4State> to give each thread its own accelerator state.
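A minimal sketch of that per-context extension might look like the following. This is an assumption-laden illustration (gem5's ContextID is stood in by int, and stateFor is a hypothetical helper a micro-op could call with xc->contextId()):

```cpp
#include <array>
#include <cassert>
#include <deque>
#include <unordered_map>

// Same state struct as above.
struct M4State {
    alignas(64) float Out[4][4] = {};         // The output accumulator
    bool outValid = false;                     // Has Out been initialized?
    std::deque<std::array<float, 16>> aQueue;  // Queue of loaded A tiles
    std::deque<std::array<float, 16>> bQueue;  // Queue of loaded B tiles
};

// One accelerator state per hardware context.
std::unordered_map<int, M4State> m4States;

// Fetch-or-create the state for a context. operator[] default-constructs
// a fresh M4State the first time a context is seen.
inline M4State &stateFor(int ctxId) { return m4States[ctxId]; }
```

Each context then sees its own queues and accumulator, so two simulated threads cannot corrupt each other's partial results.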
MicroOp Classes
Defining M4LoadAMicroop
Start by defining the class in the correct namespace in both the header and
.cc file. Here’s a complete (minimal) class definition you can drop into
m4_microop.hh and then fill in the methods in m4_microop.cc:
namespace gem5 {
namespace X86ISA {
class M4LoadAMicroop : public X86MicroopBase
{
private:
static constexpr int NumSrcRegs = 1;
static constexpr int NumDestRegs = 0;
RegId m4SrcRegIdx[NumSrcRegs];
RegId m4DestRegIdx[NumDestRegs];
Request::FlagsType memFlags;
public:
M4LoadAMicroop(ExtMachInst machInst, const char *inst_mnem,
uint64_t setFlags, Request::FlagsType mem_flags);
Fault execute(ExecContext *xc, trace::InstRecord *traceData) const override;
Fault initiateAcc(ExecContext *xc,
trace::InstRecord *traceData) const override;
Fault completeAcc(PacketPtr pkt, ExecContext *xc,
trace::InstRecord *traceData) const override;
};
} // namespace X86ISA
} // namespace gem5
The generated decoder/micro‑op glue refers to X86ISA::M4LoadAMicroop, etc.,
so the namespace must match.
What is memFlags?
memFlags is a bitmask (type Request::FlagsType) carried with memory requests.
In the micro‑op declaration above, we pass 0 (no special flags). You can just
forward that to the memory helpers.
Constructor and Flags
Every micro-op constructor must set up the instruction’s register dependencies and flags. Let’s build this up step by step.
Basic Constructor (without dependency handling)
First, here’s a minimal constructor that just reads the pointer register:
M4LoadAMicroop::M4LoadAMicroop(ExtMachInst machInst,
const char *inst_mnem, uint64_t setFlags,
Request::FlagsType mem_flags) :
X86MicroopBase(machInst, "m4loada", inst_mnem, setFlags, MemReadOp),
memFlags(mem_flags)
{
// Extract the register number from ModRM.rm field
// The rex.b bit extends it to 4 bits for r8-r15 access
const uint8_t rm = machInst.modRM.rm | (machInst.rex.b << 3);
const RegId ptr_reg(intRegClass, rm); // The register holding our pointer
// Tell the base class where our register arrays are
setRegIdxArrays(
reinterpret_cast<RegIdArrayPtr>(&M4LoadAMicroop::m4SrcRegIdx),
reinterpret_cast<RegIdArrayPtr>(&M4LoadAMicroop::m4DestRegIdx));
// We only read one register: the pointer
_numSrcRegs = 1;
_numDestRegs = 0;
m4SrcRegIdx[0] = ptr_reg;
// This is a load operation
flags[IsLoad] = 1;
// Force serialization - simple but slow
flags[IsSerializeBefore] = 1;
flags[IsSerializeAfter] = 1;
}
This will work correctly but has poor performance because every M4 instruction must wait for all prior instructions to complete.
Adding Dependency Handling (Better Performance)
Instead of full serialization, we can use a hidden dependency register that chains only M4 ops together, allowing other instructions to proceed in parallel:
M4LoadAMicroop::M4LoadAMicroop(ExtMachInst machInst,
const char *inst_mnem, uint64_t setFlags,
Request::FlagsType mem_flags) :
X86MicroopBase(machInst, "m4loada", inst_mnem, setFlags, MemReadOp),
memFlags(mem_flags)
{
const uint8_t rm = machInst.modRM.rm | (machInst.rex.b << 3);
const RegId ptr_reg(intRegClass, rm);
// Hidden micro-register for dependency tracking between M4 ops
constexpr int DepRegIdx = int_reg::MicroBegin + 2;
const RegId dep_reg(intRegClass, DepRegIdx);
setRegIdxArrays(
reinterpret_cast<RegIdArrayPtr>(&M4LoadAMicroop::m4SrcRegIdx),
reinterpret_cast<RegIdArrayPtr>(&M4LoadAMicroop::m4DestRegIdx));
// Now we read TWO registers and write ONE
_numSrcRegs = 2;
_numDestRegs = 1;
_numTypedDestRegs[IntRegClass] = 1;
m4SrcRegIdx[0] = dep_reg; // Read dependency register (wait for previous M4 op)
m4SrcRegIdx[1] = ptr_reg; // Read pointer register
m4DestRegIdx[0] = dep_reg; // Write dependency register (next M4 op waits for us)
flags[IsLoad] = 1;
flags[IsNonSpeculative] = 1; // Don't execute speculatively
flags[IsReadBarrier] = 1; // Order with respect to earlier memory ops
flags[IsWriteBarrier] = 1;
}
Key additions for dependency handling:
- dep_reg: A hidden micro-register that all M4 ops read and write, creating a RAW chain.
- IsNonSpeculative: Prevents execution on mispredicted paths.
- IsReadBarrier / IsWriteBarrier: Prevents memory reordering around this instruction.
Implementing the Execution Behavior
Each micro-op class derives from X86MicroopBase. You need to implement:
- execute(...): Fallback path for atomic/simple CPU models. Does the complete operation.
- initiateAcc(...): Starts a timing memory access (sends request to memory).
- completeAcc(...): Called when memory responds; processes the data.
Understanding ExecContext
ExecContext is the interface that micro-ops use to interact with the CPU during execution. It provides methods to:
- Read and write register values
- Perform memory operations
- Access the thread context for simulation state
Register access:
// Read a register operand (returns RegVal, which is uint64_t)
// The index corresponds to your src-reg array order in the constructor
RegVal addr = xc->getRegOperand(this, 1); // e.g., m4SrcRegIdx[1] = ptr_reg
// Write a register operand
// The index corresponds to your dest-reg array order
xc->setRegOperand(this, 0, RegVal(0)); // e.g., m4DestRegIdx[0] = dep_reg
Memory operations:
std::vector<bool> byte_enable(64, true); // 64 bytes for a 4x4 FP32 tile
// Reading memory:
std::array<uint8_t, 64> buf;
Fault fault = xc->readMem(addr, buf.data(), 64, memFlags, byte_enable); // atomic
Fault fault = initiateMemRead(xc, traceData, addr, 64, memFlags); // timing
// Writing memory (same call works for both atomic and timing):
Fault fault = xc->writeMem(data_ptr, 64, addr, memFlags, nullptr, byte_enable);
Timing Memory Accesses
When you mark an instruction with IsLoad or IsStore:
- O3 CPU (timing mode): The instruction gets an LSQ (Load-Store Queue) entry. Instead of calling execute(), the O3 pipeline routes it through the LSQ, which calls:
  - initiateAcc(): Submits the memory request to the cache/memory system.
  - completeAcc(): Called asynchronously when the memory response arrives.
- Simple CPUs (atomic mode): The execute() method is called directly, and memory operations happen synchronously.
Note: For reads, use readMem() in execute() and initiateMemRead() in initiateAcc(). For writes, writeMem() handles both modes — it performs an atomic write in execute() and initiates a timing write in initiateAcc().
You need to implement both paths:
Fault
M4LoadAMicroop::execute(ExecContext *xc, trace::InstRecord *traceData) const
{
// Atomic path (SimpleCPU) - do everything here
const Addr addr = xc->getRegOperand(this, 1);
std::array<uint8_t, 64> buf;
std::vector<bool> byte_enable(64, true);
Fault fault = xc->readMem(addr, buf.data(), 64, memFlags, byte_enable);
if (fault != NoFault)
return fault;
// TODO: Copy buf into accelerator state, trigger computation if ready
// Write dep_reg to create RAW dependency edge (value doesn't matter)
xc->setRegOperand(this, 0, xc->getRegOperand(this, 0));
return NoFault;
}
Fault
M4LoadAMicroop::initiateAcc(ExecContext *xc, trace::InstRecord *traceData) const
{
// Timing path - just start the memory request
const Addr addr = xc->getRegOperand(this, 1);
return initiateMemRead(xc, traceData, addr, 64, memFlags);
}
Fault
M4LoadAMicroop::completeAcc(PacketPtr pkt, ExecContext *xc,
trace::InstRecord *traceData) const
{
// Timing path - memory response arrived
const uint8_t *data = pkt->getConstPtr<uint8_t>();
// TODO: Copy data into accelerator state, trigger computation if ready
// Write dep_reg to create RAW dependency edge (value doesn't matter)
xc->setRegOperand(this, 0, xc->getRegOperand(this, 0));
return NoFault;
}
Dependence Handling (Critical)
The accelerator state (A/B queues, Out accumulator) is not in architectural registers. gem5 doesn’t automatically track dependencies on this state, so you must enforce ordering yourself.
The key correctness requirement: when you execute STOREOUT, the value stored must be the result of all preceding LOADA/LOADB pairs accumulated into Out.
Why This Is Tricky
By marking our instructions with IsLoad/IsStore, we get LSQ (Load-Store Queue) entries and go through the O3 CPU’s memory pipeline. This is necessary to use the memory system, but it also means our instructions participate in the CPU’s speculation machinery:
- The LSQ can speculatively execute loads before earlier stores complete
- The memory dependence predictor may reorder our operations
- Branch mispredictions could cause our instructions to execute on wrong paths
Our accelerator state lives outside the CPU’s register file, so the normal squash/recovery mechanisms don’t know how to undo changes to it. The approaches below are workarounds to disable speculation for M4 instructions while still using the memory system.
Note: A cleaner architectural approach would be to give the accelerator its own memory port, completely bypassing the CPU’s LSQ. But that’s more complex to implement and beyond the scope of this assignment… unless you’d rather go that way, which is fine as well.
Approach 1: Serialize All Operations
The simplest approach is to prevent any reordering by marking instructions as serializing:
flags[IsSerializeBefore] = 1; // Wait for all prior instructions to complete
flags[IsSerializeAfter] = 1; // Block all subsequent instructions until done
This is correct but gives poor performance since no operations can overlap.
Approach 2: Non-Speculative + Barriers + Dependency Register (Recommended)
A better approach combines three mechanisms:
- IsNonSpeculative: Prevents the instruction from executing on a mispredicted path. This ensures we never modify accelerator state speculatively.
- Read/Write Barriers: IsReadBarrier and IsWriteBarrier prevent the memory dependence unit from reordering this instruction past other memory operations.
- Hidden Dependency Register: Make each M4 micro-op read and write the same micro-register. This creates a RAW (read-after-write) dependency chain that forces in-order execution:
LOADA:    reads dep_reg, writes dep_reg
LOADB:    reads dep_reg, writes dep_reg → must wait for LOADA
STOREOUT: reads dep_reg, writes dep_reg → must wait for LOADB
Approach 3: Queueing with Careful Ordering
The queueing mechanism naturally handles A/B pairing - computation happens when both are available. But you still need to ensure:
- Tiles are enqueued in program order
- STOREOUT waits for all preceding loads to complete
The flags from Approach 2 achieve this.
Inline Assembly Background
Now that we’ve implemented the M4 instructions in gem5, we need a way to use them in test programs. Since standard assemblers (like GCC’s as) don’t know our custom opcodes, we encode instructions as raw bytes:
asm volatile (".byte 0xDA, 0xE0" : : : "memory"); // LOADA with RAX
How the Bytes Are Computed
For LOADA using RAX as the pointer register:
- Opcode: 0xDA
- ModRM byte: We need mod=3, reg=4, rm=0 (RAX)
- ModRM = (mod << 6) | (reg << 3) | rm = (3 << 6) | (4 << 3) | 0 = 0xE0
Bits: 7 6 | 5 4 3 | 2 1 0
mod | reg | rm
1 1| 1 0 0| 0 0 0 = 0xE0
So the instruction bytes are: 0xDA, 0xE0
The rm values for common registers: RAX=0, RCX=1, RDX=2, RBX=3, RSP=4, RBP=5, RSI=6, RDI=7. Use the formula above to compute the ModRM byte for any register/instruction combination.
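The formula can be wrapped in a tiny constexpr helper to generate (and double-check) the encoding for any register/instruction pair. This is a sketch for verification on the host; the helper and constant names are illustrative:

```cpp
#include <cassert>
#include <cstdint>

// ModRM = (mod << 6) | (reg << 3) | rm
constexpr uint8_t modrm(uint8_t mod, uint8_t reg, uint8_t rm) {
    return uint8_t((mod << 6) | (reg << 3) | rm);
}

// rm values for the registers used in the wrappers below.
constexpr uint8_t RAX = 0, RDX = 2, RSI = 6;

// M4 encodings from the decoder table: mod=3 always; reg selects the op.
constexpr uint8_t loada_modrm    = modrm(3, 4, RAX); // follows opcode 0xDA
constexpr uint8_t loadb_modrm    = modrm(3, 1, RSI); // follows opcode 0xDD
constexpr uint8_t storeout_modrm = modrm(3, 6, RDX); // follows opcode 0xDD
```

Evaluating these reproduces the bytes used in the inline-assembly wrappers: 0xE0, 0xCE, and 0xF2.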
Inline Assembly Syntax
GCC inline assembly has this format:
asm volatile ("assembly" : outputs : inputs : clobbers);
- volatile: Prevents the compiler from optimizing away or reordering this asm block.
- assembly: The actual assembly code or raw bytes.
- outputs: Variables written by the asm (we have none for these instructions).
- inputs: Variables read by the asm. We use "r"(ptr) to put ptr in any register.
- clobbers: Registers or memory modified. We use "memory" to tell the compiler this asm accesses memory.
Example using register constraints:
static inline void m4_loada(const float *ptr)
{
asm volatile (".byte 0xDA, 0xE0" : : "a"(ptr) : "memory"); // "a" = RAX
}
static inline void m4_loadb(const float *ptr)
{
asm volatile (".byte 0xDD, 0xCE" : : "S"(ptr) : "memory"); // "S" = RSI
}
static inline void m4_storeout(float *ptr)
{
asm volatile (".byte 0xDD, 0xF2" : : "d"(ptr) : "memory"); // "d" = RDX
}
GCC register constraints: "a" = RAX, "b" = RBX, "c" = RCX, "d" = RDX, "S" = RSI, "D" = RDI.
Note: We only clobber "memory" because the compiler doesn’t know about our internal dependency register—that’s handled entirely at the gem5 level.
Methodology
Provided Test (m4_test.c)
Here’s a test to verify basic functionality. Save as m4_test.c:
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/**
* Reference implementation of 4x4 matrix multiply-accumulate.
* Computes: Out += A * B
* All matrices are 4x4 stored in row-major order (16 floats each).
*/
static void gemm_ref_4x4(const float *A, const float *B, float *Out)
{
for (int i = 0; i < 4; ++i) {
for (int j = 0; j < 4; ++j) {
float acc = Out[i * 4 + j];
for (int k = 0; k < 4; ++k)
acc += A[i * 4 + k] * B[k * 4 + j];
Out[i * 4 + j] = acc;
}
}
}
/**
* LOADA: Load a 4x4 tile into the A queue.
* Encoding: opcode=0xDA, ModRM=0xE0 (reg=4, mod=3, rm=0 for RAX)
*/
static inline void m4_loada(const float *ptr)
{
asm volatile (".byte 0xDA, 0xE0" : : "a"(ptr) : "memory");
}
/**
* LOADB: Load a 4x4 tile into the B queue.
* When both A and B queues have tiles, hardware computes Out += A * B.
* Encoding: opcode=0xDD, ModRM=0xCE (reg=1, mod=3, rm=6 for RSI)
*/
static inline void m4_loadb(const float *ptr)
{
asm volatile (".byte 0xDD, 0xCE" : : "S"(ptr) : "memory");
}
/**
* STOREOUT: Store the Out accumulator to memory and clear it.
* Encoding: opcode=0xDD, ModRM=0xF2 (reg=6, mod=3, rm=2 for RDX)
*/
static inline void m4_storeout(float *ptr)
{
asm volatile (".byte 0xDD, 0xF2" : : "d"(ptr) : "memory");
}
int main(void)
{
/* Allocate 64-byte aligned memory for tiles (required by M4).
* Note: aligned_alloc requires size to be a multiple of alignment. */
const size_t bytes = 16 * sizeof(float); // 4x4 * 4 bytes = 64 bytes
float *A = (float *)aligned_alloc(64, bytes);
float *B = (float *)aligned_alloc(64, bytes);
float *Out = (float *)aligned_alloc(64, bytes);
float *OutRef = (float *)aligned_alloc(64, bytes);
if (!A || !B || !Out || !OutRef) {
fprintf(stderr, "aligned_alloc failed\n");
return 1;
}
/* Initialize test data */
for (int i = 0; i < 16; ++i) {
A[i] = (float)(i % 5) * 0.5f; // Values: 0, 0.5, 1, 1.5, 2, 0, ...
B[i] = (float)(i % 7) * 1.25f; // Values: 0, 1.25, 2.5, ...
Out[i] = 0.0f; // Start with zeros
OutRef[i] = 0.0f;
}
/* Execute M4 instructions:
* 1. LOADA enqueues tile A
* 2. LOADB enqueues tile B, triggers computation: Out += A * B
* 3. STOREOUT writes Out to memory and clears it
*/
m4_loada(A);
m4_loadb(B);
m4_storeout(Out);
/* Compute reference result */
gemm_ref_4x4(A, B, OutRef);
/* Compare results */
int errors = 0;
for (int i = 0; i < 16; ++i) {
float diff = Out[i] - OutRef[i];
if (diff < -1e-3f || diff > 1e-3f) {
errors++;
if (errors < 5) {
fprintf(stderr, "mismatch at [%d]: got %f, expected %f\n",
i, Out[i], OutRef[i]);
}
}
}
printf("errors=%d\n", errors);
free(A);
free(B);
free(Out);
free(OutRef);
return errors ? 1 : 0;
}
Build and Run
# Compile the test
gcc -O3 -o m4_test m4_test.c
# Run in gem5 with O3CPU
./build/X86/gem5.opt -re --outdir m5out/m4_test \
configs/deprecated/example/se.py --cpu-type=O3CPU --caches \
--cmd=./m4_test
If your implementation is correct, you should see errors=0 in the output.
Evaluation
Benchmark (you write this)
Write a tiled GEMM benchmark (don’t forget to use tile‑major layout where each 4x4 block is contiguous in memory).
Metrics: Use ops/cycle. For GEMM, ops = 2 × M × N × K (counting each multiply and add separately).
For a 4×4 tile operation: ops = 2 × 4 × 4 × 4 = 128 ops per GEMM.
The M4 accelerator performs one complete 4×4 GEMM per operation. If the accelerator could execute one GEMM every cycle with no memory latency, the theoretical peak would be 128 ops/cycle. In practice, you’ll see much lower numbers due to:
- Memory latency for loading tiles
- Serialization from dependence handling
- Pipeline stalls from non-speculative execution
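The metric arithmetic above is easy to fold into your benchmark harness. A small sketch (the stat name system.cpu.numCycles is the usual O3 cycle counter, but verify the exact name in your stats.txt):

```cpp
#include <cassert>
#include <cstdint>

// GEMM operation count: one multiply + one add per inner-product term.
constexpr uint64_t gemmOps(uint64_t M, uint64_t N, uint64_t K) {
    return 2 * M * N * K;
}

// ops/cycle for a kernel region, given a cycle count read from stats.txt
// (e.g. system.cpu.numCycles for the ROI between reset and dump).
constexpr double opsPerCycle(uint64_t ops, uint64_t cycles) {
    return double(ops) / double(cycles);
}
```

For example, a single 4×4 tile GEMM is gemmOps(4, 4, 4) = 128 ops, and a full 64×64×64 GEMM is 524,288 ops; divide by the ROI cycle count from stats.txt to get the reported metric.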
Please report ops/cycle for both the baseline vectorized code compiled with -O3 and your M4 version. Graphs are highly appreciated; consider trying several different matrix sizes so we can see the trend. Please explain what you think your bottleneck is, and try to back it up with statistics, analysis, and reasoning.
Baseline Build Flags (vectorized)
For a fair SW baseline, compile with -O3 and ensure the compiler emits vector x86 ops. Use a conservative GCC flag set that gem5 supports:
gcc -O3 -ftree-vectorize -msse4.2 -mfpmath=sse -fopt-info-vec -o m4_bench_sw ...
Verify vectorization with -fopt-info-vec output and/or by inspecting the
binary:
objdump -d -M intel m4_bench_sw | grep -E "xmm|ymm|vadd|vmul"
Measuring Kernel‑Only Time (ROI)
Use gem5’s stats reset/dump around the kernel:
- Insert m5_reset_stats(0, 0) right before the compute loop.
- Insert m5_dump_stats(0, 0) right after the compute loop.
This makes stats.txt contain only the kernel region.
To call these functions you must link against libm5.a and include gem5/m5ops.h.
Minimal steps: 1) Build the m5 library:
cd util/m5
scons build/x86/out/m5
2) Include the header in your benchmark:
#include <gem5/m5ops.h>
3) Link against the library when compiling:
gcc -O3 -I include m4_bench.c util/m5/build/x86/out/libm5.a -o m4_bench
(There’s also m5_dump_reset_stats(0, 0), which dumps and resets in one call.)
What to Hand In
1) Report PDF (microarchitecture decisions, gem5 changes, correctness, performance)
2) Patch file (git diff is fine)
3) Benchmark source code (tar.gz)
Tips
- If TimingSimpleCPU works but O3 fails → you’re missing dependence handling or non-speculative flags.
- If results are wrong, check alignment and tile layout first.
- Start simple: get one LOADA→LOADB→STOREOUT sequence working before writing the full benchmark.
Common Pitfalls
- Forgetting IsNonSpeculative: Without this, instructions may execute on mispredicted paths and corrupt accelerator state.
- Missing dependency register: Without the RAW chain, O3 may reorder your M4 ops, causing STOREOUT to execute before loads complete.
- Queue ordering bugs: If using queues, make sure tiles are enqueued in program order. A LOADB completing before its corresponding LOADA can cause mismatched A/B pairs.
- Not handling both CPU paths: Remember to implement both execute() (for atomic) and initiateAcc()/completeAcc() (for timing).
Debugging Tips (gem5)
1) Add debug prints in your micro-ops
gem5 uses DPRINTF(Category, ...) for debug logging. Add a custom category for your M4 ops.
Example (in m4_microop.cc):
#include "debug/M4Accel.hh" // Generated from DebugFlag declaration
// In your completeAcc:
DPRINTF(M4Accel, "LOADA complete: addr=%#x, aQueue size=%d\n",
pkt->getAddr(), spad.aQueue.size());
To declare the debug flag, add to src/arch/x86/insts/SConscript:
DebugFlag('M4Accel')
2) Enable debug output when running gem5
Run gem5 with --debug-flags:
./build/X86/gem5.opt --debug-flags=M4Accel,Exec --debug-file=debug.txt \
configs/deprecated/example/se.py --cpu-type=O3CPU --caches --cmd=./m4_test
This writes debug output to debug.txt. The Exec flag shows each instruction as it executes.
3) Best practices
- Start with a small test (single 4x4) and print only a few events while debugging.
- Print addresses, queue sizes, and Out sums—not entire tiles.
- Use one debug flag so you can turn it on/off quickly.
- Turn off debug output for performance runs (it’s slow!).