Simten

A RISC-V CPU That Runs C

A 5-stage pipelined processor, simulated in your browser. Write C, compile it with GCC, and watch every instruction flow through fetch, decode, execute, memory, and writeback — one clock cycle at a time.

Interactive tutorial · ~10 min read · Built with Simten

Why RISC-V?

Most CPUs are black boxes. You write code, it runs, and somewhere inside a billion transistors do… something. RISC-V changes that. It’s an open instruction set — anyone can read the spec, build a processor, and understand exactly what happens when your code executes.

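The base integer instruction set, RV32I, has just 40 instructions in the current ratified spec (older versions counted 47, before the CSR and FENCE.I instructions moved into their own extensions). That’s enough to run a C compiler, an operating system, or a web server. ARM and x86 each define hundreds to thousands of instructions, depending on how you count extensions and encoding variants. RISC-V shows you can get a long way without them.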

The CPU on this page implements all of RV32I using a classic 5-stage pipeline — the same architecture used in SiFive’s E31 and described in Patterson & Hennessy’s Computer Organization and Design, the standard computer architecture textbook. It has data forwarding, hazard detection, 64KB of instruction memory, 64KB of data RAM, and a UART for output.

To keep things understandable, we’ve left out a few things that a production core would have:

  • No branch predictor — we always flush on taken branches. A real core would predict branch outcomes to avoid the penalty.
  • No caches — we access memory directly. A production CPU would have L1 instruction and data caches to hide memory latency.
  • No interrupts or exceptions — real cores need these for I/O, timers, and error handling.
  • No CSRs — control and status registers for privilege levels, counters, and configuration.

Each of these could be added as an extension to the existing pipeline. The 5-stage structure doesn’t change — branch prediction adds logic to the Fetch stage, caches sit in front of memory, and interrupts add a new control path into Decode.


The 5-Stage Pipeline

A single-cycle CPU executes one instruction per clock cycle — fetch, decode, execute, memory, writeback, all at once. Simple, but slow. The clock can only tick as fast as the slowest instruction allows.

A pipelined CPU splits execution into stages, separated by registers. Each stage does one piece of the work, then passes its result to the next stage on the clock edge. Like an assembly line: while one instruction is being executed, the next is being decoded, and the one after that is being fetched.

The result: up to 5 instructions in flight at once. The clock can tick as fast as the slowest stage, not the slowest instruction. In the ideal case this approaches a 5x gain in throughput; stalls, flushes, and uneven stage delays keep the real gain somewhat lower.
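To put a number on the assembly-line argument, here is a minimal sketch in C. It assumes an ideal pipeline with no stalls or flushes and perfectly balanced stages; the function names are illustrative and not part of the simulator.

```c
#include <stdint.h>

/* Cycles to finish n instructions on an ideal k-stage pipeline:
 * k cycles for the first instruction to reach the last stage,
 * then one more instruction completes on every following cycle. */
uint64_t pipelined_cycles(uint64_t n, uint64_t k) {
    return n == 0 ? 0 : k + (n - 1);
}

/* The same work if every instruction occupied all k stage-times alone. */
uint64_t unpipelined_stage_times(uint64_t n, uint64_t k) {
    return n * k;
}

/* Ideal speedup; it approaches k as n grows. */
double ideal_speedup(uint64_t n, uint64_t k) {
    if (n == 0) return 1.0;
    return (double)unpipelined_stage_times(n, k) /
           (double)pipelined_cycles(n, k);
}
```

For 100 instructions on 5 stages this gives 104 cycles instead of 500 stage-times, a speedup a little under 5x.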

IF

Fetch

The program counter (PC) holds the address of the current instruction. Each cycle, the fetch stage reads the instruction at that address from instruction memory, and the PC increments by 4 (every RV32I instruction is 4 bytes).

The simplest stage, but it sets the pace for everything else. If a branch is taken later in the pipeline, the PC gets overwritten and the fetched instruction is flushed.

ID

Decode

The 32-bit instruction is split apart: opcode, source registers, destination register, immediate value, and function codes. The register file reads out the source values.

This is where the CPU figures out what operation to perform. A RISC-V instruction packs all of this into a fixed 32-bit format with the register fields always in the same bit positions, so the decode logic is mostly wire slicing plus sign extension of the immediate.
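The slicing can be sketched in C with shifts and masks. The bit positions follow the RV32I base encoding; the struct and function names are made up for this example, and only the I-type immediate is extracted to keep it short.

```c
#include <stdint.h>

/* Fields of a 32-bit RV32I instruction (names are illustrative). */
typedef struct {
    uint32_t opcode;  /* bits  6:0  */
    uint32_t rd;      /* bits 11:7  */
    uint32_t funct3;  /* bits 14:12 */
    uint32_t rs1;     /* bits 19:15 */
    uint32_t rs2;     /* bits 24:20 */
    uint32_t funct7;  /* bits 31:25 */
    int32_t  imm_i;   /* bits 31:20, sign-extended (I-type only) */
} decoded_t;

decoded_t decode(uint32_t insn) {
    decoded_t d;
    d.opcode = insn & 0x7f;
    d.rd     = (insn >> 7)  & 0x1f;
    d.funct3 = (insn >> 12) & 0x7;
    d.rs1    = (insn >> 15) & 0x1f;
    d.rs2    = (insn >> 20) & 0x1f;
    d.funct7 = (insn >> 25) & 0x7f;

    /* 12-bit immediate: take the raw field, then sign-extend by hand
     * so the result does not depend on how the compiler shifts
     * negative numbers. */
    d.imm_i = (int32_t)(insn >> 20);
    if (d.imm_i & 0x800)
        d.imm_i -= 0x1000;
    return d;
}
```

For example, 0x00C58533 is add a0, a1, a2 (rd = 10, rs1 = 11, rs2 = 12), and 0xFFF58513 is addi a0, a1, -1.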

EX

Execute

The ALU performs the operation — add, subtract, shift, compare, or compute a branch target address. This is where the actual computation happens.

The forwarding unit also lives here. If a recent instruction’s result hasn’t been written back yet, the forwarding mux grabs it directly from a later pipeline stage instead of reading a stale value from the register file.

MEM

Memory

Load and store instructions access data memory here. For non-memory instructions (add, shift, branch), this stage just passes values through.

Memory-mapped peripherals are accessed here too: a UART for serial output and a network interface for inter-CPU communication.
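A memory-mapped peripheral is just an address that does not go to RAM. The sketch below models that idea in C. The UART address is a hypothetical value chosen for illustration (the page does not state the real memory map), and the bus_t type and function names are invented for this example.

```c
#include <stdint.h>
#include <stddef.h>

#define RAM_SIZE     0x10000u     /* 64KB of data RAM, as described above */
#define UART_TX_ADDR 0x10000000u  /* hypothetical address, illustration only */

typedef struct {
    uint8_t ram[RAM_SIZE];
    char    uart_log[64];  /* bytes "sent" to the UART in this model */
    size_t  uart_len;
} bus_t;

/* Byte store as seen from the MEM stage: the address picks the target. */
void bus_store8(bus_t *bus, uint32_t addr, uint8_t value) {
    if (addr == UART_TX_ADDR) {
        if (bus->uart_len < sizeof bus->uart_log)
            bus->uart_log[bus->uart_len++] = (char)value;
    } else if (addr < RAM_SIZE) {
        bus->ram[addr] = value;
    }
    /* Stores to unmapped addresses are silently ignored in this sketch. */
}

/* Byte load: only RAM is readable here; anything else reads as zero. */
uint8_t bus_load8(const bus_t *bus, uint32_t addr) {
    return addr < RAM_SIZE ? bus->ram[addr] : 0;
}
```

From the program’s point of view, printing a character is an ordinary store instruction; only the address decoding behind the MEM stage knows the difference.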

WB

Write Back

The result — whether from the ALU or a memory load — is written back to the destination register in the register file. The instruction is now complete.

Five stages means five instructions are in flight simultaneously. While one instruction writes back, the next is accessing memory, the next is computing, the next is decoding, and the next is being fetched.

The Program Counter

The simplest piece of the pipeline: a register that counts up by 4 each cycle. Toggle the stall switch to freeze it — that’s what happens when a hazard is detected.

In the full CPU, the PC doesn’t always increment by 4. A mux selects between three sources: PC + 4 for sequential execution, the branch’s own PC + immediate for branches and JAL, or register + immediate for JALR (indirect jumps). When a branch is taken, the pipeline flushes the wrongly fetched instructions and redirects fetch to the target address.

Program Counter
Increments by 4 each clock cycle. Stall freezes the count.

Here’s the full version with the mux. Toggle branch to redirect the PC to address 0x100, jump to redirect to 0x400, or stall to freeze it entirely (like a load-use hazard). Turn them off and the PC resumes incrementing by 4 from wherever it landed.

PC with Next-PC Mux
Toggle stall/branch/jump to see the three PC control paths.
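The next-PC selection can be sketched in C as a chain of priorities. This is a model under assumptions, not the circuit on this page: the struct and field names are hypothetical, and giving a redirect priority over a stall is a choice made for this example. Clearing bit 0 of the JALR target follows the RISC-V spec.

```c
#include <stdint.h>
#include <stdbool.h>

/* Control inputs to the next-PC mux (names are illustrative). */
typedef struct {
    bool     stall;         /* hazard unit asks the PC to hold */
    bool     branch_taken;  /* taken branch or JAL: PC-relative target */
    bool     jalr;          /* indirect jump: register-relative target */
    uint32_t branch_pc;     /* PC of the branch/JAL instruction itself */
    uint32_t rs1;           /* base register value for JALR */
    int32_t  imm;           /* sign-extended immediate */
} pc_ctrl_t;

uint32_t next_pc(uint32_t pc, const pc_ctrl_t *c) {
    if (c->jalr)
        return (c->rs1 + (uint32_t)c->imm) & ~1u;  /* JALR clears bit 0 */
    if (c->branch_taken)
        return c->branch_pc + (uint32_t)c->imm;
    if (c->stall)
        return pc;       /* hold: refetch the same address next cycle */
    return pc + 4;       /* sequential execution */
}
```

Casting a negative immediate to uint32_t and adding wraps around modulo 2^32, which matches what a 32-bit hardware adder does for backward branches.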

Pipeline Registers

Between every stage sits a pipeline register. It latches the output of one stage on the clock edge, holding it stable for the next stage to read. The flush input zeros the register; it is used when a branch is taken and the wrongly fetched instruction must be discarded.

Pipeline Register
Latches data between stages. Flush clears to zero.
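In software terms, a pipeline register is a struct that only changes when a clock function is called. The sketch below models the register between Fetch and Decode. The type and function names are hypothetical, and letting flush win over stall is an assumption of this example.

```c
#include <stdint.h>
#include <stdbool.h>

/* Contents of the register between Fetch and Decode (illustrative). */
typedef struct {
    uint32_t pc;
    uint32_t insn;
} if_id_reg_t;

/* One rising clock edge.
 *   flush: clear the register to zero (discard the instruction)
 *   stall: keep the old contents (the next stage sees them again)
 *   otherwise: latch the new input d */
void if_id_clock(if_id_reg_t *r, if_id_reg_t d, bool stall, bool flush) {
    if (flush) {
        r->pc = 0;
        r->insn = 0;
    } else if (!stall) {
        *r = d;
    }
}
```

Between calls the contents never change, which is exactly the property that lets each stage work on a stable input for a whole cycle.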

The ALU

The Arithmetic Logic Unit is the core of the Execute stage. RV32I needs 10 operations: add, subtract, AND, OR, XOR, shift left, shift right (logical and arithmetic), set-less-than (signed and unsigned).

All 10 operations compute in parallel on the same inputs. A multiplexer at the output picks the right result based on the ALU control signal derived from the instruction’s opcode and function fields.

Here’s a simplified 8-bit version with 4 operations. The full RV32I ALU works the same way, just wider (32 bits) and with more operations.

8-bit ALU
op: 00=ADD 01=SUB 10=AND 11=OR
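The same 8-bit ALU can be modeled in a few lines of C. The op encoding matches the caption above; the function name and the array-as-mux trick are choices made for this sketch, meant to mirror the "compute everything, then select" structure rather than to be the fastest software version.

```c
#include <stdint.h>

enum { ALU_ADD = 0, ALU_SUB = 1, ALU_AND = 2, ALU_OR = 3 };

uint8_t alu8(uint8_t op, uint8_t a, uint8_t b) {
    /* In hardware all four results exist at once; model that by
     * computing every one of them up front. */
    uint8_t results[4];
    results[ALU_ADD] = (uint8_t)(a + b);  /* wraps modulo 256 */
    results[ALU_SUB] = (uint8_t)(a - b);  /* wraps modulo 256 */
    results[ALU_AND] = a & b;
    results[ALU_OR]  = a | b;

    /* The output mux: two select bits pick one result. */
    return results[op & 3];
}
```

Overflow simply wraps, as it does in the circuit: alu8(ALU_ADD, 200, 100) yields 44, because 300 does not fit in 8 bits.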

The Hard Part: Hazards

Pipelining sounds clean in theory. In practice, instructions depend on each other. Consider:

add a0, a1, a2 // a0 = a1 + a2
sub a3, a0, a4 // a3 = a0 - a4 ← needs a0!

The sub needs the result of add, but in a pipeline, sub enters the Decode stage while add is still in Execute. The result hasn’t been written back to the register file yet.

This is a data hazard. There are two solutions:

Stalling

Freeze the pipeline for a cycle until the result is available. Simple but wastes a cycle. Our hazard unit does this for load-use hazards — when a value is being loaded from memory, it’s not ready until the Memory stage completes.
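The load-use check itself is small. Here is a sketch in C of the textbook condition, with hypothetical signal names; the actual hazard unit on this page may be wired differently. It is deliberately conservative: it compares both source fields even when the instruction in Decode does not use rs2.

```c
#include <stdint.h>
#include <stdbool.h>

/* Stall when the instruction in Execute is a load and the instruction
 * in Decode reads the register that load will write.
 *   ex_is_load: instruction in EX reads data memory
 *   ex_rd:      its destination register
 *   id_rs1/2:   source registers of the instruction in ID */
bool load_use_stall(bool ex_is_load, uint8_t ex_rd,
                    uint8_t id_rs1, uint8_t id_rs2) {
    if (!ex_is_load)
        return false;
    if (ex_rd == 0)
        return false;  /* x0 is hardwired to zero, never a real dependency */
    return ex_rd == id_rs1 || ex_rd == id_rs2;
}
```

When this returns true, the PC and the Fetch/Decode register hold their values for one cycle and a bubble is sent into Execute; after that the loaded value can be forwarded.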

Forwarding

Don’t wait for writeback. Grab the result directly from whichever stage computed it and feed it back to the Execute stage through a forwarding mux. No stall needed — the pipeline keeps flowing.

Control hazards are the other problem. When a branch is taken, the instructions that were already fetched after it are wrong. The pipeline flushes them — replaces them with NOPs — and restarts from the branch target. That’s what the flush input on the pipeline registers does.

The Forwarding Mux

Two bits select the source: 00 = register file (no hazard), 01 = forward from EX stage, 10 = forward from MEM stage. The forwarding unit sets these bits automatically by comparing register addresses across pipeline stages.

Forwarding Mux
sel: 00=register 01=EX forward 10=MEM forward
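The register comparison can be sketched in C as follows. The select encoding follows the caption above. The rest is assumption: "EX forward" is taken to mean the result of the instruction one ahead, "MEM forward" the result of the instruction two ahead, the newer result wins when both match, and all type and function names are invented for this example.

```c
#include <stdint.h>
#include <stdbool.h>

enum { FWD_REG = 0, FWD_EX = 1, FWD_MEM = 2 };

/* What an older, still in-flight instruction is going to write. */
typedef struct {
    bool    reg_write;  /* does it write a register at all? */
    uint8_t rd;         /* destination register number */
} stage_dest_t;

/* Choose the source for one ALU operand that reads register rs. */
uint8_t forward_sel(uint8_t rs, stage_dest_t ex_result,
                    stage_dest_t mem_result) {
    if (rs == 0)
        return FWD_REG;  /* x0 always reads as zero */
    if (ex_result.reg_write && ex_result.rd == rs)
        return FWD_EX;   /* newest matching result takes priority */
    if (mem_result.reg_write && mem_result.rd == rs)
        return FWD_MEM;
    return FWD_REG;
}

/* The mux itself: three data inputs, two select bits. */
uint32_t forward_mux(uint8_t sel, uint32_t reg_val,
                     uint32_t ex_val, uint32_t mem_val) {
    switch (sel) {
    case FWD_EX:  return ex_val;
    case FWD_MEM: return mem_val;
    default:      return reg_val;
    }
}
```

In the add/sub example earlier, sub reads a0 while add is one stage ahead, so forward_sel returns FWD_EX for that operand and the ALU sees the fresh sum.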

The Payoff: Running C

That’s the whole CPU. A program counter, a decoder, an ALU, memory, a register file, pipeline registers between each stage, and forwarding/hazard logic to keep it all correct. Around 600 lines of circuit description — registers, muxes, and functional blocks wired together structurally, like a real hardware design.

Below is the complete board — CPU chip, instruction ROM, data RAM, memory bus, and UART — just like a real PCB. Click the Code button on the ROM node to write C, Rust, or Assembly. The compiler produces a RISC-V binary that loads directly into the ROM. Then step the clock and watch your code execute through the 5-stage pipeline.

Double-click the CPU to drill into its internals — you’ll see every pipeline register, every forwarding mux, every hazard signal. Toggle switches, inspect values, rewind time. It’s the same circuit, all the way down.

RV32I CPU Board
Click 'Code' on the ROM to load a program, then step the clock.

Things to Try

The debugger above isn’t a recording — it’s a live simulation. Here are some experiments that reveal how the pipeline actually works:

1. Spot a data hazard

Write two instructions where the second uses the result of the first. Step through and watch the forwarding mux kick in — the EX or MEM result gets bypassed directly instead of waiting for writeback.

int x = 5;
int y = x + 3; // needs x immediately

2. Trigger a pipeline flush

Add an if statement. When the branch is taken, watch the pipeline badges — instructions that were already fetched get flushed and replaced with NOPs.

int x = 10;
if (x > 5) {
    x = x * 2;
}

3. Compare C vs Assembly

Write something simple in C, compile it, then switch to Assembly and write the same thing by hand. Compare how many instructions GCC generates vs your hand-written version.

// C: how many instructions?
int result = (3 + 4) * 2;

4. Try a different language

Switch to Rust and compile an equivalent program, such as a Fibonacci function written in both languages. The disassembly will look different, because different compilers make different choices, but the pipeline doesn’t care. Same stages, same forwarding, same hazards.


Now build your own

Every circuit on this page was designed in Simten — a visual circuit simulator with an AI tutor. Start from simple components or jump straight to CPU design. The AI helps you wire things up, debug issues, and understand what’s happening at every level.