Simten

Simulator Engine

Compilation pipeline, typed arrays, evaluator dispatch, tick cycle, and performance

The simulator compiles circuits into a numeric representation and evaluates them using an event-driven propagation queue. Watch it in action — source nodes fire first, then changes ripple through the circuit:

Compiling circuit...

Why this architecture

simten makes two architectural choices worth naming explicitly: a hierarchical-to-flat elaboration step before simulation, and event-driven rather than compiled-code dispatch. Both are well-established in production digital logic simulators, though the event-driven vs compiled-code choice splits the field.

Circuits are authored as hierarchical TypeScript values — composites containing sub-nodes and port connections (Circuit in packages/core/src/types/circuit.ts). Before simulation, elaboration (packages/core/src/simulator/elaboration.ts) recursively flattens the hierarchy; the result is then compiled to a NumericCircuit (packages/core/src/simulator/numeric-types.ts) — typed arrays of port indices, dependency lists, and an evaluator dispatch table. The flat numeric form is what the hot loop actually executes. The split between authoring representation and execution representation lets each one specialize: the edit IR maps directly to the user's mental model of a schematic, while the runtime IR is built around what's fast on modern CPUs.

Event-driven vs compiled-code. simten only re-evaluates gates whose inputs changed since the last tick; an event queue propagates the change set until it settles. The alternative — used by Verilator — is compiled-code style: evaluate every gate every cycle, no queue. Compiled-code wins on batch verification of a fixed netlist where steady-state throughput dominates; event-driven scales better when most of the circuit is stable at any given moment, which is the common case during interactive editing. Most commercial event-driven simulators (VCS, Questa, Xcelium) sit in the same family as simten — with vastly more sophisticated scheduling — and event-driven is the dominant style in commercial EDA for the same workload reason. The choice isn't a quality difference, it's a workload difference.

Where this pattern comes from. Event-driven logic simulation goes back to the late 1960s; the data structures simten uses (integer arrays indexed by node, an event queue, a typed dispatch table) are recognizable across the literature, with much of the optimization work — levelization, compiled hybrids — developed in the 1980s and 90s and listed below as future work. Verilator's docs make the same case for the abstraction-vs-execution split: they note that using their abstract SystemC port interface internally everywhere, rather than their packed internal representation, "would reduce performance by an order of magnitude." That's their version of the same trade — abstract representation for authoring, packed integer-indexed representation for execution. The pattern recurs in compilers — V8 parses JavaScript to an AST, lowers to bytecode, then compiles hot paths to machine code — for the same reason: a representation good for humans to author and analyze is poor for machines to execute fast.

Optimizations not yet implemented. Each item below is a known elaboration-time optimization with a concrete slot in the existing pipeline; none is required for the current target — interactive circuits up to thousands of gates — but each is a viable pickup if propagation throughput becomes a bottleneck at larger scales.

  • Levelization — topological sort at elaboration time so events propagate in dependency order without re-firing. Slots into elaboration.ts after flattening; the propagate loop walks levels in order rather than draining a queue.
  • SCC detection for feedback loops — identify strongly-connected regions during elaboration so fixed-point iteration is isolated to those regions instead of running across the whole circuit. Also in elaboration.ts.
  • Constant propagation — pre-fold nodes whose inputs are all known constants at elaboration time, eliminating their runtime dispatch entirely. New pass adjacent to flattening.
  • Dead-gate elimination — reachability sweep from outputs at elaboration time, drop unreachable nodes. Pairs naturally with constant propagation.
  • JIT codegen of straight-line dispatch — emit one propagate function per circuit via new Function(...), replacing the event-queue loop with unrolled gate evaluations. Slots into compile-circuit.ts as an alternative compilation path. Caveat: new Function is blocked under strict CSP, so this is deployment-context-dependent.

Compilation Pipeline

Click through the tabs to see a FullAdder circuit at each stage — from TypeScript source to typed arrays to running simulation:

Loading...

Elaboration

Transforms hierarchical circuits into a flat list of primitives:

  1. Flatten: Recursively expand composite nodes. Each primitive gets a path-prefixed ID (e.g., cpu.alu.adder1 for a nested adder)
  2. Stitch connections: Build port forwarding map for composite boundaries, resolve connections through nested composites via transitive closure
  3. Build dependency graph: For each connection, add target node to source node's dependents list (used for event-driven propagation)
  4. Detect recursion: Guard against circuits that instantiate themselves

Output: FlatCircuit with nodes, connections, hierarchy (for UI), topLevelInputs/Outputs.

Numeric Compilation

Converts the FlatCircuit into typed arrays for O(1) access in the hot path:

Step 1 — Node indices: Assign each node a numeric index (0 to nodeCount-1). nodeIdToIndex: Map<string, number> and indexToNodeId: string[].

Step 2 — Port indices: Count total ports, assign contiguous indices. Per-node metadata in typed arrays:

nodePortStart:   Uint32Array[nodeIdx]  → first port index for this node
nodeInputCount:  Uint8Array[nodeIdx]   → number of input ports
nodeOutputCount: Uint8Array[nodeIdx]   → number of output ports

Port layout per node: [input0, input1, ..., output0, output1, ...] starting at nodePortStart[nodeIdx].

Step 3 — Port metadata: For each port, store type information:

portIsOutput: Uint8Array[portIdx]  → 1 if output, 0 if input
portIsBus:    Uint8Array[portIdx]  → 1 if bus, 0 if bit
inputPortNames: string[portIdx]    → port name (for evaluator context)

Step 4 — Evaluator dispatch: Map each node to its evaluator function index:

primitiveTypeIndex: Uint8Array[nodeIdx] → index into EVALUATORS table

Step 5 — Dependency graph: Convert string-based dependents to numeric:

dependents: Uint32Array[]  → dependents[nodeIdx] = array of downstream node indices

Step 6 — Input sources: Pre-compute where each input port reads from:

inputSourceNode: Int32Array[portIdx]  → source node index (-1 for top-level)
inputSourcePort: Int32Array[portIdx]  → source port index into values array

This eliminates all string concatenation and Map lookups from the hot path. Reading an input becomes: values[inputSourcePort[portStart + inputIndex]].

Step 7 — Node classification:

isSourceNode:       Uint8Array[nodeIdx]  → no inputs (Constant, Switch, Input)
isStateOutputNode:  Uint8Array[nodeIdx]  → outputs depend on state (Register, DFlipFlop)
hasState:           Uint8Array[nodeIdx]  → has sequential state
readsTopLevelInput: Uint8Array[nodeIdx]  → reads from circuit boundary

Port Values

All port values stored in a single Int32Array:

interface NumericPortValues {
  values: Int32Array;       // One entry per port (bits stored as 0/1, buses as numbers)
  changed: Uint8Array;      // Per-wave change flag for the propagation queue
  initialized: Uint8Array;  // 1 once a port has been written by any evaluator
}

values defaults to 0. The initialized flag tracks whether each port has been written; change detection treats the first eval of every port as a "change" so propagation reaches downstream nodes even when an output's canonical value matches the zero default.

Event Queue

Pre-allocated ring buffer with O(1) deduplication:

class NumericEventQueue {
  queue: Uint32Array;    // Ring buffer of node indices
  pending: Uint8Array;   // pending[nodeIdx] = 1 if already in queue
  head: number;
  tail: number;
}
  • enqueue(nodeIndex): O(1), skips if pending[nodeIdx] === 1
  • dequeue(): O(1), clears pending flag
  • enqueueAll(Uint32Array): Bulk enqueue from pre-computed dependents array
  • No allocations during simulation

Evaluator System

packages/core/src/simulator/evaluators/

Dispatch Table

const EVALUATORS: (NumericEvaluator | null)[] = [];
EVALUATORS[PRIMITIVE_TYPE_INDICES.And] = evalAnd;
EVALUATORS[PRIMITIVE_TYPE_INDICES.Or] = evalOr;
// ... one entry per primitive type

Hot path lookup: EVALUATORS[primitiveTypeIndex[nodeIndex]](ctx) — single array index, no branching.

Evaluation Context

Passed to every evaluator, reused across all evaluations in a propagation pass:

interface EvalContext {
  circuit: NumericCircuit;
  values: NumericPortValues;
  state: NumericSequentialState | undefined;
  queue: NumericEventQueue;
  nodeIndex: number;      // Current node being evaluated
  portStart: number;      // First port index for this node
  inputCount: number;
  outputCount: number;
}

Helper Functions

// Read input port value (O(1) — no string ops)
function readInput(ctx: EvalContext, inputIndex: number): number {
  const sourcePortIdx = ctx.circuit.inputSourcePort[ctx.portStart + inputIndex];
  return sourcePortIdx >= 0 ? ctx.values.values[sourcePortIdx] : 0;
}

// Write output port value (O(1))
function writeOutput(ctx: EvalContext, outputIndex: number, value: number): void {
  ctx.values.values[ctx.portStart + ctx.inputCount + outputIndex] = value;
}

Evaluator Categories

ModulePrimitivesNotes
Inline (index.ts)And, Or, Not, Nand, Nor, Xor, Xnor, Buffer, Switch, Led, Input, Constant, ProbeSimple enough to inline
arithmetic.tsAdder, Subtractor, Multiplier, Comparator, Shifters, Signed variantsWidth-aware, carry/overflow detection
routing.tsMux, Decoder, Splitter, BitSlice, Concat, Combiner8to8Pure wire routing, often zero logic cost
sequential.tsDFlipFlop, Register, Screen, RasterDisplay, ConsoleRead from currentState[nodeIndex]
memory.tsROM, RAM, DualPortRAMSparse Map storage, combinational read path

Sequential evaluators only read state for their outputs — state updates happen in a separate phase.

Tick Cycle

Phase 1: Combinational Propagation

seedInitialQueue(circuit, queue);
fastPropagate(circuit, queue, values, seqState, topLevelInputs);

Seeds the queue with source nodes (no inputs), state-output nodes (Register, DFlipFlop), and nodes reading top-level inputs. Then propagates:

while queue is not empty:
  node = dequeue()
  capture old output values
  evaluator = EVALUATORS[primitiveTypeIndex[node]]
  evaluator(ctx)              // reads inputs, writes outputs
  if any output changed:
    enqueueAll(dependents[node])

Stabilizes when no more changes propagate. Throws after 10,000 iterations (unstable feedback loop).

Output snapshot captured here — this is what the API returns. It reflects the combinational computation with current state, before the clock edge.

Phase 2: Clock Update

updateClockStates(circuit, seqState);

Sets all clocks to rising edge. Currently all clocks tick simultaneously (single clock domain).

Phase 3: Sequential State Computation

updateSequentialStates(circuit, values, seqState, topLevelInputs);

For each node with hasState[nodeIdx]:

  1. Build inputs Map from current port values
  2. Build clock edges from seqState.clocks
  3. Call evaluator.updateState(inputs, currentState, clockEdges)
  4. Store result in seqState.nextState[nodeIdx]

Note: This phase still uses Map-based evaluators (not the numeric fast path). It runs once per tick per stateful node, so performance impact is proportional to the number of sequential components, not total node count.

Phase 4: State Commit

commitSequentialState(seqState);

Atomically copies nextState → currentState for all nodes. Deep-copies Map-based state (RAM/ROM). Increments cycle count. This ensures all state updates appear to happen simultaneously, matching real hardware behavior.

Phase 5: Re-propagation

seedStateOutputNodes(circuit, queue);
fastPropagate(circuit, queue, values, seqState, topLevelInputs);

Propagates the committed state through combinational logic. Only seeds state-output nodes (not source nodes — they haven't changed). This prepares internal values for the next tick's Phase 1. Not returned to the API.

Memory Handling

Initialization

Memory primitives (ROM, RAM, DualPortRAM) are initialized from two sources:

  1. Builder-embedded data: nodes: { ram: RAM({ memory: { 0: 42, 1: 100 } }) }
  2. Injected memory data: Runtime data passed via initialMemory option (e.g., compiled RISC-V binaries loaded into instruction memory)

Memory data uses glob-pattern matching for node identification: "cpu0*imem" matches "cpu0_abc_RV32I_CPU_imem_xyz".

Injected data overwrites builder-embedded data for overlapping addresses.

Storage

Memory uses Map<number, number> for sparse storage — only written addresses consume memory. A 64K address space with 100 written cells uses 100 Map entries, not 65536 array slots.

State is stored in seqState.currentState[nodeIndex] and deep-copied on commit to prevent state sharing between cycles.

Simulator API

const engine = createSimulator(flatCircuit, {
  componentLibrary,
  initialMemory,   // Map<pattern, Map<address, value>>
});

// Combinational only
engine.runCombinational();

// Sequential tick
const result = engine.tick();
// result.portValues — all port values after combinational propagation
// result.sequentialState — committed state after clock edge
// result.metrics — { phase1Evals, phase2Evals, totalEvals }

// Input control
engine.setNode(nodeId, value);

// State query
engine.getPortValues();
engine.getState();
engine.getOutput(nodeId, portName);

// Time-travel
const snapshot = engine.snapshot();
engine.restore(snapshot);
engine.reset();

Performance Characteristics

Hot Path (Zero Allocation)

The propagation loop allocates nothing:

  • Event queue: pre-allocated ring buffer
  • Port values: pre-allocated Int32Array
  • Evaluator lookup: array index
  • Input reading: array index
  • Output writing: array index
  • Change detection: compare Int32 values
  • Dependent enqueue: iterate pre-computed Uint32Array

Complexity

OperationTimeSpace
CompilationO(nodes + connections)O(nodes + ports)
Single propagationO(changed_nodes × avg_dependents)O(1) amortized
Tick (5 phases)O(total_evals)O(stateful_nodes) for Map building
Memory readO(1) Map lookup
Memory writeO(1) Map setO(written_cells)

Typical Performance

For a 114-node Snake circuit: ~50-100 evaluations per tick. For a 300-node CPU: ~200-400 evaluations per tick. The event-driven approach means only nodes whose inputs actually changed get re-evaluated.

Limitations

  • Single clock domain: All clocks tick simultaneously. Multi-clock-domain circuits are not supported (no clock dividers, no async crossings).
  • Combinational loop detection: Throws after 10,000 iterations. Intentional latches (transparent latches, async resets) are not supported.
  • Sequential state update path: Uses Map-based evaluators, not the numeric fast path. Performance scales with the number of stateful nodes, not total nodes.
  • 32-bit value range: All values stored as Int32. Bus widths up to 32 bits are supported. Wider buses require splitting.
  • No timing simulation: All combinational logic evaluates in zero simulated time. There is no gate delay model — this is a functional simulator, not a timing simulator.

On this page