Simulator Engine
Compilation pipeline, typed arrays, evaluator dispatch, tick cycle, and performance
The simulator compiles circuits into a numeric representation and evaluates them using an event-driven propagation queue. Watch it in action — source nodes fire first, then changes ripple through the circuit:
Why this architecture
simten makes two architectural choices worth naming explicitly: a hierarchical-to-flat elaboration step before simulation, and event-driven rather than compiled-code dispatch. Both are well-established in production digital logic simulators, though the event-driven vs compiled-code choice splits the field.
Circuits are authored as hierarchical TypeScript values — composites containing sub-nodes
and port connections (Circuit in packages/core/src/types/circuit.ts). Before simulation,
elaboration (packages/core/src/simulator/elaboration.ts) recursively flattens the
hierarchy; the result is then compiled to a NumericCircuit
(packages/core/src/simulator/numeric-types.ts) — typed arrays of port indices, dependency
lists, and an evaluator dispatch table. The flat numeric form is what the hot loop actually
executes. The split between authoring representation and execution representation lets each
one specialize: the edit IR maps directly to the user's mental model of a schematic, while
the runtime IR is built around what's fast on modern CPUs.
Event-driven vs compiled-code. simten only re-evaluates gates whose inputs changed since the last tick; an event queue propagates the change set until it settles. The alternative — used by Verilator — is compiled-code style: evaluate every gate every cycle, no queue. Compiled-code wins on batch verification of a fixed netlist where steady-state throughput dominates; event-driven scales better when most of the circuit is stable at any given moment, which is the common case during interactive editing. Most commercial event-driven simulators (VCS, Questa, Xcelium) sit in the same family as simten — with vastly more sophisticated scheduling — and event-driven is the dominant style in commercial EDA for the same workload reason. The choice isn't a quality difference, it's a workload difference.
Where this pattern comes from. Event-driven logic simulation goes back to the late 1960s; the data structures simten uses (integer arrays indexed by node, an event queue, a typed dispatch table) are recognizable across the literature, with much of the optimization work — levelization, compiled hybrids — developed in the 1980s and 90s and listed below as future work. Verilator's docs make the same case for the abstraction-vs-execution split: they note that using their abstract SystemC port interface internally everywhere, rather than their packed internal representation, "would reduce performance by an order of magnitude." That's their version of the same trade — abstract representation for authoring, packed integer-indexed representation for execution. The pattern recurs in compilers — V8 parses JavaScript to an AST, lowers to bytecode, then compiles hot paths to machine code — for the same reason: a representation good for humans to author and analyze is poor for machines to execute fast.
Optimizations not yet implemented. Each item below is a known elaboration-time optimization with a concrete slot in the existing pipeline; none is required for the current target — interactive circuits up to thousands of gates — but each is a viable pickup if propagation throughput becomes a bottleneck at larger scales.
- Levelization — topological sort at elaboration time so events propagate in dependency
order without re-firing. Slots into
elaboration.tsafter flattening; the propagate loop walks levels in order rather than draining a queue. - SCC detection for feedback loops — identify strongly-connected regions during
elaboration so fixed-point iteration is isolated to those regions instead of running
across the whole circuit. Also in
elaboration.ts. - Constant propagation — pre-fold nodes whose inputs are all known constants at elaboration time, eliminating their runtime dispatch entirely. New pass adjacent to flattening.
- Dead-gate elimination — reachability sweep from outputs at elaboration time, drop unreachable nodes. Pairs naturally with constant propagation.
- JIT codegen of straight-line dispatch — emit one propagate function per circuit via
new Function(...), replacing the event-queue loop with unrolled gate evaluations. Slots intocompile-circuit.tsas an alternative compilation path. Caveat:new Functionis blocked under strict CSP, so this is deployment-context-dependent.
Compilation Pipeline
Click through the tabs to see a FullAdder circuit at each stage — from TypeScript source to typed arrays to running simulation:
Elaboration
Transforms hierarchical circuits into a flat list of primitives:
- Flatten: Recursively expand composite nodes. Each primitive gets a path-prefixed ID
(e.g.,
cpu.alu.adder1for a nested adder) - Stitch connections: Build port forwarding map for composite boundaries, resolve connections through nested composites via transitive closure
- Build dependency graph: For each connection, add target node to source node's
dependentslist (used for event-driven propagation) - Detect recursion: Guard against circuits that instantiate themselves
Output: FlatCircuit with nodes, connections, hierarchy (for UI), topLevelInputs/Outputs.
Numeric Compilation
Converts the FlatCircuit into typed arrays for O(1) access in the hot path:
Step 1 — Node indices: Assign each node a numeric index (0 to nodeCount-1).
nodeIdToIndex: Map<string, number> and indexToNodeId: string[].
Step 2 — Port indices: Count total ports, assign contiguous indices. Per-node metadata in typed arrays:
nodePortStart: Uint32Array[nodeIdx] → first port index for this node
nodeInputCount: Uint8Array[nodeIdx] → number of input ports
nodeOutputCount: Uint8Array[nodeIdx] → number of output portsPort layout per node: [input0, input1, ..., output0, output1, ...] starting at nodePortStart[nodeIdx].
Step 3 — Port metadata: For each port, store type information:
portIsOutput: Uint8Array[portIdx] → 1 if output, 0 if input
portIsBus: Uint8Array[portIdx] → 1 if bus, 0 if bit
inputPortNames: string[portIdx] → port name (for evaluator context)Step 4 — Evaluator dispatch: Map each node to its evaluator function index:
primitiveTypeIndex: Uint8Array[nodeIdx] → index into EVALUATORS tableStep 5 — Dependency graph: Convert string-based dependents to numeric:
dependents: Uint32Array[] → dependents[nodeIdx] = array of downstream node indicesStep 6 — Input sources: Pre-compute where each input port reads from:
inputSourceNode: Int32Array[portIdx] → source node index (-1 for top-level)
inputSourcePort: Int32Array[portIdx] → source port index into values arrayThis eliminates all string concatenation and Map lookups from the hot path.
Reading an input becomes: values[inputSourcePort[portStart + inputIndex]].
Step 7 — Node classification:
isSourceNode: Uint8Array[nodeIdx] → no inputs (Constant, Switch, Input)
isStateOutputNode: Uint8Array[nodeIdx] → outputs depend on state (Register, DFlipFlop)
hasState: Uint8Array[nodeIdx] → has sequential state
readsTopLevelInput: Uint8Array[nodeIdx] → reads from circuit boundaryPort Values
All port values stored in a single Int32Array:
interface NumericPortValues {
values: Int32Array; // One entry per port (bits stored as 0/1, buses as numbers)
changed: Uint8Array; // Per-wave change flag for the propagation queue
initialized: Uint8Array; // 1 once a port has been written by any evaluator
}values defaults to 0. The initialized flag tracks whether each port has been written; change detection treats the first eval of every port as a "change" so propagation reaches downstream nodes even when an output's canonical value matches the zero default.
Event Queue
Pre-allocated ring buffer with O(1) deduplication:
class NumericEventQueue {
queue: Uint32Array; // Ring buffer of node indices
pending: Uint8Array; // pending[nodeIdx] = 1 if already in queue
head: number;
tail: number;
}enqueue(nodeIndex): O(1), skips ifpending[nodeIdx] === 1dequeue(): O(1), clears pending flagenqueueAll(Uint32Array): Bulk enqueue from pre-computed dependents array- No allocations during simulation
Evaluator System
packages/core/src/simulator/evaluators/
Dispatch Table
const EVALUATORS: (NumericEvaluator | null)[] = [];
EVALUATORS[PRIMITIVE_TYPE_INDICES.And] = evalAnd;
EVALUATORS[PRIMITIVE_TYPE_INDICES.Or] = evalOr;
// ... one entry per primitive typeHot path lookup: EVALUATORS[primitiveTypeIndex[nodeIndex]](ctx) — single array index, no branching.
Evaluation Context
Passed to every evaluator, reused across all evaluations in a propagation pass:
interface EvalContext {
circuit: NumericCircuit;
values: NumericPortValues;
state: NumericSequentialState | undefined;
queue: NumericEventQueue;
nodeIndex: number; // Current node being evaluated
portStart: number; // First port index for this node
inputCount: number;
outputCount: number;
}Helper Functions
// Read input port value (O(1) — no string ops)
function readInput(ctx: EvalContext, inputIndex: number): number {
const sourcePortIdx = ctx.circuit.inputSourcePort[ctx.portStart + inputIndex];
return sourcePortIdx >= 0 ? ctx.values.values[sourcePortIdx] : 0;
}
// Write output port value (O(1))
function writeOutput(ctx: EvalContext, outputIndex: number, value: number): void {
ctx.values.values[ctx.portStart + ctx.inputCount + outputIndex] = value;
}Evaluator Categories
| Module | Primitives | Notes |
|---|---|---|
| Inline (index.ts) | And, Or, Not, Nand, Nor, Xor, Xnor, Buffer, Switch, Led, Input, Constant, Probe | Simple enough to inline |
| arithmetic.ts | Adder, Subtractor, Multiplier, Comparator, Shifters, Signed variants | Width-aware, carry/overflow detection |
| routing.ts | Mux, Decoder, Splitter, BitSlice, Concat, Combiner8to8 | Pure wire routing, often zero logic cost |
| sequential.ts | DFlipFlop, Register, Screen, RasterDisplay, Console | Read from currentState[nodeIndex] |
| memory.ts | ROM, RAM, DualPortRAM | Sparse Map storage, combinational read path |
Sequential evaluators only read state for their outputs — state updates happen in a separate phase.
Tick Cycle
Phase 1: Combinational Propagation
seedInitialQueue(circuit, queue);
fastPropagate(circuit, queue, values, seqState, topLevelInputs);Seeds the queue with source nodes (no inputs), state-output nodes (Register, DFlipFlop), and nodes reading top-level inputs. Then propagates:
while queue is not empty:
node = dequeue()
capture old output values
evaluator = EVALUATORS[primitiveTypeIndex[node]]
evaluator(ctx) // reads inputs, writes outputs
if any output changed:
enqueueAll(dependents[node])Stabilizes when no more changes propagate. Throws after 10,000 iterations (unstable feedback loop).
Output snapshot captured here — this is what the API returns. It reflects the combinational computation with current state, before the clock edge.
Phase 2: Clock Update
updateClockStates(circuit, seqState);Sets all clocks to rising edge. Currently all clocks tick simultaneously (single clock domain).
Phase 3: Sequential State Computation
updateSequentialStates(circuit, values, seqState, topLevelInputs);For each node with hasState[nodeIdx]:
- Build inputs Map from current port values
- Build clock edges from
seqState.clocks - Call
evaluator.updateState(inputs, currentState, clockEdges) - Store result in
seqState.nextState[nodeIdx]
Note: This phase still uses Map-based evaluators (not the numeric fast path). It runs once per tick per stateful node, so performance impact is proportional to the number of sequential components, not total node count.
Phase 4: State Commit
commitSequentialState(seqState);Atomically copies nextState → currentState for all nodes. Deep-copies Map-based state
(RAM/ROM). Increments cycle count. This ensures all state updates appear to happen
simultaneously, matching real hardware behavior.
Phase 5: Re-propagation
seedStateOutputNodes(circuit, queue);
fastPropagate(circuit, queue, values, seqState, topLevelInputs);Propagates the committed state through combinational logic. Only seeds state-output nodes (not source nodes — they haven't changed). This prepares internal values for the next tick's Phase 1. Not returned to the API.
Memory Handling
Initialization
Memory primitives (ROM, RAM, DualPortRAM) are initialized from two sources:
- Builder-embedded data:
nodes: { ram: RAM({ memory: { 0: 42, 1: 100 } }) } - Injected memory data: Runtime data passed via
initialMemoryoption (e.g., compiled RISC-V binaries loaded into instruction memory)
Memory data uses glob-pattern matching for node identification:
"cpu0*imem" matches "cpu0_abc_RV32I_CPU_imem_xyz".
Injected data overwrites builder-embedded data for overlapping addresses.
Storage
Memory uses Map<number, number> for sparse storage — only written addresses consume memory.
A 64K address space with 100 written cells uses 100 Map entries, not 65536 array slots.
State is stored in seqState.currentState[nodeIndex] and deep-copied on commit to prevent
state sharing between cycles.
Simulator API
const engine = createSimulator(flatCircuit, {
componentLibrary,
initialMemory, // Map<pattern, Map<address, value>>
});
// Combinational only
engine.runCombinational();
// Sequential tick
const result = engine.tick();
// result.portValues — all port values after combinational propagation
// result.sequentialState — committed state after clock edge
// result.metrics — { phase1Evals, phase2Evals, totalEvals }
// Input control
engine.setNode(nodeId, value);
// State query
engine.getPortValues();
engine.getState();
engine.getOutput(nodeId, portName);
// Time-travel
const snapshot = engine.snapshot();
engine.restore(snapshot);
engine.reset();Performance Characteristics
Hot Path (Zero Allocation)
The propagation loop allocates nothing:
- Event queue: pre-allocated ring buffer
- Port values: pre-allocated Int32Array
- Evaluator lookup: array index
- Input reading: array index
- Output writing: array index
- Change detection: compare Int32 values
- Dependent enqueue: iterate pre-computed Uint32Array
Complexity
| Operation | Time | Space |
|---|---|---|
| Compilation | O(nodes + connections) | O(nodes + ports) |
| Single propagation | O(changed_nodes × avg_dependents) | O(1) amortized |
| Tick (5 phases) | O(total_evals) | O(stateful_nodes) for Map building |
| Memory read | O(1) Map lookup | — |
| Memory write | O(1) Map set | O(written_cells) |
Typical Performance
For a 114-node Snake circuit: ~50-100 evaluations per tick. For a 300-node CPU: ~200-400 evaluations per tick. The event-driven approach means only nodes whose inputs actually changed get re-evaluated.
Limitations
- Single clock domain: All clocks tick simultaneously. Multi-clock-domain circuits are not supported (no clock dividers, no async crossings).
- Combinational loop detection: Throws after 10,000 iterations. Intentional latches (transparent latches, async resets) are not supported.
- Sequential state update path: Uses Map-based evaluators, not the numeric fast path. Performance scales with the number of stateful nodes, not total nodes.
- 32-bit value range: All values stored as Int32. Bus widths up to 32 bits are supported. Wider buses require splitting.
- No timing simulation: All combinational logic evaluates in zero simulated time. There is no gate delay model — this is a functional simulator, not a timing simulator.