How TPUs Do Calculations
Inside Google’s Tensor Processing Units: a 3×3 systolic array built from logic gates. Watch matrix multiplication happen one clock cycle at a time.
Multiply and Add
Most of the arithmetic in neural network inference boils down to one operation: multiply and add. Take an incoming partial sum, multiply a data value by a weight, and add the product to the partial sum. That’s one step of a dot product. Do it across millions of weights and activations, and you get a matrix multiply, the heartbeat of deep learning.
The circuit below is purely combinational: no clock, no registers, no state. Three inputs feed the calculation: data, weight, and partialSumIn. The multiplier computes data × weight, and the adder produces partialSumIn + data × weight. The result appears as soon as the inputs change, without waiting for a clock tick.
Try changing the input values and watch the result update. This single multiply-add is the atom from which we’ll build a full systolic array. Accumulation doesn’t happen inside a single unit — it happens by chaining units together, passing each one’s partial sum output into the next one’s input.
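As a minimal sketch of the idea above, here is the multiply-add as a pure function, with a dot product built by chaining calls so that each partial sum feeds the next. The names multiply_add and dot_product are illustrative, not taken from the circuit.

```python
def multiply_add(data: int, weight: int, partial_sum_in: int) -> int:
    """Combinational multiply-add: the output depends only on the current inputs."""
    return partial_sum_in + data * weight


def dot_product(data_values, weights):
    """Chain multiply-add units: each unit's output is the next unit's partial_sum_in."""
    partial_sum = 0
    for data, weight in zip(data_values, weights):
        partial_sum = multiply_add(data, weight, partial_sum)
    return partial_sum


# Three chained units: 1*2 + 2*0 + 3*1
result = dot_product([1, 2, 3], [2, 0, 1])
print(result)
```

The loop plays the role of the wiring between units: no single unit accumulates anything on its own.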
The Weight Register
In a TPU, weights are loaded before data starts flowing. Each processing element has a weight register that captures a value only when a valid bit is asserted. Once latched, the weight stays fixed for the entire computation. This is the weight-stationary approach — the weight is programmed once, then thousands of activation values stream past it, each getting multiplied by the same weight.
The circuit below shows the weight register and a multiplier. Toggle weightValid on and click Tick to latch the weight. The stored weight display captures the value. Now turn valid off and tick again — the stored weight holds steady. Change dataIn and watch the product update instantly: the stored weight multiplies whatever data arrives.
In the real TPUv1, weights are staged through an on-chip Weight FIFO, fed from off-chip weight memory, and then shifted into the processing elements; the Unified Buffer holds the activations. The valid bit is the gating mechanism that tells each PE exactly when to capture its weight.
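The latch-then-hold behavior can be sketched in a few lines. This is a simplified software model, assuming one call to tick() stands for one clock edge; the class and attribute names are made up for illustration.

```python
class WeightRegister:
    """Captures weight_in on a tick only while weight_valid is asserted."""

    def __init__(self):
        self.stored_weight = 0

    def tick(self, weight_in: int, weight_valid: bool) -> None:
        if weight_valid:
            self.stored_weight = weight_in
        # Otherwise the register keeps its previous value.


reg = WeightRegister()
reg.tick(weight_in=7, weight_valid=True)    # latch 7
reg.tick(weight_in=99, weight_valid=False)  # ignored: valid is low

# Weight-stationary: many data values stream past the same stored weight.
products = [data_in * reg.stored_weight for data_in in (1, 2, 3)]
print(reg.stored_weight, products)
```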
The Processing Element
The processing element (PE) is the fundamental building block of a systolic array. It combines everything we’ve seen so far into one unit with four parts:
- A weight register that latches a weight when the valid signal is high.
- A multiplier that computes dataIn × storedWeight.
- A registered adder that computes partialSumIn + product and latches the result into a register; partialSumOut appears one clock cycle later.
- A data pipeline register that delays dataIn by one clock cycle to produce dataOut, feeding the next PE in the row.
Both partialSumOut and dataOut are registered — each takes one clock cycle. Data moves right one PE per cycle, and partial sums move down one PE per cycle. This symmetry is what makes the systolic array’s timing work: both directions have the same latency per hop.
Load a weight: toggle weightValid on and tick once. Then turn valid off and tick again. Watch partialSumOut update after the tick — it’s registered, not instant. The dataOut display shows the data value delayed by one cycle, ready to feed the next PE.
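The four parts of the PE can be modeled together as one small class. This is a sketch under the assumption that every register updates on the same clock edge, so the product on a given tick uses the weight stored before that tick. The Python names mirror the signal names in the text but are otherwise illustrative.

```python
class ProcessingElement:
    """One PE: weight register, multiplier, registered adder, data pipeline register."""

    def __init__(self):
        self.stored_weight = 0
        self.partial_sum_out = 0  # registered: visible one tick after the inputs
        self.data_out = 0         # registered copy of data_in

    def tick(self, data_in, partial_sum_in, weight_in=0, weight_valid=False):
        # Compute all next-state values from the current state first...
        next_partial_sum = partial_sum_in + data_in * self.stored_weight
        next_weight = weight_in if weight_valid else self.stored_weight
        # ...then update every register together, as a clock edge would.
        self.partial_sum_out = next_partial_sum
        self.data_out = data_in
        self.stored_weight = next_weight


pe = ProcessingElement()
pe.tick(data_in=0, partial_sum_in=0, weight_in=3, weight_valid=True)  # load weight 3
pe.tick(data_in=5, partial_sum_in=10)                                 # 10 + 5*3
print(pe.partial_sum_out, pe.data_out)
```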
Horizontal Data Flow
The word “systolic” comes from the Greek word for the heart’s rhythmic contraction. In a systolic array, data pulses through a grid of processing elements like blood through arteries — each element receives data, processes it, and passes it along in lockstep with the clock.
Here we connect two PEs horizontally. Data enters PE0’s dataIn from the left. After one clock cycle, PE0’s pipeline register outputs the data, which feeds directly into PE1’s dataIn. Each PE has its own stored weight, so the same activation value gets multiplied by different weights as it flows across the row.
Both PEs have partialSumIn = 0 here because vertical flow hasn’t been introduced yet. Each PE independently computes 0 + data × weight. Load weights into both PEs (toggle weightValid on, tick, then toggle off). Now tick repeatedly to watch data flow from left to right. PE0 shows data × weight0, while PE1 shows data × weight1 — one cycle behind.
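Here is a small sketch of the two-PE row, assuming the weights are already latched and using a stripped-down PE class written just for this example. The key detail is that PE1 reads the value PE0 registered on the previous tick, not the one PE0 is receiving now.

```python
class PE:
    """Minimal PE with a preloaded weight (weight loading omitted for brevity)."""

    def __init__(self, weight):
        self.weight = weight
        self.partial_sum_out = 0
        self.data_out = 0

    def tick(self, data_in, partial_sum_in):
        self.partial_sum_out = partial_sum_in + data_in * self.weight
        self.data_out = data_in


pe0, pe1 = PE(weight=2), PE(weight=5)
history = []
for data_in in [3, 4, 0]:
    from_pe0 = pe0.data_out   # sample PE0's register before this tick overwrites it
    pe0.tick(data_in, 0)      # partialSumIn = 0: no vertical flow yet
    pe1.tick(from_pe0, 0)
    history.append((pe0.partial_sum_out, pe1.partial_sum_out))

print(history)
```

Each tuple is (PE0 output, PE1 output) after one tick. PE1's products trail PE0's by one tick, which is the one-cycle lag described above.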
Vertical Partial-Sum Flow
Data flows horizontally, but partial sums flow vertically. Here we stack two PEs in a column. The top PE receives partialSumIn = 0 and computes 0 + data × weight0. That result is registered — it appears at the top PE’s partialSumOut one cycle later. The bottom PE then adds its own data × weight1 to produce the full dot product, again registered one cycle later.
This vertical flow is registered — partial sums move down one PE per clock cycle, just like data moves right one PE per cycle. This symmetry is critical for timing in real hardware. A 256-deep combinational chain couldn’t close timing at TPU clock speeds. Instead, each PE latches its partial sum into a register, giving the signal a full cycle to propagate to the next PE.
To compensate for this vertical delay, the systolic array uses staggered data injection: row 0 receives data starting at cycle 1, row 1 at cycle 2, and so on. This keeps partial sums and activations synchronized as they flow through the array. Load weights (toggle weightValid on, tick, toggle off), then tick twice to see the top result, and a third time for the bottom result.
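The stagger is easier to see in a tiny model of the column. This sketch assumes preloaded weights and a simplified PE class invented for the example; the bottom PE's data is delayed by one tick so it meets the partial sum coming down from the top PE.

```python
class PE:
    """Minimal PE with a preloaded weight (weight loading omitted for brevity)."""

    def __init__(self, weight):
        self.weight = weight
        self.partial_sum_out = 0
        self.data_out = 0

    def tick(self, data_in, partial_sum_in):
        self.partial_sum_out = partial_sum_in + data_in * self.weight
        self.data_out = data_in


top, bottom = PE(weight=2), PE(weight=5)

# Dot product of activations (3, 4) with weights (2, 5).
top_inputs = [3, 0]
bottom_inputs = [0, 4]  # staggered: arrives one tick after the top row's data

outputs = []
for d_top, d_bottom in zip(top_inputs, bottom_inputs):
    psum_from_top = top.partial_sum_out  # value registered on the previous tick
    top.tick(d_top, 0)
    bottom.tick(d_bottom, psum_from_top)
    outputs.append(bottom.partial_sum_out)

print(outputs)  # the last entry is 3*2 + 4*5
```

Without the one-tick delay on the bottom row, the bottom PE would multiply its activation before the top PE's partial sum had arrived, and the two terms would never be added together.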
Wavefront Control
A systolic array needs a cycle counter to orchestrate the computation. In our 2×2 array, the entire matrix multiply takes just 4 clock cycles: one cycle to load weights, then three cycles of pipelined data flow (the formula is 2N−1 for an N×N array).
The controller is built around a phase register and a set of comparators. The register holds the current cycle number. Each comparator checks whether the cycle matches a specific value and drives an LED. When the enable switch is toggled, an incrementer advances the cycle on each clock tick.
In the full systolic array, cycle 0 loads all weights into the PEs. Cycles 1–2 inject activation data into the rows. By cycle 3, the last result emerges from the pipeline, and cycle 4 signals done. Toggle the enable switch and tick to watch the cycle advance and the LEDs light up in sequence.
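The controller can be sketched as a register, an incrementer, and one comparator per LED. This is a simplified model: the class name is illustrative, and holding at the final cycle instead of wrapping around is an assumption, since the text does not say what happens after done.

```python
class CycleController:
    """Phase register + incrementer + one equality comparator per cycle value."""

    def __init__(self, num_phases=5):  # cycles 0..4, matching the description above
        self.cycle = 0
        self.num_phases = num_phases

    def leds(self):
        # Each comparator drives one LED: lit only when the cycle matches its value.
        return [self.cycle == k for k in range(self.num_phases)]

    def tick(self, enable: bool) -> None:
        # Assumption: the counter saturates at the last phase ("done").
        if enable and self.cycle < self.num_phases - 1:
            self.cycle += 1


ctrl = CycleController()
seen = []
for _ in range(6):
    seen.append(ctrl.cycle)
    ctrl.tick(enable=True)

print(seen, ctrl.leds())
```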
The Full 3×3 Systolic Array
Everything we’ve built — multiply-add units, weight registers, horizontal data pipelines, vertical partial-sum accumulation, and cycle control — comes together here. Nine processing elements are arranged in a 3×3 grid. Each PE stores one weight from matrix B. Activations from matrix A flow left to right through pipeline registers. Partial sums flow top to bottom through registered stages in each column — one PE per clock cycle.
The circuit computes C = A × B for the matrices below.

Matrix A (activations):

| 1 | 2 | 3 |
| 4 | 5 | 6 |
| 7 | 8 | 9 |

Matrix B (weights):

| 2 | 0 | 1 |
| 0 | 2 | 0 |
| 1 | 0 | 2 |

Matrix C = A × B (result):

| 5 | 4 | 7 |
| 14 | 10 | 16 |
| 23 | 16 | 25 |

Click Start to begin. The total latency for an N×N multiply is 3N cycles (9 for our 3×3). Where does the 3 come from?
- N cycles to feed data — each row of PEs receives N activation values, one per cycle.
- N−1 cycles for staggered injection — because each PE’s partial-sum output goes through a register, there’s a one-cycle propagation delay per PE vertically. Row r must start r cycles late so its activations arrive at the same time as the partial sum traveling down from the row above.
- N cycles for vertical propagation — the partial sum must travel through N registered stages (one per PE) before the final result appears at the bottom.
That’s N + (N−1) + N = 3N−1 data cycles, plus 1 for weight loading = 3N total. If the vertical partial-sum path were combinational instead of registered — meaning the entire column settles in a single cycle with no propagation delay — you wouldn’t need staggered injection or vertical wait time, and the total would drop to just 2N cycles. But a combinational chain 256 PEs deep can’t close timing at real TPU clock speeds. Registers break that critical path, trading latency for a design that actually synthesizes.
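The arithmetic above is quick to check in code. The function names are illustrative; the formulas are just the ones stated in the text, evaluated for the 3×3 demo and for a 256×256 array.

```python
def registered_latency(n: int) -> int:
    """Feed (N) + staggered injection (N-1) + vertical propagation (N) + 1 weight-load cycle."""
    return n + (n - 1) + n + 1


def combinational_latency(n: int) -> int:
    """Total quoted in the text for a hypothetical combinational vertical path."""
    return 2 * n


for n in (3, 256):
    print(n, registered_latency(n), combinational_latency(n))
```

For N = 256 the registered design costs 768 cycles against 512, a 50% latency premium paid once per matrix multiply in exchange for a short critical path.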
Click Start and watch. The first few cycles look quiet — no results appear. That’s pipeline fill: data is flowing rightward through pipeline registers and partial sums are building downward through registered stages, but nothing has reached the bottom of a column yet. Then results start appearing in a diagonal wavefront: C[0][0] first (shortest path), then the next anti-diagonal, and so on until C[2][2] (longest path through the array). This is exactly what you’d see on a real chip — pipeline latency, then a steady stream of results.
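To make the wavefront concrete, here is a small cycle-by-cycle sketch of a weight-stationary array with registered data and partial-sum paths. It is a simplified software model, not the circuit itself: weights are assumed to be preloaded, there is no output register, and the tick numbering starts at the first data injection, so its tick counts are not meant to reproduce the 3N figure exactly. The function name and the emerged_at bookkeeping are illustrative.

```python
def systolic_matmul(A, B):
    """Simulate an N x N weight-stationary systolic array; PE[r][c] holds B[r][c]."""
    n = len(A)
    psum = [[0] * n for _ in range(n)]  # registered partial-sum outputs
    data = [[0] * n for _ in range(n)]  # registered data outputs
    C = [[None] * n for _ in range(n)]
    emerged_at = [[None] * n for _ in range(n)]

    for t in range(3 * n - 2):
        new_psum = [[0] * n for _ in range(n)]
        new_data = [[0] * n for _ in range(n)]
        for r in range(n):
            for c in range(n):
                if c == 0:
                    i = t - r  # staggered injection: row r starts r ticks late
                    d_in = A[i][r] if 0 <= i < n else 0
                else:
                    d_in = data[r][c - 1]  # from the PE to the left
                p_in = psum[r - 1][c] if r > 0 else 0  # from the PE above
                new_psum[r][c] = p_in + d_in * B[r][c]
                new_data[r][c] = d_in
        psum, data = new_psum, new_data  # all registers update together

        # C[i][c] leaves the bottom of column c on tick i + (n - 1) + c.
        for c in range(n):
            i = t - (n - 1) - c
            if 0 <= i < n:
                C[i][c] = psum[n - 1][c]
                emerged_at[i][c] = t
    return C, emerged_at


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[2, 0, 1], [0, 2, 0], [1, 0, 2]]
C, emerged_at = systolic_matmul(A, B)
print(C)
print(emerged_at)
```

In this model C[i][c] emerges on tick i + c + (N−1), so entries on the same anti-diagonal appear together: C[0][0] first and C[2][2] last.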
This is the same fundamental architecture that powers Google’s TPU. A real TPUv1 has a 256×256 systolic array — 65,536 processing elements performing 92 trillion 8-bit operations per second. The principles are identical: weights are loaded once and held stationary, activations flow right through pipeline registers, and partial sums accumulate through registered stages down each column. Staggered data injection keeps everything synchronized.
What you just watched is the same process that happens inside every TPU inference. Matrix A holds activations from the previous layer. Matrix B holds the model’s trained weights. The systolic array computes the matrix product in a pipelined wavefront — weights are loaded once, then activations stream through at full speed. The result feeds into the next layer. Repeat for every layer in the network.
The systolic design is powerful because it maximizes data reuse. Each activation value is read once from memory and multiplied by every weight in its row as it flows rightward. Each weight is loaded once and used for every activation that passes through its PE over time. This sharply reduces the memory bandwidth the array needs, and memory bandwidth is often the limiting factor when running large language models on GPUs.
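A rough counting model shows the scale of that reuse. It is not a measurement: it assumes a naive implementation fetches both operands from memory for every multiply-add, and it ignores the caches and register reuse that real processors also exploit. The function names are illustrative.

```python
def operand_reads_naive(n: int) -> int:
    """N^3 multiply-adds, each fetching one activation and one weight."""
    return 2 * n ** 3


def operand_reads_weight_stationary(n: int) -> int:
    """Each of the N^2 activations and N^2 weights is read exactly once."""
    return 2 * n ** 2


for n in (3, 256):
    naive = operand_reads_naive(n)
    reused = operand_reads_weight_stationary(n)
    print(n, naive, reused, naive // reused)
```

Under these assumptions the reduction factor is N: 3 for the demo array and 256 for a TPUv1-sized one.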