Simten

ChaCha20 in Hardware

AES needs dedicated CPU instructions to run fast. ChaCha20 was designed to need nothing special: just addition, XOR, and bit rotation, applied over 80 quarter-rounds, on any hardware that exists. The irony: that simplicity makes it unusually elegant in silicon too.

Build the core quarter-round from logic gates, verify it against the RFC 7539 test vector, and see why a cipher designed to escape hardware ends up being one of the cleanest things you can build in gates.

Interactive tutorial / ~8 min read / Built with Simten

Three Operations. That’s It.

ChaCha20 belongs to a family of ciphers called ARX — built entirely from Addition, Rotation, and XOR. No S-boxes, no lookup tables, no multiplication. This makes it extremely fast in software and cheap in hardware.

ADD mixes bits through carries: a change in one bit can ripple upward into every higher bit of the word. XOR mixes bits position by position, with no carries. Together they combine bitwise mixing with carry-driven mixing across positions of the 32-bit word.

Try changing the input values below (decimal). The Adder produces a modular sum (wrapping at 2^32), while BusXor flips bits independently. Notice that the outputs are shown in hex: the same 32 bits, just a more convenient representation at this scale.

The Three Operations: ADD, XOR, ROTL
The entire ChaCha20 cipher is built from just these three operations on 32-bit words. Try changing a and b.
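
If you want to check the two operations outside the simulator, here is a minimal Python sketch. The helper names add32 and xor32 are illustrative and not part of the tutorial's circuits; the mask stands in for the fixed 32-bit width of the hardware bus.

```python
MASK32 = 0xFFFFFFFF  # keeps every result inside a 32-bit word


def add32(a: int, b: int) -> int:
    """Modular addition: wraps at 2**32, like a 32-bit adder that drops its carry-out."""
    return (a + b) & MASK32


def xor32(a: int, b: int) -> int:
    """Bitwise XOR: each output bit depends only on the same bit position of a and b."""
    return (a ^ b) & MASK32


# The carry ripples through all 32 bits and falls off the top, so the sum wraps to 0.
print(hex(add32(0xFFFFFFFF, 1)))  # 0x0
# XOR of the same inputs changes bit 0 only; no other position is affected.
print(hex(xor32(0xFFFFFFFF, 1)))  # 0xfffffffe
```

The same pair of inputs shows the contrast: the addition disturbs every bit through the carry chain, while the XOR touches a single position.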

Rotation: Zero-Cost Diffusion

The third operation is left bit-rotation. In software, it is (x << n) | (x >> (32-n)) on a 32-bit word. In hardware it is even simpler: because the rotation amount is fixed, you just rewire the bits. No gates, no gate delay, no switching power. It’s essentially free.

Rotation moves high bits to low positions and low bits to high positions, ensuring that the carry diffusion from addition spreads across the entire word. ChaCha20 uses four specific rotation amounts — 16, 12, 8, and 7 — carefully chosen to maximize diffusion after just a few rounds.

Try changing the input value below (decimal). Start with 1: RotateLeft16 will output 65536 (bit 0 moved to position 16), and RotateLeft7 will output 128 (bit 0 moved to position 7). You can watch a single bit travel to its new position.

Rotation: The Free Operation
Left rotation rearranges bits with zero gate delay. In silicon, it's just rewiring.
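
The sketch below shows rotation two ways in Python: the shift-and-OR form used in software, and a bit-by-bit "wiring" model that mirrors what the hardware does. Both function names are illustrative, and the wiring model is only a way of picturing the rewiring, not how the simulator implements RotateLeft.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    """Software form: rotate a 32-bit word left by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def rotl32_wiring(x: int, n: int) -> int:
    """Hardware view: output bit i is simply wired to input bit (i - n) mod 32."""
    in_bits = [(x >> i) & 1 for i in range(32)]
    out_bits = [in_bits[(i - n) % 32] for i in range(32)]
    return sum(bit << i for i, bit in enumerate(out_bits))


print(rotl32(1, 16))               # 65536
print(rotl32(1, 7))                # 128
print(hex(rotl32(0x80000000, 1)))  # 0x1 (the top bit wraps around to bit 0)
print(rotl32_wiring(0x01234567, 12) == rotl32(0x01234567, 12))  # True
```

The wiring version contains no arithmetic at all, only a fixed permutation of bit positions, which is why a fixed rotation costs no gates.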

Chaining: ADD → XOR → Rotate

Each of the four steps in a quarter-round chains the three operations together. Take the first step:

a += b;    // modular addition
d ^= a;    // XOR new a into d
d <<<= 16; // rotate d left by 16

The addition creates carry-chain diffusion. The XOR mixes that result into d. The rotation spreads the mixed bits across the entire word. One step alone is weak — but four steps, repeated across 20 rounds, make the output indistinguishable from random.

In the circuit below, follow the data flow from left to right: the Adder feeds the BusXor, which feeds the RotateLeft16. Change any input (decimal) and the entire chain recomputes instantly.

One ARX Step: ADD, XOR, Rotate
Each of the 4 steps in a quarter-round chains ADD -> XOR -> ROTL.
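
As a rough software counterpart to the chained circuit, one ARX step might be sketched like this. The function name arx_step and its argument order are assumptions made for illustration; the input values are the a, b, and d words of the RFC 7539 quarter-round test vector.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def arx_step(a: int, b: int, d: int, r: int):
    """One ARX step: a += b; d ^= a; d <<<= r. Returns the new (a, d)."""
    a = (a + b) & MASK32  # Adder: modular sum
    d ^= a                # BusXor: mix the new a into d
    d = rotl32(d, r)      # RotateLeft: spread the mixed bits
    return a, d


a, d = arx_step(0x11111111, 0x01020304, 0x01234567, 16)
print(hex(a), hex(d))  # 0x12131415 0x51721330
```

Note that the rotation by 16 simply swaps the two 16-bit halves of d, which is easy to confirm by eye in the printed value.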

The Full Quarter-Round

Now we chain four ARX steps together, each feeding into the next. The quarter-round takes four 32-bit words (a, b, c, d) and thoroughly mixes them:

a += b;  d ^= a;  d <<<= 16;
c += d;  b ^= c;  b <<<= 12;
a += b;  d ^= a;  d <<<= 8;
c += d;  b ^= c;  b <<<= 7;

Notice how each step feeds the next: step 1 modifies a and d, step 2 uses the new d to modify c and b, and so on. By the end, every input word has influenced every output word.
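
One way to get a feel for this mixing is to flip a single input bit and count how many output bits change. The following is a small experiment sketch, not a formal diffusion measurement; the helper differing_bits and the choice of flipping bit 0 of a are arbitrary illustrations.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    return ((x << n) | (x >> (32 - n))) & MASK32


def quarter_round(a: int, b: int, c: int, d: int):
    """The four chained ARX steps listed above."""
    a = (a + b) & MASK32; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 7)
    return a, b, c, d


def differing_bits(x, y) -> int:
    """Count the bit positions in which two tuples of 32-bit words differ."""
    return sum(bin(p ^ q).count("1") for p, q in zip(x, y))


base = (0x11111111, 0x01020304, 0x9B8D6F43, 0x01234567)
flipped = (base[0] ^ 1,) + base[1:]  # flip only bit 0 of a
changed = differing_bits(quarter_round(*base), quarter_round(*flipped))
print(f"{changed} of 128 output bits changed")
```

Try flipping other bits or other words: the count varies with the position, which is part of why the cipher repeats the quarter-round many times rather than relying on one pass.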

The circuit below is loaded with the RFC 7539 test vector. Change any input and the four output hex displays update instantly — this entire circuit is pure combinational logic, computed in a single propagation with zero clock cycles.

ChaCha20 Quarter-Round
The complete quarter-round -- 4 chained ARX steps. Verified against RFC 7539 test vector.

Verified against RFC 7539 §2.1.1

In: a=0x11111111 b=0x01020304 c=0x9b8d6f43 d=0x01234567

Out: a=0xea2a92f4 b=0xcb1cf8ce c=0x4581472e d=0x5881c4bb
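
The same check can be reproduced in software. Below is a minimal Python sketch of the quarter-round, asserted against the RFC 7539 §2.1.1 values shown above; the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def quarter_round(a: int, b: int, c: int, d: int):
    """ChaCha20 quarter-round on four 32-bit words."""
    a = (a + b) & MASK32; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 7)
    return a, b, c, d


result = quarter_round(0x11111111, 0x01020304, 0x9B8D6F43, 0x01234567)
expected = (0xEA2A92F4, 0xCB1CF8CE, 0x4581472E, 0x5881C4BB)
assert result == expected, "quarter-round does not match the RFC 7539 test vector"
print(" ".join(f"{word:08x}" for word in result))  # ea2a92f4 cb1cf8ce 4581472e 5881c4bb
```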


From Quarter-Round to Full Cipher

A full ChaCha20 block operates on a 4×4 matrix of 32-bit words — 512 bits of state. The matrix is initialized with four constants, an eight-word key, a counter, and a three-word nonce:

"expa"  "nd 3"  "2-by"  "te k"   ← constants
 key[0]  key[1]  key[2]  key[3]
 key[4]  key[5]  key[6]  key[7]
counter nonce[0] nonce[1] nonce[2]
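
A sketch of how this matrix could be assembled in Python follows. It assumes the RFC 7539 layout (32-byte key, 32-bit block counter, 12-byte nonce, all words little-endian); the function name initial_state is illustrative.

```python
import struct

# "expand 32-byte k" read as four little-endian 32-bit words.
CONSTANTS = struct.unpack("<4I", b"expand 32-byte k")


def initial_state(key: bytes, counter: int, nonce: bytes) -> list:
    """Build the 16-word state: constants, key, counter, nonce (row by row)."""
    if len(key) != 32 or len(nonce) != 12:
        raise ValueError("expected a 32-byte key and a 12-byte nonce")
    return [
        *CONSTANTS,                  # row 0: constants
        *struct.unpack("<8I", key),  # rows 1-2: key[0..7]
        counter & 0xFFFFFFFF,        # row 3: block counter
        *struct.unpack("<3I", nonce) # row 3: nonce[0..2]
    ]


state = initial_state(bytes(range(32)), 1, bytes.fromhex("000000090000004a00000000"))
for row in range(4):
    print(" ".join(f"{w:08x}" for w in state[4 * row: 4 * row + 4]))
```

With the key 00 01 ... 1f used here, the first key word prints as 03020100, which makes the little-endian byte order easy to see.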

Each double-round applies the quarter-round to four columns, then four diagonals:

// Column round
QR(0, 4,  8, 12)  QR(1, 5,  9, 13)
QR(2, 6, 10, 14)  QR(3, 7, 11, 15)

// Diagonal round
QR(0, 5, 10, 15)  QR(1, 6, 11, 12)
QR(2, 7,  8, 13)  QR(3, 4,  9, 14)

Ten double-rounds (20 rounds total, 80 quarter-rounds) produce the final state. The original input is added back word-by-word — this makes the function non-invertible, a critical property for a stream cipher.
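
Putting the pieces together, the block function can be sketched in a few lines of Python. This is an illustrative model that assumes the RFC 7539 layout and little-endian serialization; the names qr and chacha20_block are not from the tutorial, and the code is meant for study, not production use.

```python
import struct

MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    return ((x << n) | (x >> (32 - n))) & MASK32


def qr(s: list, a: int, b: int, c: int, d: int) -> None:
    """Apply the quarter-round in place to state words s[a], s[b], s[c], s[d]."""
    s[a] = (s[a] + s[b]) & MASK32; s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32; s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32; s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32; s[b] = rotl32(s[b] ^ s[c], 7)


def chacha20_block(key: bytes, counter: int, nonce: bytes) -> bytes:
    """Produce one 64-byte keystream block."""
    init = (list(struct.unpack("<4I", b"expand 32-byte k"))
            + list(struct.unpack("<8I", key))
            + [counter & MASK32]
            + list(struct.unpack("<3I", nonce)))
    s = list(init)
    for _ in range(10):  # ten double-rounds = 20 rounds
        # Column round
        qr(s, 0, 4, 8, 12); qr(s, 1, 5, 9, 13)
        qr(s, 2, 6, 10, 14); qr(s, 3, 7, 11, 15)
        # Diagonal round
        qr(s, 0, 5, 10, 15); qr(s, 1, 6, 11, 12)
        qr(s, 2, 7, 8, 13); qr(s, 3, 4, 9, 14)
    # Feed-forward: add the original input back, word by word.
    out = [(x + y) & MASK32 for x, y in zip(s, init)]
    return struct.pack("<16I", *out)


block = chacha20_block(bytes(range(32)), 1, bytes.fromhex("000000090000004a00000000"))
print(block[:16].hex())
```

The inputs above are those of the block-function test vector in RFC 7539 §2.3.2, whose serialized block begins with the bytes 10 f1 e7 e4, so the printed prefix can be compared against the RFC directly.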

The quarter-round you built above is the only non-trivial building block. The rest is just wiring: choosing which four words from the matrix to feed into each QR call. In an ASIC or FPGA, you’d instantiate 4 quarter-round blocks and iterate them over 20 cycles (one round per cycle), or unroll all 80 for single-cycle throughput at the cost of area.
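
The reason four blocks fit naturally is that the four QR calls within a round touch disjoint words, so none of them has to wait for another. The sketch below checks that property and tabulates a deliberately simplified cycle count; cycles_per_block is a hypothetical back-of-the-envelope model that ignores control logic, the final feed-forward addition, and timing closure.

```python
COLUMN_ROUND = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)]
DIAGONAL_ROUND = [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]

TOTAL_QUARTER_ROUNDS = 20 * 4  # 20 rounds, 4 quarter-rounds each


def is_parallel(round_indices) -> bool:
    """True if the four QR calls together touch all 16 words exactly once."""
    touched = [i for quad in round_indices for i in quad]
    return sorted(touched) == list(range(16))


def cycles_per_block(qr_units: int) -> int:
    """Simplified model: each QR unit evaluates one quarter-round per cycle."""
    if TOTAL_QUARTER_ROUNDS % qr_units != 0:
        raise ValueError("this toy model expects a divisor of 80")
    return TOTAL_QUARTER_ROUNDS // qr_units


print(is_parallel(COLUMN_ROUND), is_parallel(DIAGONAL_ROUND))  # True True
for units in (1, 4, 80):
    print(f"{units:2d} QR units -> {cycles_per_block(units)} cycles per block")
```

In this model, 80 units means the rounds are chained combinationally, so the single cycle would have to be long enough for a signal to pass through 20 quarter-rounds in series. That is the area and clock-speed trade-off behind the fully unrolled option.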

Designed to Avoid Hardware — Elegant in Hardware Anyway

Daniel Bernstein designed ChaCha20 in 2008 with a specific goal: a cipher that would be fast without dedicated hardware support. Intel had just announced the AES-NI instructions for x86, but most of the world’s devices (mobile phones, IoT chips, older ARM cores) had no such luxury. ChaCha20’s ARX primitives (add, rotate, XOR) map to cheap, universally available instructions on any architecture, with no lookup tables and therefore no table-driven cache-timing side channels. That’s why Cloudflare and others adopted ChaCha20-Poly1305 for TLS: it’s the cipher that runs fast in software on anything.

The interesting irony is what happens when you do put it in hardware anyway. The same simplicity that makes ChaCha20 fast in software (just three operations, applied over 80 quarter-rounds) turns out to make it unusually elegant in gates. Four adders, four 32-bit XOR stages, and four fixed rotations that are pure wiring: that’s the entire quarter-round datapath. A cipher designed to escape the need for specialized hardware turns out to synthesize into some of the cleanest silicon you can build.

For comparison, AES was designed with hardware implementation in mind, yet one round of AES requires the S-box (a 256-entry lookup table in software), Galois field multiplication, and multiple XOR trees just for the MixColumns step. That complexity is a large part of why Intel built AES-NI into the CPU: without dedicated silicon, table-based AES is comparatively slow and can be vulnerable to cache-timing attacks. ChaCha20 has no such baggage.