Accelerator • ~60 nodes • Interactive tutorial • ~10 min read

AES in Hardware

AES-GCM encrypts roughly 80% of HTTPS traffic today. It’s so ubiquitous that Intel, AMD, and ARM all built dedicated silicon to accelerate it: AES-NI on x86 and the Cryptography Extensions on ARMv8, CPU instructions that execute a full AES round in one or two instructions.

Why does AES need hardware help when something like ChaCha20 doesn’t? Because AES is genuinely complex at the gate level. One round requires a 256-entry lookup table (the S-box), Galois field multiplication, and multiple layers of XOR mixing. Without dedicated silicon, software has to choose between lookup tables, which are fast but vulnerable to cache-timing attacks, and constant-time code, which is slower.

In this post, you’ll build three hardware building blocks of an AES round: SubBytes, XTime (the GF(2⁸) multiply-by-2 at the heart of MixColumns), and the full MixColumns step. You’ll verify each against test vectors consistent with FIPS 197.

SubBytes: The Lookup Table

The first step of every AES round is SubBytes: each of the 16 bytes in the state matrix is independently replaced using a fixed 256-entry lookup table called the S-box. Input byte 0x53 always maps to 0xed. Input 0x00 always maps to 0x63. The mapping is non-linear — that’s the entire point.

In hardware, this is a ROM. You present an 8-bit address, and one clock edge later you have your 8-bit substitution. That’s it. The non-linearity that makes AES cryptographically strong comes entirely from this table lookup — there’s no ALU operation that could replace it cheaply.

Try a few inputs below. The ROM is pre-loaded with the full FIPS 197 S-box. Change the decimal value (0–255) and watch the output update:

input 0x00 = 0 → S-box output 0x63 = 99
input 0x53 = 83 → S-box output 0xed = 237
input 0xff = 255 → S-box output 0x16 = 22


SubBytes: S-Box Lookup

Each byte is replaced via a 256-entry lookup table. Try 0x00 (-> 0x63), 0x53 (-> 0xed), or 0xff (-> 0x16). Pre-loaded with FIPS 197 S-box.

Why not compute it?

The S-box is derived from multiplicative inverses in GF(2⁸) followed by an affine transformation. You could compute this on the fly with a GF inverter circuit, but even compact designs need on the order of a hundred gates and a long combinational path, which adds latency to every single round. A ROM trades area for speed, which is exactly the right trade-off in a pipelined AES core. A 256×8-bit ROM is tiny in silicon terms: just 2048 bits of storage.
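To make the derivation concrete, here is a minimal Python sketch that rebuilds the table from the inverse-plus-affine definition. The helper names (`xtime`, `gf_mul`, `gf_inv`, `rotl8`, `sbox_entry`) are illustrative choices, not part of the tutorial's circuit, and the inverse is computed the slow way as x²⁵⁴ rather than with a real inverter design.

```python
def xtime(x: int) -> int:
    """Multiply by 2 in GF(2^8), reducing by the AES polynomial 0x11b."""
    x <<= 1
    return (x ^ 0x11B) if x & 0x100 else x


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse as a^254; 0 maps to 0 by AES convention."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x: int, n: int) -> int:
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox_entry(x: int) -> int:
    """GF(2^8) inverse followed by the AES affine transformation."""
    b = gf_inv(x)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


SBOX = [sbox_entry(x) for x in range(256)]

for value in (0x00, 0x53, 0xFF):
    print(f"0x{value:02x} -> 0x{SBOX[value]:02x}")
# 0x00 -> 0x63
# 0x53 -> 0xed
# 0xff -> 0x16
```

The loop in `gf_inv` performs 254 field multiplications per entry, which is fine for generating a table once but hints at why a per-round inverter is unattractive next to a single ROM read.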

XTime: Multiplication in GF(2⁸)

MixColumns needs to multiply bytes by constants, but AES does its arithmetic in GF(2⁸), the Galois field with 256 elements. There’s no normal multiply here: addition is XOR, and the core operation is XTime, multiplication by 2.

XTime(x) = ((x << 1) AND 0xff) XOR (bit7 ? 0x1b : 0)

Left-shift by one (multiply by 2 in the polynomial ring) and keep the low 8 bits. If the MSB was set before the shift, the result has overflowed 8 bits, so XOR with 0x1b to reduce it back. That value is the low byte of 0x11b, the bit pattern of the AES irreducible polynomial x⁸ + x⁴ + x³ + x + 1; the x⁸ term cancels the bit that was shifted out. The reduction keeps the result in 8 bits while preserving the multiplicative structure of GF(2⁸). Reducing by any other polynomial would produce results that are incompatible with AES.

In hardware this is just a one-bit shift, a single bit-test, and one conditional XOR. The circuit below does exactly that — follow the path from the left-shifter through the Mux (controlled by bit 7) to the final XOR.

Try these values (decimal):

input 87 (0x57), bit7=0, no reduction → output 174 (0xae)
input 128 (0x80), bit7=1, reduction XOR 0x1b → output 27 (0x1b)
input 149 (0x95), bit7=1, reduction XOR 0x1b → output 49 (0x31)


XTime: Multiply by 2 in GF(2^8)

Left-shift, then XOR with 0x1b if the MSB was 1. Try 87 (0x57 -> 0xae), 128 (0x80 -> 0x1b), or 149 (0x95 -> 0x31).
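The shift, bit test, and conditional XOR map directly onto a few lines of software. This is a minimal Python sketch of the same datapath, not the tutorial's circuit itself; the function name `xtime` and the variable names are illustrative.

```python
def xtime(x: int) -> int:
    """Multiply one byte by 2 in GF(2^8)."""
    shifted = (x << 1) & 0xFF            # left shift, keep the low 8 bits
    reduction = 0x1B if x & 0x80 else 0  # mux controlled by bit 7 of the input
    return shifted ^ reduction


for value in (0x57, 0x80, 0x95):
    print(f"0x{value:02x} -> 0x{xtime(value):02x}")
# 0x57 -> 0xae
# 0x80 -> 0x1b
# 0x95 -> 0x31
```

Applying `xtime` repeatedly gives multiplication by 4, 8, 16, and so on. Starting from 0x57 the chain is 0xae, 0x47, 0x8e, 0x07, which matches the worked example in FIPS 197.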

MixColumns: The Diffusion Layer

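MixColumns treats each 4-byte column of the AES state matrix as a vector over GF(2⁸) and multiplies it by a fixed 4×4 matrix (equivalently, it multiplies the column, read as a polynomial, by a fixed polynomial modulo x⁴ + 1). Each output byte mixes all four input bytes: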

r0 = 2·s0 ⊕ 3·s1 ⊕ s2   ⊕ s3
r1 = s0   ⊕ 2·s1 ⊕ 3·s2 ⊕ s3
r2 = s0   ⊕ s1   ⊕ 2·s2 ⊕ 3·s3
r3 = 3·s0 ⊕ s1   ⊕ s2   ⊕ 2·s3

Where 2·x is XTime(x) and 3·x is XTime(x) ⊕ x. That’s it: four XTime operations and sixteen byte-wide XORs per column (twelve to combine the terms, plus four to form the 3·x products). Four columns means sixteen XTimes and sixty-four byte-wide XORs for the full MixColumns step.

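The goal of this matrix is diffusion: every output byte of a column depends on every input byte of that column. Change one input bit, and all four output bytes change. ShiftRows then moves those bytes into different columns, so after two rounds every byte of the state depends on every input byte. That’s what makes AES cryptographically strong after just a few rounds: SubBytes provides non-linearity, and MixColumns ensures that non-linearity spreads everywhere.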
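The diffusion claim is easy to check exhaustively for one column. This sketch writes MixColumns in its matrix form and flips each of the 32 input bits in turn, counting how many output bytes change. The names `MIX`, `mul`, and `bytes_changed` are illustrative, and the sample column is an arbitrary choice.

```python
MIX = [
    [2, 3, 1, 1],
    [1, 2, 3, 1],
    [1, 1, 2, 3],
    [3, 1, 1, 2],
]


def xtime(x):
    """Multiply by 2 in GF(2^8)."""
    return ((x << 1) & 0xFF) ^ (0x1B if x & 0x80 else 0)


def mul(coeff, x):
    """Multiply byte x by a MixColumns coefficient (1, 2, or 3)."""
    if coeff == 1:
        return x
    if coeff == 2:
        return xtime(x)
    return xtime(x) ^ x  # coeff == 3


def mix_column(s):
    """Matrix-vector product over GF(2^8); addition is XOR."""
    out = []
    for row in MIX:
        acc = 0
        for coeff, byte in zip(row, s):
            acc ^= mul(coeff, byte)
        out.append(acc)
    return out


def bytes_changed(column, byte_index, bit):
    """Flip one input bit and count how many output bytes differ."""
    flipped = list(column)
    flipped[byte_index] ^= 1 << bit
    before, after = mix_column(column), mix_column(flipped)
    return sum(a != b for a, b in zip(before, after))


column = [0xDB, 0x13, 0x53, 0x45]
counts = {bytes_changed(column, i, b) for i in range(4) for b in range(8)}
print(counts)  # {4}
```

Every single-bit flip changes all four output bytes because no entry of the matrix is zero, so a nonzero difference in one input byte reaches every row.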

The circuit below is verified against a standard MixColumns test vector, computed per the FIPS 197 definition of the transformation (Section 5.1.3). The inputs are pre-loaded with that test vector, and you can change them to explore other values:

FIPS 197 Test Vector

Input column: s0 = 0xdb (219), s1 = 0x13 (19), s2 = 0x53 (83), s3 = 0x45 (69)

Expected output: r0 = 0x8e (142), r1 = 0x4d (77), r2 = 0xa1 (161), r3 = 0xbc (188)


MixColumns: One Column

FIPS 197 test vector: [0xdb, 0x13, 0x53, 0x45] -> [0x8e, 0x4d, 0xa1, 0xbc]. Four bytes in, four bytes out, completely mixed.
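The same check can be reproduced in software. This is a minimal Python sketch that mirrors the circuit's structure (four shared XTime results, then XORs), assuming the row equations given earlier; the function and variable names are illustrative.

```python
def xtime(x):
    """Multiply by 2 in GF(2^8)."""
    return ((x << 1) & 0xFF) ^ (0x1B if x & 0x80 else 0)


def mix_column(s0, s1, s2, s3):
    """MixColumns on one column: four XTimes, then byte-wide XORs."""
    d0, d1, d2, d3 = xtime(s0), xtime(s1), xtime(s2), xtime(s3)  # 2*s_i
    r0 = d0 ^ (d1 ^ s1) ^ s2 ^ s3   # 2*s0 + 3*s1 + s2 + s3
    r1 = s0 ^ d1 ^ (d2 ^ s2) ^ s3   # s0 + 2*s1 + 3*s2 + s3
    r2 = s0 ^ s1 ^ d2 ^ (d3 ^ s3)   # s0 + s1 + 2*s2 + 3*s3
    r3 = (d0 ^ s0) ^ s1 ^ s2 ^ d3   # 3*s0 + s1 + s2 + 2*s3
    return r0, r1, r2, r3


result = mix_column(0xDB, 0x13, 0x53, 0x45)
print([f"0x{b:02x}" for b in result])
# ['0x8e', '0x4d', '0xa1', '0xbc']
```

Each XTime result is computed once and reused for both the 2·x and 3·x terms, which is the sharing a hardware implementation would also exploit.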

Why AES-NI Exists

One AES-128 round applies all four steps (SubBytes, ShiftRows, MixColumns, AddRoundKey) to a 128-bit state. AES-128 runs ten rounds in total after an initial AddRoundKey: nine full rounds, plus a final round that omits MixColumns. In software on a general-purpose CPU, each full round requires:

– 16 table lookups for SubBytes (memory-dependent, with cache pressure)
– 3 row rotations (12 byte moves) for ShiftRows
– 64 GF(2⁸) coefficient multiplications (half of them by 1) and 48 XORs for MixColumns
– 16 XORs for AddRoundKey

That’s well over a hundred operations per round, ten rounds per block, and a timing side channel if the S-box lookups hit different cache lines for different key bytes. A table-based software AES is fast enough, but it’s not safe enough without constant-time mitigation.
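To see the four steps together, here is a straightforward (and deliberately not constant-time) Python sketch of one full round. The state is assumed to be 16 bytes stored column by column, the S-box is derived rather than hard-coded to keep the example short, and all function names are illustrative. The sample state and round key are taken from round 1 of the cipher example in FIPS 197, Appendix B.

```python
def xtime(x):
    """Multiply by 2 in GF(2^8)."""
    return ((x << 1) & 0xFF) ^ (0x1B if x & 0x80 else 0)


def gf_mul(a, b):
    """Shift-and-add multiply in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox():
    """Derive the S-box: GF(2^8) inverse, then the affine transformation."""
    table = []
    for x in range(256):
        inv = 1
        for _ in range(254):  # x^254 is the inverse; 0 stays 0
            inv = gf_mul(inv, x)
        table.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                     ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return table


SBOX = build_sbox()


def aes_round(state, round_key):
    """One full AES round on 16 bytes stored column by column."""
    # SubBytes: 16 table lookups (the data-dependent memory accesses)
    state = [SBOX[b] for b in state]
    # ShiftRows: row r rotates left by r columns
    state = [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]
    # MixColumns: one 4-byte column at a time
    mixed = []
    for c in range(4):
        s0, s1, s2, s3 = state[4 * c:4 * c + 4]
        d0, d1, d2, d3 = xtime(s0), xtime(s1), xtime(s2), xtime(s3)
        mixed += [
            d0 ^ (d1 ^ s1) ^ s2 ^ s3,
            s0 ^ d1 ^ (d2 ^ s2) ^ s3,
            s0 ^ s1 ^ d2 ^ (d3 ^ s3),
            (d0 ^ s0) ^ s1 ^ s2 ^ d3,
        ]
    # AddRoundKey: 16 XORs
    return [m ^ k for m, k in zip(mixed, round_key)]


state = list(bytes.fromhex("193de3bea0f4e22b9ac68d2ae9f84808"))
round_key = list(bytes.fromhex("a0fafe1788542cb123a339392a6c7605"))
print(bytes(aes_round(state, round_key)).hex())
# a49c7ff2689f352b6b5bea43026a5049
```

The `SBOX[b]` lookups index memory with secret-dependent values, which is exactly the access pattern that leaks through the cache in table-based implementations.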

What AES-NI Actually Does

Intel’s AESENC instruction, introduced in 2010, performs one full AES round in a single instruction. The S-box becomes a hardware lookup with no memory traffic. ShiftRows is just wiring. MixColumns uses dedicated GF(2⁸) multiplier circuits. The entire datapath is constant-time by construction: there are no branches, no memory accesses, no cache lines to leak.

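Modern cores pipeline AESENC: each instruction has a latency of a few cycles, but a new one can issue every cycle, so with several independent blocks in flight (as in CTR or GCM mode) bulk encryption runs at roughly one clock cycle per byte or better. On a 3 GHz core that’s on the order of 3 GB/s or more per core, fast enough that bulk-encryption overhead in TLS is negligible on any server with a modern CPU.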

The contrast with ChaCha20 is stark. ChaCha20 was designed to be fast without hardware support: its ARX operations (add, rotate, XOR) are cheap and constant-time on any CPU. AES maps naturally onto hardware, but in software its fast table-based implementations leak timing and its constant-time implementations are slower. This is why Cloudflare and others deploy ChaCha20-Poly1305 for clients that lack AES hardware acceleration (older mobile devices, IoT) while defaulting to AES-GCM everywhere else.

The MixColumns circuit you just built is the same computation that AES-NI performs in dedicated silicon, minus the pipelining and the parallel SubBytes lookups. In a real ASIC implementation, you could unroll all ten rounds and instantiate four MixColumns blocks per round (one per column), with pipeline registers between rounds, giving you a fully pipelined AES core that accepts a new block every clock cycle.
