Simten

Sorting Networks

A fixed wiring of comparators that sorts any input in the same number of steps — no branches, no loops, just parallel hardware. The algorithm behind network switch fabrics, GPU sort, and median filters.

Interactive tutorial/~8 min read/Built with Simten

The Compare-and-Swap Primitive

Every sorting network is built from a single building block: the compare-and-swap. Give it two values and it always puts the smaller on lo and the larger on hi. If the inputs are already in order, nothing changes. If they are out of order, they are swapped. Either way, the operation completes in the same number of gate delays.

There is no branch instruction deciding which path to take. A Comparator asserts its lt output when a < b, and that single bit drives two Mux nodes — one to route the minimum, one to route the maximum. The circuit evaluates in parallel every cycle regardless of the values flowing through it.

Try changing inputs a and b below. No matter what values you enter, lo always shows the smaller and hi always shows the larger.

Compiling...
Compare-and-Swap
Two inputs enter, the smaller exits on `lo` and the larger on `hi`. No branches — a Comparator drives two Muxes.

Building a Sort Network

String five compare-and-swaps together in the right order and you have a network that sorts four values in three parallel stages. This is Batcher’s odd-even merge sort for n = 4:

Stage 1: (0,1) (2,3)
Stage 2: (0,2) (1,3)
Stage 3:       (1,2)

Each pair notation (i, j) means “compare elements at positions i and j; put the smaller at i.” Pairs in the same stage have no data dependencies so they run simultaneously — the hardware evaluates all of Stage 1’s comparators at once, then all of Stage 2’s, then Stage 3.

The intuition behind why it works: Stage 1 sorts each pair independently (the two “halves”). Stage 2 merges the minimums together and the maximums together — after it, positions 0 and 3 already hold the global minimum and maximum. Stage 3 fixes the only remaining disorder: positions 1 and 2.

Change any of the four input values below. The outputs update instantly — this circuit has no clock, no loop, and no conditional logic anywhere.

Compiling...
4-Element Batcher Sort Network
Three stages, five comparators. Any four 8-bit values emerge sorted in ascending order. Change the inputs and the result updates instantly.

Adding a Pipeline

The combinational network above sorts instantly — but it can only process one set of inputs at a time. While the comparators are busy evaluating one frame of values, no new data can enter.

The fix is straightforward: insert a register bank between each comparator stage. Registers capture the intermediate results on every clock edge, so the three stages become independent. Stage 1 can start sorting a new set of inputs at the same moment Stage 2 is processing the previous set and Stage 3 is finishing the one before that.

The first sorted result takes 3 clock cycles to emerge — one per stage. But after that initial latency, a new sorted result arrives every single cycle. Throughput is 1 sort/cycle regardless of how long the combinational logic inside each stage takes. This is exactly how real FPGA and ASIC sort engines are built.

Step through the circuit below and watch the values propagate stage by stage. After 3 ticks the outputs stabilize into ascending order; every tick after that could carry a completely different input batch through the same pipeline.

Compiling pipelined circuit...

Why Hardware Needs Sorting Networks

A software sort like quicksort or mergesort has data-dependent branches and pointer chasing through memory. The CPU must predict which branch to take, stall on cache misses, and retire instructions one at a time through the reorder buffer. Latency is unpredictable, and pipelining across the sort boundary is difficult.

A sorting network sidesteps all of that. The wiring is fixed at design time — no control flow, no memory access pattern that depends on the data. Every comparator in each stage evaluates in parallel. The depth of the network (the number of sequential stages) is O(log² n), so a 16-element Batcher network needs only 10 stages. Timing is completely deterministic.

Batcher-Banyan switch fabrics wired sorting networks together in ATM switches during the 1990s. A Batcher sorter ranked incoming cells by destination address; a Banyan network then routed them conflict-free. The whole switch ran at line rate because every path through the silicon took exactly the same number of gate delays.

Median filters in image-processing ASICs use small sorting networks (typically 9 or 25 elements for 3×3 or 5×5 kernels) to extract the median pixel value without distorting edges the way a blur would. Sorting every 3×3 neighborhood in software would be a branch-heavy inner loop; in silicon it is just wires and comparators that run every pixel clock.

GPU sort pipelines (Bitonic sort, Odd-even merge sort) use sorting networks as the inner kernel. A wavefront of threads each owns a small slice; the fixed compare-and-swap pattern maps cleanly onto SIMD lanes because every thread executes the same instruction at the same time — no divergence, maximum throughput.

Finally, a pipelined sorting network achieves sustained throughput of one sort per clock cycle. Add a register stage between each comparator layer and you can feed a new set of values every tick. Our 4-element network has 3 stages, so after 3 cycles of latency the first sorted result arrives — and then one arrives every cycle thereafter. Software can never match that for fixed-size inputs.