Hi everyone,

When I sat down to continue writing about Willow’s system-level benchmarks, I didn’t make it past the first one: quantum error correction. It truly is one of the finest rabbit holes in this field 😄. But laying out the process over and over again matters, because it explains what is being benchmarked, why those benchmarks exist, and why improving the building blocks underneath remains so critical.

In my earlier breakdown, we looked at Willow’s qubit-level specs: qubit counts, coherence times, gate fidelities. But those alone don’t make a processor useful. Many labs can show >100 μs coherence on a handful of qubits, yet we don’t hear about them achieving logical performance. Why?

Because good physical qubits are only the starting point.

Qubits drift. Defects appear. You need constant recalibration. And even when they look good, the native error rates are still far too high to do anything useful. That’s why we group physical qubits into logical qubits - distributing information across many qubits to make it more robust.

Some qubits in this ‘logical’ group store the data, while others act as ancillas whose only job is to catch errors. They get entangled with the data qubits and measured again and again and again. The measurements don’t correct anything directly - they only give us “syndrome” information: a history of where something might have gone wrong. A fast decoder then has to interpret these syndromes and keep track of the most likely errors in real time.
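The idea above has a simple classical analogue: a 3-bit repetition code, where we never read the data bits directly, only the parities between neighbours. A toy sketch (my own illustration, nothing to do with Willow’s actual decoder):

```python
# Toy classical analogue of syndrome extraction: a 3-bit repetition code.
# We never look at the data itself - only at parities between neighbours,
# just as ancillas only reveal where data qubits disagree.

def syndrome(data):
    """Parity checks between neighbouring bits - the 'ancilla' readout."""
    return [data[0] ^ data[1], data[1] ^ data[2]]

def decode(syn):
    """Map a syndrome to the most likely single-bit error (or None)."""
    return {(1, 0): 0, (1, 1): 1, (0, 1): 2, (0, 0): None}[tuple(syn)]

encoded = [1, 1, 1]      # logical "1" spread over three bits
encoded[1] ^= 1          # a bit flip strikes the middle bit
syn = syndrome(encoded)  # -> [1, 1]: both checks fire
print(decode(syn))       # -> 1: the decoder locates the error without reading the data
```

A real surface-code decoder faces the same task, just over a 2D grid of checks and with measurement noise on top.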

Timing here is everything.

Superconducting qubits decohere in ~100 μs. That means the error detection cycle has to be orders of magnitude faster, otherwise errors pile up. Willow runs cycles of 1.1 μs, close to a million per second. Each cycle includes ancilla reset (tens of ns), entangling gates with data qubits (40–200 ns each, run in parallel), ancilla measurement (200–400 ns), and processing of the measurement data.
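A quick back-of-the-envelope check that these pieces really fit into 1.1 μs. The numbers below are mid-range values picked from the figures above (plus an assumed classical-processing slot), not Willow’s actual calibration data:

```python
# Rough budget for one error-detection cycle, using illustrative
# mid-range values - not Willow's real calibration numbers.

NS = 1e-9
cycle = {
    "ancilla reset":         50 * NS,  # "tens of ns"
    "entangling layers": 4 * 100 * NS, # assumed 4 parallel two-qubit gate layers, 40-200 ns each
    "ancilla measurement":  350 * NS,  # 200-400 ns
    "classical processing": 200 * NS,  # forwarding syndromes to the decoder (assumed)
}
total = sum(cycle.values())
print(f"{total / NS:.0f} ns per cycle")       # ~1000 ns, consistent with Willow's 1.1 us
print(f"{1 / total:,.0f} cycles per second")  # close to a million
```

The exact split doesn’t matter much; the point is that every component has to be squeezed into about a microsecond, every microsecond, forever.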

Now, note that there is still no “correction” in the physical sense.

Instead, this is where the so-called Pauli frames come in. The Pauli operators—X, Y, Z—are the simplest forms of quantum errors: a bit flip, a phase flip, or both. If the decoder detects such an error, instead of immediately giving the instruction to flip the qubit in hardware, the system just records it in a Pauli frame.

You can think of the frame as a classical ledger that keeps track of how the logical qubit should be reinterpreted. The quantum state is left untouched. During the computation, all gates are applied as usual, but in the background the Pauli frame is continuously updated. Only at the very end - when the logical state is read out - is the accumulated frame applied in software to interpret the result correctly.
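A minimal sketch of that ledger, to make it concrete. This is my own illustrative structure, not Google’s control software: one XOR-accumulated X and Z flag per qubit, updated by the decoder and consulted only at readout.

```python
# Minimal sketch of a Pauli frame: a classical ledger of accumulated
# X/Z corrections per qubit, applied only when interpreting readout.
# (Illustrative structure, not Google's actual control stack.)

class PauliFrame:
    def __init__(self, n_qubits):
        self.x = [0] * n_qubits  # pending bit-flip corrections
        self.z = [0] * n_qubits  # pending phase-flip corrections

    def record(self, qubit, pauli):
        """The decoder reports an inferred error; the hardware is untouched."""
        if pauli in ("X", "Y"):
            self.x[qubit] ^= 1   # two identical flips cancel, hence XOR
        if pauli in ("Z", "Y"):
            self.z[qubit] ^= 1

    def interpret(self, qubit, measured_bit):
        """At the final Z-basis readout, reinterpret the raw outcome in software."""
        return measured_bit ^ self.x[qubit]

frame = PauliFrame(1)
frame.record(0, "X")          # decoder inferred a bit flip
frame.record(0, "X")          # ...and later a second one: the ledger is clean again
print(frame.interpret(0, 1))  # -> 1: the raw outcome stands
```

Note that Z errors never even affect a final Z-basis readout - one more reason bookkeeping beats physically flipping qubits.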

This is faster, cleaner, and far more scalable than attempting physical corrections on the fly.

So how do we know if all of this actually works?

That’s where the scaling factor Λ (Lambda) comes in. Λ is a central metric for error correction because it captures whether your logical qubit actually improves as you scale the code.

If Λ < 1, then more qubits only make things worse: the system accumulates noise faster than it can handle.

If Λ ≈ 1, you are at break-even: error correction is just about holding the line, with logical qubits no better than their physical components.

But if Λ > 1, you’ve crossed into the regime where each increase in code distance compounds reliability. The larger the logical qubit, the lower the effective error rate. That is the transition point between error correction being a demonstration and it becoming a scalable tool.

Willow reports Λ = 2.14. This means the logical error rate per cycle improved by a factor of 2.14 with each step up in code distance (from 3 to 5 to 7). Adding more qubits to the code genuinely improves Willow’s performance, showing that the surface code is doing its job.
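In practice Λ is just a ratio of measured logical error rates at successive code distances. A sketch with hypothetical placeholder rates (deliberately round numbers, not Willow’s published data):

```python
# How Λ is read off: the ratio of logical error rates per cycle at
# successive code distances. Rates below are hypothetical placeholders.

error_per_cycle = {3: 4.0e-3, 5: 2.0e-3, 7: 1.0e-3}  # distance -> logical error rate

lambdas = [error_per_cycle[d] / error_per_cycle[d + 2] for d in (3, 5)]
for d, lam in zip((3, 5), lambdas):
    print(f"Lambda({d}->{d + 2}) = {lam:.2f}")  # each > 1: scaling up helps
```

Because each distance step d → d+2 divides the error rate by Λ, the suppression compounds exponentially as the code grows - which is exactly why Λ > 1 is the threshold between demonstration and scalable tool.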

So finally, we get to a system-level metric. It’s not enough to show that Λ > 1 in isolation. We need to know how error correction holds up under the weight of a task.

Ideally that would be a useful application. But at this stage of quantum computing, stress tests will do. And stress tests are nothing unusual: they are part of the playbook in any field of hardware engineering. So instead of debating the choice of application that was run on Willow, I prefer to view Google’s large random circuit sampling experiment simply for what it is: a stress test, pushing the full stack to its limits and showing that the system can hold together.

More on the details of random circuit sampling and XEB next time 🙂

Talk to you soon,
