Bit-Identical Goldens + iai-callgrind

The first seven posts of the rewrite changed how almost every piece of the evaluator worked. Seven perf gates hit their targets. Seven places where the engine got faster.

Which is only half of the contract. The other half is that not a single pixel of observable output should have changed. Every test patch that rendered a given image before the rewrite renders the same image after. Every numerical output that was X is still X. Every pin value that propagated in order A-then-B still propagates A-then-B.

This post is how I proved it.

Two tiers of golden

The existing golden test framework (used throughout the blog for visual confirmation, 3D scenes, particles) is a family-similarity test: render a patch, compare to a reference PNG, assert SSIM ≥ 0.998 and max ΔE76 < 1.0. That’s a few LSB of tolerance and maybe a pixel or two of drift. Perfectly appropriate when you’re developing a renderer — floating-point math wiggles, GPU drivers update, hardware changes — and you’re checking for “looks the same to a human,” not “is byte-identical.”

For the rewrite it’s the wrong tolerance. A data-layout change that somehow perturbs a colour by one LSB means the change isn’t what it claims to be. SSIM 0.999 hiding a real semantic bug is the exact failure mode I’m trying to prevent.

So the rewrite’s regression suite is a stricter tier. Two checks, both must pass:

SSIM ≥ 0.9999, which allows a twentieth of the dissimilarity the family gate does (1 − SSIM of 0.0001 against 0.002). Tolerates single-LSB gradient drift from the GPU driver’s fused-multiply-add precision but nothing else.
Max per-channel u8 diff ≤ 1: no pixel is off by more than a single LSB in any channel. Zero tolerance would flake on HSV gradients (where the conversion math accumulates an LSB of rounding noise across the span); two LSBs would let visible banding through. One LSB is the calibrated number.

Both gates together are tighter than the family gate and looser than “byte-identical PNG”, which is the right place — byte-identical would fail on any GPU driver upgrade that touches the floating-point pipeline, and we don’t want to ship a regression suite that flakes on version bumps.
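A minimal sketch of how the two checks could combine, assuming both images are already decoded to same-sized u8 buffers and the SSIM score has been computed elsewhere. The function and constant names are hypothetical, not the harness's actual API.

```rust
/// Thresholds quoted in the post for the lockdown tier.
pub const MIN_SSIM: f64 = 0.9999;
pub const MAX_CHANNEL_DIFF: u8 = 1;

/// Largest absolute difference between corresponding channel bytes.
/// Panics if the buffers differ in length, since a size mismatch is
/// already a failed golden.
pub fn max_channel_diff(expected: &[u8], actual: &[u8]) -> u8 {
    assert_eq!(expected.len(), actual.len(), "images must have identical dimensions");
    expected
        .iter()
        .zip(actual)
        .map(|(&e, &a)| e.abs_diff(a))
        .max()
        .unwrap_or(0)
}

/// Both checks must hold; `ssim` is assumed to come from a separate SSIM routine.
pub fn passes_lockdown(ssim: f64, expected: &[u8], actual: &[u8]) -> bool {
    ssim >= MIN_SSIM && max_channel_diff(expected, actual) <= MAX_CHANNEL_DIFF
}
```

Keeping the two checks separate matters: a single pixel that is off by two LSBs barely moves SSIM on a large image, and a broad one-LSB shift never trips the per-channel limit.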

The new test files are tests/phase1_baseline_goldens.rs and tests/phase2_baseline_goldens.rs, corresponding to the two eras of pre-rewrite test patches (2D patches from the first half of the blog; 3D patches from the later half). Every patch paired with a PNG under tests/goldens/ gets a lockdown test. Patches without a paired PNG get an #[ignore]’d placeholder so the coverage gap is visible in the test listing rather than silently absent.
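For illustration, an ignored placeholder might look roughly like this. The patch name, the helper, and the one-PNG-per-patch naming scheme are assumptions for the sketch; only the `#[ignore]` attribute and the tests/goldens/ directory come from the text above.

```rust
use std::path::PathBuf;

/// Hypothetical helper: where the reference PNG for a patch would live,
/// assuming goldens are named after the patch.
pub fn golden_path(patch_name: &str) -> PathBuf {
    PathBuf::from("tests/goldens").join(format!("{patch_name}.png"))
}

/// Placeholder for a patch that has no paired PNG yet. The test runner
/// reports it as "ignored" along with the reason string, so the gap
/// shows up in every test listing.
#[test]
#[ignore = "no reference PNG paired with this patch yet"]
fn lockdown_unpaired_patch() {
    let golden = golden_path("unpaired_patch");
    unimplemented!("render the patch and compare against {}", golden.display());
}
```

Once a PNG is added, the placeholder is replaced by a real lockdown test and the ignored count drops by one.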

The diff output writes a 3-panel PNG (expected, actual, diff) to target/test-diffs/ on failure. Same shape as the family harness — when a test fails, a reviewer can open the diff and see exactly which pixels moved and by how much. This caught one genuine regression during the rewrite (a spread iteration order change that shifted an SDF’s sample pattern by a fraction of a pixel) that had passed the family gate without me noticing.
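A rough sketch of how such a 3-panel image could be assembled from two RGBA8 buffers. The function name and the amplification factor are assumptions; PNG encoding and writing to target/test-diffs/ are left out because they depend on whichever image library the harness uses.

```rust
/// Hypothetical gain so that a 1-LSB difference is visible in the diff panel.
pub const DIFF_GAIN: u8 = 64;

/// Returns an RGBA8 buffer `3 * width` pixels wide and `height` tall:
/// expected | actual | amplified per-channel absolute difference.
pub fn compose_diff_panels(expected: &[u8], actual: &[u8], width: usize, height: usize) -> Vec<u8> {
    assert_eq!(expected.len(), width * height * 4, "expected buffer has wrong size");
    assert_eq!(actual.len(), expected.len(), "images must have identical dimensions");

    let row = width * 4;
    let mut out = Vec::with_capacity(row * 3 * height);
    for y in 0..height {
        let e = &expected[y * row..(y + 1) * row];
        let a = &actual[y * row..(y + 1) * row];
        out.extend_from_slice(e);
        out.extend_from_slice(a);
        for (pe, pa) in e.chunks_exact(4).zip(a.chunks_exact(4)) {
            // Colour channels carry the amplified difference; alpha stays opaque.
            for c in 0..3 {
                out.push(pe[c].abs_diff(pa[c]).saturating_mul(DIFF_GAIN));
            }
            out.push(255);
        }
    }
    out
}
```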

Warmup matters

One subtlety about golden tests on stateful patches. Particle systems, feedback loops, and TAA-enabled scenes all need to run for N frames before their output stabilises. The particles post introduced the --warmup N CLI flag specifically for this.

For the lockdown suite, warmup is a test-harness parameter. 3D scenes that use TAA or bloom history use WARMUP_FRAMES = 4 to match the temporal convergence the family suite assumes. Particles use 90 frames. 2D patches use 0 because none of them are stateful at this point.

If warmup were wrong, the lockdown would compare “frame 0” output against “frame 4” reference and fail every time. Keeping them in sync is a contract that the harness enforces.
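One way to picture warmup as a harness parameter, as a sketch only. The enum, the function names, and the frame callback are hypothetical; the frame counts (0, 4, 90) are the ones quoted above.

```rust
/// Hypothetical classification of test patches by how much state they carry.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PatchKind {
    Stateless2D,
    Temporal3D,
    Particles,
}

/// Frames to run and discard before capturing the frame that gets compared.
pub const fn warmup_frames(kind: PatchKind) -> u32 {
    match kind {
        PatchKind::Stateless2D => 0,
        PatchKind::Temporal3D => 4,
        PatchKind::Particles => 90,
    }
}

/// Runs the warmup frames, then returns the pixels of the next frame.
/// `render_frame` stands in for whatever the harness uses to step a patch.
pub fn render_after_warmup<F>(kind: PatchKind, mut render_frame: F) -> Vec<u8>
where
    F: FnMut(u32) -> Vec<u8>,
{
    let warmup = warmup_frames(kind);
    for frame in 0..warmup {
        // Advance temporal state (TAA history, particles) and discard the output.
        let _ = render_frame(frame);
    }
    render_frame(warmup)
}
```

If the reference PNG and the test both go through the same `warmup_frames` lookup, they cannot disagree about which frame is being compared.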

iai-callgrind

Criterion benchmarks measure wall-clock time. Which is the right unit for “how fast does this actually run”, and the wrong unit for “did this change regress by more than 1%.”

Wall-clock is noisy. On a CI runner with neighbour VM contention, thermal throttling, and whatever scheduling the kernel is doing, the exact same code can run 3-5% slower on one run than the next. A 1% regression gate on wall-clock numbers would flap daily; a 5% gate would miss real regressions.

iai-callgrind is the fix. It runs benchmarks under valgrind’s callgrind, which counts CPU instructions executed rather than measuring wall time. Instruction counts are deterministic to within a handful of ops across runs on the same binary — the kernel scheduler, cache state, and neighbour VMs don’t affect them. A 1% gate on instruction counts is tight, reproducible, and catches real regressions without flapping.

The companion bench file (benches/graph_eval_iai.rs) duplicates the four main perf-gate benchmarks — 1000-linear, 1000-fanout, unchanged-subtree, connect-1000-node-graph — with the same graph shapes, same inputs, same evaluator calls. Only the measurement harness differs. Criterion numbers are for the human-readable “how fast”; iai numbers are for the CI regression gate.

The CI job lives in .github/workflows/ci-bench.yml and runs on ubuntu-latest (valgrind is a Linux thing; macOS and Windows don’t get the instruction-count gate). Each bench’s instruction count is compared against a baseline stored in the repo under benches/iai-baselines/. Any drift of more than 1% fails the job. Baselines update via a manual process (edit the file, commit, justify in the PR description).
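The comparison itself is simple enough to sketch. This version is an assumption about how the gate could be expressed, not the CI script itself: it treats drift symmetrically (a large unexplained drop fails too) and uses integer arithmetic so the 1% boundary is exact.

```rust
/// Returns true when `measured` is within `max_drift_percent` of `baseline`.
/// Counts are instruction totals as reported by callgrind.
pub fn within_gate(baseline: u64, measured: u64, max_drift_percent: u64) -> bool {
    // |measured - baseline| / baseline <= pct / 100, rearranged to avoid
    // division and floating point; u128 keeps the products from overflowing.
    let drift = u128::from(measured.abs_diff(baseline)) * 100;
    let allowed = u128::from(baseline) * u128::from(max_drift_percent);
    drift <= allowed
}
```

With a baseline of 1,000,000 instructions, 1,010,000 passes a 1% gate and 1,010,001 fails it. A gate that tight on wall-clock numbers would not be usable.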

The full regression matrix

With both tiers in place, every rewrite commit runs through:

cargo test --workspace: the ~4,000 existing unit and integration tests. Correctness gate.
Bit-identical golden tests: every pre-rewrite test patch renders SSIM ≥ 0.9999 / max-diff ≤ 1 against its reference. Visual regression gate.
cargo bench (criterion): the seven perf gates. Target gate.
cargo bench --bench graph_eval_iai (iai-callgrind): instruction-count regression at 1%. Regression gate.

A commit lands only if all four pass. Three passes and one failure is still a no-merge: the commit goes back for revision. No partial credit.

This sounds annoying and is. It’s also the reason I can now cite P1 through P7 as green without anyone having to trust me — the evidence is in CI, visible on every PR, reproducible on any Linux machine.

Proptest for eval invariants

One more regression tool landed during this wave. tests/graph_eval_proptest.rs uses the proptest crate to generate random valid graphs, run them through the evaluator, and assert a set of invariants:

Every node’s outputs match what a direct call to node.process() would produce, given the same inputs.
The order nodes evaluate in is a valid topological order of the graph.
Parallel and sequential evaluation produce bit-identical outputs.
Adding and removing edges in arbitrary orders, then re-evaluating, produces the same output as evaluating the final graph from scratch.
Undo followed by redo brings every node’s state back to exactly where it was.
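The topological-order invariant is the easiest of these to write down as a standalone predicate. A sketch, assuming nodes are identified by `usize` and edges are `(from, to)` pairs; a property test would assert this predicate on the order the evaluator reports for each generated graph.

```rust
use std::collections::HashMap;

/// True if `order` lists every node at most once and every edge points
/// from an earlier position to a later one.
pub fn is_valid_topological_order(order: &[usize], edges: &[(usize, usize)]) -> bool {
    let mut position = HashMap::with_capacity(order.len());
    for (index, &node) in order.iter().enumerate() {
        if position.insert(node, index).is_some() {
            return false; // a node was evaluated twice
        }
    }
    edges.iter().all(|(from, to)| match (position.get(from), position.get(to)) {
        (Some(f), Some(t)) => f < t,
        _ => false, // an edge endpoint was never evaluated
    })
}
```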

proptest generates ~1000 random graphs per test run, each with 10-100 nodes and randomised connections. The shrinker finds the smallest counter-example when any invariant fails. Over the week I spent running this in a tight loop, it found two bugs I hadn’t seen any other way:

A rare case in the parallel evaluator where two nodes at the same level both wrote to a shared scratch buffer. Fixed by thread-local scratch.
A Pearce-Kelly edge case where removing a node whose ordinal was at the extreme of the range left the ordinal table in a slightly inconsistent state that didn’t cause immediate failure but tripped an assertion on a later operation. Fixed in the remove path.
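For the first bug, the general shape of a thread-local scratch buffer in Rust looks something like the following. This is a generic sketch using the standard library's `thread_local!`, not the evaluator's actual code; the element type and helper name are assumptions.

```rust
use std::cell::RefCell;

thread_local! {
    // Each worker thread gets its own buffer, so two nodes evaluated
    // concurrently on different threads can never write to the same scratch memory.
    static SCRATCH: RefCell<Vec<f32>> = RefCell::new(Vec::new());
}

/// Lends the calling thread's scratch buffer, zeroed and sized to `len`,
/// to `f`. The allocation is reused across calls on the same thread.
pub fn with_scratch<R>(len: usize, f: impl FnOnce(&mut [f32]) -> R) -> R {
    SCRATCH.with(|cell| {
        let mut buf = cell.borrow_mut();
        buf.clear();
        buf.resize(len, 0.0);
        f(buf.as_mut_slice())
    })
}
```

One caveat of this shape: calling `with_scratch` again from inside `f` on the same thread would panic on the second `borrow_mut`, which is arguably the right failure for a buffer that is meant to have one user at a time.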

Both would have survived the hand-written tests. Both were caught by random generation. I’m going to keep proptest in the regression rotation for any future graph-engine work.

The one genuine regression I caught

During the rewrite, the bit-identical lockdown tripped exactly once. An SDF test patch showed a ~0.3 LSB drift on a gradient that was supposed to be pixel-identical.

Root cause: the rayon parallel eval was iterating a spread in a slightly different order than the sequential eval, and the spread happened to contain float values whose summation order affected the final result in the last ULP. (a + b) + c ≠ a + (b + c) for floats, and the parallel version was accumulating in a different associativity.
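The effect is easy to reproduce in isolation. A small sketch with deliberately extreme values (the real drift was in the last ULP, not this dramatic): summing the same three f32 values in two different orders gives two different answers.

```rust
/// Accumulates left to right: ((v0 + v1) + v2) + ...
pub fn sum_forward(values: &[f32]) -> f32 {
    values.iter().sum()
}

/// Accumulates right to left: ((vN + vN-1) + vN-2) + ...
pub fn sum_reverse(values: &[f32]) -> f32 {
    values.iter().rev().sum()
}
```

For `[1.0e8, -1.0e8, 1.0]`, the forward sum cancels the two large values first and returns 1.0. The reverse sum adds 1.0 to -1.0e8 first, where f32 spacing is 8, so the 1.0 is rounded away and the result is 0.0.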

Fix: the order-sensitive reducer nodes (Sum, Average, StdDev) now use a deterministic parallel reduction (a tree reduction over a fixed sort order) instead of rayon’s default work-stealing reduction. The tree reduction does the same amount of work as the default but fixes the association order, so the result no longer depends on how the work happened to be split across threads.
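A sketch of the idea using only the standard library rather than rayon, so it is an illustration of the technique and not the evaluator's implementation. It assumes the input slice is already in its fixed order; the threshold and function names are made up for the example. The key property is that the split points depend only on the slice length, never on thread scheduling.

```rust
use std::thread;

/// Below this length the reduction stays on the calling thread.
const PARALLEL_THRESHOLD: usize = 1024;

/// Sums with a fixed binary tree: always split at the midpoint,
/// always compute left + right.
pub fn tree_sum(values: &[f32]) -> f32 {
    match values.len() {
        0 => 0.0,
        1 => values[0],
        n => {
            let (left, right) = values.split_at(n / 2);
            tree_sum(left) + tree_sum(right)
        }
    }
}

/// Same tree as `tree_sum`, but the two halves of a large input are
/// reduced on separate threads. The additions happen in the same
/// association order, so the result is bit-identical.
pub fn tree_sum_parallel(values: &[f32]) -> f32 {
    if values.len() < PARALLEL_THRESHOLD {
        return tree_sum(values);
    }
    let (left, right) = values.split_at(values.len() / 2);
    let (l, r) = thread::scope(|s| {
        let handle = s.spawn(|| tree_sum_parallel(left));
        let r = tree_sum_parallel(right);
        (handle.join().expect("reduction thread panicked"), r)
    });
    l + r
}
```

A work-stealing reduction, by contrast, picks its split points at runtime based on which threads are idle, which is why its association order can change from run to run.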

Without the lockdown, I wouldn’t have caught this. The family gate would have passed (0.3 LSB is well within its tolerance). The perf bench would have passed. Everything would have shipped, and one user somewhere would eventually have noticed that their fractal flame accumulator looked very slightly different after the upgrade and we’d have had no idea why. Tight gates catch the bugs tight gates are for.

The framework’s shape

Lockdown goldens now ship as a general test class, not just for this rewrite. Any future structural change that claims to preserve observable output routes through the same suite: bit-identical tier for structural changes, family tier for renderer changes, both running in CI. New test patches picking up a paired PNG get lockdown coverage automatically.

The --bit-identical mode of the golden harness will probably also be useful for external contributors, who can point at their branch and say “lockdown passes” as shorthand for “structural only, no observable change.” Which is the kind of shorthand that makes review tractable.

Next post is the final post in the rewrite series. Three last allocation hunts that close the remaining gap between “very fast” and “demonstrably allocation-free at steady state.”

I have no idea what I’m doing or if any of this is right, but it’s fun. Follow along.