The Graph Engine Rewrite: Why

The last several posts were all “Lux got faster” or “Lux got prettier.” This one is the opposite. Lux got slow, and I didn’t notice, and when I finally instrumented the frame I found a graph evaluator that was running on a data structure straight out of 2010.

This is the first of nine posts on rewriting the graph engine. It’s the manifesto post. No shipping code. No feature. Just the reason the next eight posts exist.

The embarrassing audit

The F8 profiler from the last post was the thing that finally broke the dam. I turned it on during a session where I was building out a real patch — not a demo, an actual piece with 500-odd nodes, real spreads, real texture chains. I expected the render time to dominate, because render time always dominated.

Render was fine. Eval was 2.3 ms. On a 1000-node synthetic stress test it was 3.1 ms. At a 120 FPS frame budget of 8.33 ms, two of those milliseconds were coming from the graph evaluator alone, before the GPU had done anything.

I sat down and profiled the evaluator. What I found, in order of how badly I cringed:

~8,000 String allocations per frame at 1000 nodes. Every input pin name, every output pin name, every HashMap<String, PinValue> key: all of them allocated fresh strings. Most of the strings were “in”, “out”, “radius”, “color”: the same ~50 strings, born and dying on every single frame.
~1,000,000 string hashes per second at 120 FPS. Every pin lookup went through HashMap::get(&str), and every lookup computed a hash of the pin name. The hashes were identical from frame to frame, and we recomputed them anyway. We were hashing the string “radius” a thousand times per second of clock time.
Full input-map .clone() per node per frame. The process context was built by deep-cloning every input map. Even for nodes that didn’t read any inputs. Even for frames where no inputs had changed.
Full prev_inputs.clone() per node per frame for change detection. Dirty-checking was implemented by cloning last frame’s inputs and comparing them structurally to this frame’s. For a spread of 10,000 values, that meant cloning 10,000 PinValues and then walking them element-by-element. At which point you may as well have just re-run the node.
Seven parallel HashMap<NodeId, _> sidecars. The graph stored nodes, inputs, outputs, connections, metadata, selection, and dirty state as seven independent maps keyed by the same NodeId. Every lookup hit seven cache lines. Every iteration walked seven different structures that should have been one.
Serial for idx in 0..topo_buf.len() single-thread evaluation. Rayon was in the dependency tree. We were auto-parallelising spread iteration inside a node. We were not parallelising the graph itself.
Full Kahn topological sort on every wire edit. Every time you dragged a connection, every time a node got added, the whole graph got re-sorted from scratch. O(V + E), which at 1000 nodes was ~180 microseconds of work on every connect(). Not a disaster, but a clear reflection of “we hadn’t thought about incremental edits.”
Spread(Vec<PinValue>) deep-clones on every wire hop. A 10,000-element spread crossing a wire was 10,000 PinValue::clone()s. A chain of 50 nodes passing that spread through was 500,000 clones per frame. At 120 FPS, that’s 60 million clones per second to do nothing to the data.

Every single one of those was solvable. Every single one of those was also load-bearing for every benchmark I’d published. The bloom chain I measured at “0.4 ms” last month assumed zero graph-eval overhead. Put it on a 500-node graph and suddenly your 0.4 ms pass is running inside a 2.3 ms wrapper. Every 120 FPS claim I’ve made in these posts was measured against the wrong baseline.

I wrote myself a note that said, roughly: “Lux does not ship a 120 FPS claim until the graph engine matches the 2026 SOTA data-oriented design. Period.”

That’s the line I held myself to. No more “feels smooth.” No more “fast enough.” Real numbers. Real baselines.

The seven P-gates

The rewrite has seven measurable performance targets. Each one is a benchmark; each one has a regression gate in CI. Passing all seven is the definition of “done.”

#	Metric	Baseline (audit)	Target
P1	1000-node linear-chain eval p50	~2.3 ms	< 1.0 ms
P2	1000-node wide-fan-out eval p50	~1.8 ms	< 0.5 ms (rayon level-parallel)
P3	Connect op p50	~180 µs (full Kahn)	< 5 µs (Pearce-Kelly)
P4	1000-elt Spread<f64> 50-hop fan-out	~12 ms (deep-clone)	< 50 ns (Arc bump)
P5	Dirty-gate unchanged-subtree skip	full deep-compare	zero compares (gen-seq match)
P6	String allocations per frame	~8,000	0 (PinId-indexed)
P7	Pin-name hash ops / second	~1,000,000	0

P4 is the most dramatic number. Going from 12 ms for a spread fan-out to 50 ns is a factor of ~240,000. That’s not tuning; that’s a different shape. The current Spread(Vec<PinValue>) structurally cannot do better, because Clone on a Vec means “give me my own copy.” The new structure (Arc-wrapped columnar slices) makes each wire hop O(1) regardless of element count, because the clone is a refcount bump, not a copy.
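To make the refcount-bump point concrete, here is a minimal sketch using only the standard library. VecSpread, ArcSpread, and pass_through are hypothetical names for illustration, not Lux's actual types, and the real spread presumably carries more than a single f64 column.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the old layout: cloning copies every element.
#[derive(Clone)]
struct VecSpread(Vec<f64>);

/// Hypothetical stand-in for the new layout: cloning bumps a refcount.
#[derive(Clone)]
struct ArcSpread(Arc<[f64]>);

impl ArcSpread {
    /// Moves the Vec's contents into a shared, immutable slice.
    fn from_vec(values: Vec<f64>) -> Self {
        Self(Arc::from(values))
    }
}

/// Pass a spread across `hops` wires, cloning once per hop.
/// Every clone shares the same allocation, whatever the element count.
fn pass_through(spread: &ArcSpread, hops: usize) -> Vec<ArcSpread> {
    (0..hops).map(|_| spread.clone()).collect()
}
```

After 50 hops every ArcSpread still points at the same buffer, while a single VecSpread clone already owns a second copy of all the data.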

P6 and P7 are the ones that say “no strings in the hot path, ever.” Not “fewer strings.” Not “cache the strings.” None. The hot path has to be pin IDs, and pin IDs have to be integers.
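As a rough illustration of what “pin IDs have to be integers” can look like, the sketch below indexes a fixed array with a small integer handle. The PinId layout, the RADIUS and COLOR constants, and the Inputs struct are all assumptions made for this example; the real PinId and the pins! macro are covered in a later post and may look quite different.

```rust
/// Hypothetical integer pin handle (not Lux's real PinId).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct PinId(u16);

// Hand-written constants standing in for whatever a macro might generate.
const RADIUS: PinId = PinId(0);
const COLOR: PinId = PinId(1);
const PIN_COUNT: usize = 2;

/// Per-node input storage: a fixed array indexed by PinId.
/// No String keys, no hashing, no allocation on lookup.
struct Inputs {
    values: [f64; PIN_COUNT],
}

impl Inputs {
    /// Reads a pin with one bounds-checked array index.
    fn get(&self, pin: PinId) -> f64 {
        self.values[pin.0 as usize]
    }

    /// Writes a pin in place.
    fn set(&mut self, pin: PinId, value: f64) {
        self.values[pin.0 as usize] = value;
    }
}
```

A lookup such as inputs.get(RADIUS) compiles down to an array index, which is the property P6 and P7 are asking for.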

Every one of these gates is also an adversarial number. If the rewrite ships and one of them regresses by more than 3%, the whole thing reverts.

The four artist outcomes

The P-gates are the engineering goals. The actual point of the rewrite is four things an artist using Lux should feel:

#	Outcome	Why the rewrite delivers it
A1	Wire a new node during a live set with zero visible hitch at 120 FPS	Pearce-Kelly incremental topo: < 5 µs per connect
A2	Slider drag reflects in one frame on a 500-node patch	Per-pin gen-seq dirties only the downstream subtree
A3	Fan out 10,000 circles from one `LinearSpread` with no penalty	`Arc<[T]>` columnar spreads: O(1) wire hop
A4	Type WGSL edits into a Shadertoy node and see the next frame update	Per-pin gen-seq isolates the edit; rest of graph cached

These aren’t stretch goals. They’re the reasons to do any of this. The performance numbers exist in service of these four experiences, not the other way around. If the numbers all hit and the editor still feels laggy during a live set, the rewrite isn’t done.
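Outcomes A2 and A4 both lean on per-pin generation counters. The sketch below is one plausible reading of “gen-seq”, not the actual Lux implementation: an output pin bumps a counter when its value changes, and a downstream node re-runs only when the counter differs from the one it last saw. OutputPin and Consumer are hypothetical names.

```rust
/// Hypothetical output pin: a value plus a counter bumped on every real change.
struct OutputPin {
    value: f64,
    generation: u64,
}

impl OutputPin {
    fn new(value: f64) -> Self {
        Self { value, generation: 0 }
    }

    /// Writing a different value bumps the generation; rewriting the same value does not.
    fn write(&mut self, value: f64) {
        if self.value != value {
            self.value = value;
            self.generation += 1;
        }
    }
}

/// Hypothetical downstream node that remembers which generation it last consumed.
struct Consumer {
    last_seen: Option<u64>,
    runs: u32,
    cached: f64,
}

impl Consumer {
    fn new() -> Self {
        Self { last_seen: None, runs: 0, cached: 0.0 }
    }

    /// Re-runs only when the upstream generation changed. Returns true if it ran.
    fn eval(&mut self, upstream: &OutputPin) -> bool {
        if self.last_seen == Some(upstream.generation) {
            return false; // clean: one integer compare, no clone, no structural walk
        }
        self.cached = upstream.value * 2.0; // stand-in for real node work
        self.last_seen = Some(upstream.generation);
        self.runs += 1;
        true
    }
}
```

The cost of the clean path is independent of how large the pin's value is, which is the contrast with cloning and comparing last frame's inputs.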

The waves

Because a rewrite this big inside a shipping app is a disaster unless it’s sliced, the plan has three waves:

Wave 1 — data layout changes. SlotMap arena for nodes. Per-node generation counters. PinId + the pins! macro. The app decomposition into seven modules. The regression-gate benchmark harness. These can land independently and don’t change evaluator semantics.
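For readers who have not met a slot-map style arena before, here is a hand-rolled minimal version, assuming the general idea behind the slotmap crate rather than its actual API. NodeKey, Slot, and Arena are hypothetical names. Nodes live in one Vec, and a key carries a generation so that a stale key to a removed node cannot alias whatever reuses the slot.

```rust
/// Key into the arena: a slot index plus the generation it was issued for.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct NodeKey {
    index: u32,
    generation: u32,
}

struct Slot<T> {
    generation: u32,
    value: Option<T>,
}

/// Minimal generational arena (a sketch, not the slotmap crate).
struct Arena<T> {
    slots: Vec<Slot<T>>,
    free: Vec<u32>,
}

impl<T> Arena<T> {
    fn new() -> Self {
        Self { slots: Vec::new(), free: Vec::new() }
    }

    /// Reuses a free slot if one exists, otherwise appends a new one.
    fn insert(&mut self, value: T) -> NodeKey {
        if let Some(index) = self.free.pop() {
            let slot = &mut self.slots[index as usize];
            slot.value = Some(value);
            NodeKey { index, generation: slot.generation }
        } else {
            self.slots.push(Slot { generation: 0, value: Some(value) });
            NodeKey { index: (self.slots.len() - 1) as u32, generation: 0 }
        }
    }

    /// Removes the value and bumps the generation, invalidating outstanding keys.
    fn remove(&mut self, key: NodeKey) -> Option<T> {
        let slot = self.slots.get_mut(key.index as usize)?;
        if slot.generation != key.generation || slot.value.is_none() {
            return None;
        }
        slot.generation += 1;
        self.free.push(key.index);
        slot.value.take()
    }

    /// Returns the value only if the key's generation still matches the slot.
    fn get(&self, key: NodeKey) -> Option<&T> {
        let slot = self.slots.get(key.index as usize)?;
        if slot.generation == key.generation {
            slot.value.as_ref()
        } else {
            None
        }
    }
}
```

Storing a single node struct as T puts everything that used to be spread over parallel maps behind one indexed lookup.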

Wave 2 — algorithmic changes. Pearce-Kelly dynamic topo. Rayon level-parallel eval. Arc-wrapped spreads. These depend on the data changes from wave 1 but don’t depend on each other beyond that.
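The level-parallel idea can be sketched without rayon. The example below groups nodes into levels (a node's level is one past its deepest upstream node) and evaluates each level concurrently. It uses std::thread::scope in place of rayon so that it needs only the standard library, and it spawns one thread per node purely for illustration; a real evaluator would use a thread pool. The function names and the toy “sum of inputs plus one” node are assumptions.

```rust
use std::thread;

/// Group nodes into levels. Assumes `edges` is acyclic over nodes `0..node_count`.
fn level_buckets(node_count: usize, edges: &[(usize, usize)]) -> Vec<Vec<usize>> {
    let mut indegree = vec![0usize; node_count];
    let mut downstream: Vec<Vec<usize>> = vec![Vec::new(); node_count];
    for &(from, to) in edges {
        indegree[to] += 1;
        downstream[from].push(to);
    }
    // Level 0 is every node with no inputs.
    let mut current: Vec<usize> = (0..node_count).filter(|&n| indegree[n] == 0).collect();
    let mut levels = Vec::new();
    while !current.is_empty() {
        let mut next = Vec::new();
        for &node in &current {
            for &child in &downstream[node] {
                indegree[child] -= 1;
                // A child is released by its last (deepest) upstream node.
                if indegree[child] == 0 {
                    next.push(child);
                }
            }
        }
        levels.push(current);
        current = next;
    }
    levels
}

/// Evaluate level by level. Each toy node outputs the sum of its inputs plus one.
fn eval_levels(node_count: usize, edges: &[(usize, usize)]) -> Vec<u64> {
    let mut upstream: Vec<Vec<usize>> = vec![Vec::new(); node_count];
    for &(from, to) in edges {
        upstream[to].push(from);
    }
    let mut values = vec![0u64; node_count];
    for level in level_buckets(node_count, edges) {
        // Nodes in one level read only values from earlier levels, so they can run together.
        let results: Vec<(usize, u64)> = thread::scope(|s| {
            let mut handles = Vec::new();
            for &node in &level {
                let values = &values;
                let upstream = &upstream;
                handles.push(s.spawn(move || {
                    let sum: u64 = upstream[node].iter().map(|&u| values[u]).sum();
                    (node, sum + 1)
                }));
            }
            handles.into_iter().map(|h| h.join().unwrap()).collect::<Vec<_>>()
        });
        // Writes happen after the level finishes, so no thread sees a half-updated level.
        for (node, value) in results {
            values[node] = value;
        }
    }
    values
}
```

On a diamond graph (0 feeds 1 and 2, which both feed 3) the middle two nodes land in the same level and run side by side, which is the wide-fan-out case P2 targets.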

Wave 3 — verification and final perf. Bit-identical golden lockdown for every pre-rewrite test patch. iai-callgrind instruction-count regression gates. Proptest-based invariant checking. The last few allocations in the clean-subtree and value-unchanged paths.

Each wave has its own internal dependencies; none of them move unless their upstream wave is green. Every landing commit updates the bench baseline. Every regression fails CI before merge.

The regression gate contract

The rewrite ships against a specific contract:

cargo bench --workspace shows all 7 perf gates green AND every existing benchmark stays within ±3% of its pre-rewrite numbers, on the same hardware.

If a benchmark regresses by more than 3%, the rewrite doesn’t land. If any visual-quality test fails, the rewrite doesn’t land. If a live-set hitch exceeds 12.5 ms on phase2_cue_transitions.lux, the rewrite doesn’t land.
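The 3% rule reduces to a small comparison. The sketch below is a hypothetical helper, not the actual CI harness, and it assumes that only slowdowns count as regressions (a benchmark that gets more than 3% faster does not fail the gate).

```rust
/// Relative change of `current` against `baseline`, as a fraction (0.03 == 3%).
/// Positive means slower, negative means faster.
fn relative_change(baseline_ns: f64, current_ns: f64) -> f64 {
    (current_ns - baseline_ns) / baseline_ns
}

/// Hypothetical gate: passes unless the benchmark slowed down by more than `tolerance`.
fn passes_gate(baseline_ns: f64, current_ns: f64, tolerance: f64) -> bool {
    relative_change(baseline_ns, current_ns) <= tolerance
}
```

With a 3% tolerance, a 1000 ns baseline tolerates 1029 ns and rejects 1031 ns.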

That sounds harsh. It has to be. The reason to do a rewrite is that you’re willing to accept short-term risk for long-term correctness. The reason not to do a rewrite is that the short-term risk is higher than the long-term gain. A rewrite that ships with regressions is the worst of both.

What I’m going to write about

The next eight posts are each one chunk of the rewrite. In order:

SlotMap + generation counters — the new node storage.
PinId: the death of HashMap<String, _> — the macro and the constants.
Breaking up LuxApp — the seven-module decomposition.
Pearce-Kelly dynamic topo sort — incremental edges.
Parallel evaluation with rayon — level-buckets.
Spread becomes Arc<[T]> — columnar slices.
Bit-identical goldens + iai-callgrind — the regression gates.
The last few allocations — the final perf lap.

Nine posts total, counting this one. The rewrite itself took about three weeks. The posts are going to take about three months, because you think faster than you write, and I’ve been thinking about this rewrite for longer than I’ve been writing about it.

Honest note: I’m pretty sure about the direction of all of this. The P-gates, the waves, the artist outcomes — those all feel right. I’m much less sure about specific micro-decisions that will come up in the next eight posts. There are probably three places in the rewrite where I’ve picked the wrong variant of the right idea, and I don’t know which three yet. We’ll find out together.

I have no idea what I’m doing or if any of this is right, but it’s fun. Follow along.