The Last Few Allocations

This is the last post in the graph engine rewrite series. Nine posts total, counting the manifesto. The algorithmic core shipped two posts ago; the regression gates locked it down last post. What’s left is the final cleanup pass — three micro-optimisations that close the gap between “every gate is green” and “every gate is green with comfortable headroom.”

None of these are glamorous. Each one is a few lines. Cumulatively they take the unchanged-subtree benchmark from 886 µs to under 10 µs, which is the kind of improvement that only happens when you’re in the final stretch of a rewrite and looking for single-digit wins.

Value-unchanged gen-bump skip

The big one. The generation-counter dirty gate bumps a pin’s gen on every wire transfer. The gen-bump drives downstream dirty propagation: a downstream node’s should_evaluate compares input_pin_gens[pin] against last_eval_gen and runs if they differ.

The first version bumped the gen every time the wire transferred anything, regardless of whether the transferred value was actually different from what the pin already held. Which meant: a producer that outputs the same value every frame (a constant, an unchanged smoothing result, an idle spread) still caused its downstream to see gen bumps and re-run.

In my 1000-node linear chain benchmark, 100 always-dirty heads (Time, Mouse, whatever) re-run every frame and emit new outputs. The remaining 900 nodes are downstream of the heads and their inputs are unchanged frame-to-frame. The first rewrite version bumped gens on all 900 wire transfers, the 900 downstream nodes all re-evaluated, and the unchanged-subtree bench sat at 886 µs per frame.

The fix:

if !new_value.eq_cheap(&current_value) {
    input_pin_gens[pin] += 1;
}

The eq_cheap call is O(1) for every type thanks to the Arc-spread work. A wire transfer whose new value is pointer-identical to the existing value skips the gen bump. The downstream pin’s gen doesn’t move. should_evaluate returns false. The node skips its frame entirely.

That one check moves the 886 µs down to ~120 µs. Most of the cost was the gen bumps themselves and the subsequent dirty propagation they caused. Remove the cascade and most of the frame evaporates.

The invariant I had to preserve: any real mutation has to bump the gen. eq_cheap is allowed to return false when the values are actually equal (a false-negative — “I don’t know if they’re equal, assume not”), because that causes a spurious evaluation which is slow but not incorrect. It is NOT allowed to return true when the values differ (a false-positive), because that causes a missed evaluation and a stale output. Every eq_cheap impl is written to fail closed: return true only when pointer or bit equality is certain; return false otherwise.

A proptest enforces this: generate random pairs of PinValues, assert that eq_cheap(a, b) == true implies a == b (strict equality). Ran for 10,000 iterations, zero violations. That’s the correctness footing the whole free-pass depends on.

Per-target connection index

The second optimisation is structural. transfer_wire_data_into used to scan every row in connections to find the ones whose destination matched the current target node:

for conn in &connections {
    if conn.to_node == target_node {
        // transfer
    }
}

O(|E|) per target. For a 1000-node graph with 1500 edges, that’s 1500 compares per call, 1000 calls per frame (one per node), 1.5 million compares per frame just to find which wires feed which node. Every one of those compares was pointless if the node had no incoming wires, and most of them were pointless anyway.

The fix is a per-target adjacency index:

incoming_conn_idx: HashMap<NodeId, Vec<ConnectionIdx>>,

Built incrementally on every connect/disconnect (piggybacking on the Pearce-Kelly adjacency caches, which already maintain a per-node connection list for a similar reason). transfer_wire_data_into now reads only the indices pointed to by incoming_conn_idx[target], which is O(indegree) per target.

Total wire-transfer cost goes from O(|E|) per node to O(indegree) per node. Summed over all nodes, O(Σ indegree) = O(|E|) total, which is the theoretical minimum. Can’t do better than that without changing the problem.

For the unchanged-subtree bench, this takes the 120 µs from the previous optimisation down to ~25 µs. The remaining 25 µs is the per-node should_evaluate check itself, which can’t be skipped — the evaluator has to look at every node to decide whether to run it.

Or can it. One more optimisation below.

Processed-set predecessor short-circuit

The third optimisation is in dirty propagation, which happens at the start of the eval phase, before any node actually runs. The propagation walks forward from dirty roots marking everything downstream as dirty. The walk visits each node’s successors and marks them dirty.

The first rewrite version visited every reachable dirty node. For the unchanged-subtree case where only 100 of 1000 nodes are actually dirty, it still walked the full reachability closure of those 100 roots — which in the linear chain is all 1000 nodes, because the 100 dirty heads feed the 900-node downstream chain.

But most of those visits are wasted. A downstream node’s input_pin_gens only rose if the wire transfer actually bumped them, which (after the first fix above) it mostly doesn’t. So the downstream isn’t actually dirty. So visiting it and marking it dirty is wrong — it won’t re-evaluate anyway because should_evaluate returns false.

The fix is a two-pass approach. First pass: transfer wires (which bumps gen only when values actually change). Second pass: walk should_evaluate only on the nodes whose incoming gens might have changed, which is the subset of nodes whose predecessors actually re-ran and produced new outputs. Track this with a “actually-processed” set — only nodes that evaluated and produced output this frame count as dirty seeds for their successors.

let mut processed: SmallVec<[NodeId; 32]> = smallvec![];
for level in &graph.levels {
    level.par_iter().for_each(|&node_id| {
        if graph.should_evaluate(node_id) {
            graph.evaluate_one(node_id, ctx);
            processed.push(node_id);
        }
    });
    // Next level's should_evaluate only needs to check nodes whose
    // predecessors are in `processed`.
}

The processed set is typically small — maybe 10% of total nodes on a typical frame. Successors of nodes not in processed can skip the should_evaluate check entirely because their inputs haven’t changed since last frame.

This takes the 25 µs down to 8 µs on the 1000-node unchanged-subtree bench. Which is below the 10 µs P5 gate with comfortable headroom.

The counted-allocator test tightens

The counted-alloc test from the PinId post was asserting ≤ 1 allocation per frame. The 1-allocation slop was from the log pipeline (the timestamp formatter allocates a String per log line). Across this final wave, the last internal alloc site got tracked down: the log formatter’s format!() was unnecessary for the hot-path log::trace! calls that don’t actually run in release builds. Swapping to log::log_enabled! + explicit formatting eliminated the string allocation entirely.

Post-fix: 0 allocations per frame at 1000 nodes after warmup. The test now asserts equality instead of inequality:

assert_eq!(alloc_count_after - alloc_count_before, 0);

P6 is now literally zero, not “effectively zero.” Which I’m disproportionately pleased about, because the difference between “≤ 1” and “= 0” is, empirically, just a number, but it also means the assertion catches any regression that introduces an allocation anywhere in the hot path.

The final scoreboard

Looking back at the manifesto:

Gate	Target	Before	After
P1 1000-linear eval p50	< 1.0 ms	2.3 ms	0.68 ms
P2 1000-fanout eval p50	< 0.5 ms (rayon)	1.8 ms	0.38 ms
P3 connect p50	< 5 µs (Pearce-Kelly)	180 µs	4.1 µs
P4 1000-elt spread 50-hop	< 50 ns (Arc)	12 ms	370 ns
P5 unchanged-subtree skip	0 compares (< 10 µs)	886 µs	8 µs
P6 string allocs / frame	0	~8,000	0
P7 pin-name hashes / sec	0	~1,000,000	0

Seven of seven green. Every pre-rewrite benchmark within 3% of its baseline (most of them faster). Every visual-regression golden bit-identical within 1 LSB. The iai-callgrind companion benches in CI with 1% regression gates. The proptest invariant checks in CI. The counted-alloc tests in CI.

The rewrite is done. I don’t use that word lightly — there’s always more to do — but the contract I set for myself in the manifesto is met. The engine that was “functional but 2010-shaped” is now “functional and 2026-shaped.” Every artist-facing outcome (live-set wire drags with no hitch, 500-node slider drags in one frame, 10K-element spread fanouts for free, WGSL edits without re-evaluating the whole graph) works the way the manifesto promised.

If you’ve read all nine posts in the series, thank you for sitting through it. The next section of the blog is lighter — features that land on top of the new engine instead of rewrites that replace it. Projection mapping is first, and the demo is way more fun to look at than this table.

I have no idea what I’m doing or if any of this is right, but it’s fun. Follow along.