Parallel Evaluation with Rayon

Rayon has been in Lux’s dependency tree since the very beginning. Spreads use it for parallel map/filter/reduce operations when the spread is big enough. The graph evaluator itself, though? Single-threaded since day one. A sequential for idx in 0..topo_order.len() loop that walked the topo order one node at a time.

Which is a thing you can get away with when every node is a cheap math op. It’s a thing you cannot get away with when your graph has a dozen texture nodes that each dispatch a shader, or an SDF subgraph that compiles WGSL, or fifteen spreads each doing their own parallel work that the outer serial loop blocks waiting for.

This post is the switch to per-level parallel evaluation.

The DAG is already parallel

A topologically-sorted DAG has a natural partition into levels: nodes at level k have their longest path from a root equal to k. Level 0 is all source nodes. Level 1 is nodes whose deepest predecessor is at level 0. And so on.

The key invariant: two nodes at the same level can never have an edge between them. Because if they did, one would be an ancestor of the other, and the descendant would be at at least one level higher.

That invariant is exactly what you need for parallel evaluation. Within a level, every node’s inputs come from earlier levels (already evaluated); every node’s outputs go to later levels (not yet evaluated). There are no inter-node dependencies within the level, so you can evaluate every node in the level concurrently without coordination.

Across levels, you have a barrier: you can’t start level k+1 until every node in level k has finished. But the per-level barrier is cheap; the within-level parallelism is free.

compute_levels

The partition is a new field on Graph:

pub struct Graph {
    // ...
    topo_order: Vec<NodeId>,           // flat ordering (from Pearce-Kelly)
    levels: Vec<SmallVec<[NodeId; 8]>>, // partition
}

levels[i] is a SmallVec of the nodes at level i. Most levels are small enough to fit inline in the 8-element SmallVec, which means the partition typically has zero heap allocations. Wide levels spill to the heap, which is fine.

Computing the partition is O(|V| + |E|):

fn compute_levels(&mut self) {
    self.levels.iter_mut().for_each(|v| v.clear());
    let mut depth = HashMap::with_capacity(self.topo_order.len());
    for &node in &self.topo_order {
        let d = self.backward_adj.get(&node)
            .map(|preds| preds.iter().map(|p| depth[p] + 1).max().unwrap_or(0))
            .unwrap_or(0);
        depth.insert(node, d);
        if self.levels.len() <= d { self.levels.resize_with(d + 1, Default::default); }
        self.levels[d].push(node);
    }
}

Walk the topo order. For each node, its level is max(predecessor_level) + 1, or 0 if it has no predecessors. Clear the level vectors but don’t free their backing storage — capacity persists across frames so the partition allocates on first frame and then amortises to zero.

Rebuilding happens only when topo_dirty flips, which the Pearce-Kelly work from last post keeps minimal. A forward-edge insert doesn’t dirty the topo; a backward-edge insert does, but the redistribute only touches the affected region, so the level partition only needs to update within that region. For typical edit patterns, levels rebuild maybe once per second of active editing.

The rayon switch

With the partition in place, the evaluator changes from:

for &node_id in &graph.topo_order {
    graph.evaluate_one(node_id, ctx);
}

to:

for level in &graph.levels {
    level.par_iter().for_each(|&node_id| {
        graph.evaluate_one(node_id, ctx);
    });
}

That’s the actual change. The framework is rayon’s thread pool, which is already configured at app startup. The for_each blocks until every node in the level finishes; the outer for moves to the next level.

The actually hard part was the &mut ctx. The sequential version has unique mutable access to the evaluation context; every call to graph.evaluate_one hands it the &mut. The parallel version can’t give out multiple &muts to the same context, which is the Rust thing.

The fix is to split the context. Pieces that are logically per-node (scratch buffers, the current NodeId, the pin-change set) become per-node locals that rayon hands out via thread-local allocation. Pieces that are shared (the texture op queue, the mesh op queue, the scene ops) become Arc<Mutex<_>> with fine-grained locking only around the push. Reading from the context (input values, previous outputs, pin metadata) happens via shared immutable references that every thread in the level holds concurrently.

Most of the cost of getting this right was tracking down every place that implicitly needed &mut but didn’t actually write anything — Rust’s borrow checker is helpful but it’s not smart enough to tell you “this looks like a mutation but your logic doesn’t need it.” A few methods got reshaped to take &self instead of &mut self; a few ad-hoc caches got moved to thread-local storage; everything else fell out naturally.

Spread native stays sequential

One subtlety. A handful of nodes are marked spread_native, which means they handle spreads internally rather than being auto-iterated by the evaluator. Merge, RenderScene, reducers like Sum and Average — they consume spreads directly.

These nodes often have parallel work inside their process() (a reducer with 10K elements uses par_iter internally). If the evaluator also runs them in parallel with other spread-native nodes at the same level, you get nested parallelism, which rayon handles but which can thrash the thread pool if the outer and inner work don’t balance.

The current approach is pragmatic: let it happen. Rayon’s work-stealing scheduler handles nested par_iter reasonably well. The few test cases I could construct that might pathologically over-subscribe didn’t show up as regressions in any bench. If a real patch reveals this as a problem I’ll add per-node “prefers serial dispatch” flags, but I didn’t ship them speculatively.

The numbers

The P2 gate was “1000-node wide-fan-out eval p50 < 0.5 ms.” The benchmark: a graph where one source node feeds 999 trivial workers in parallel, each doing a few math ops.

Before rayon: p50 1.28 ms (single-threaded, limited by per-node overhead × 999).
After rayon: p50 0.38 ms on an 8-core machine.

Clears the gate. The scaling isn’t linear with cores — at 8 cores I’d expect ~0.16 ms if everything scaled perfectly, and I’m seeing ~0.38 ms. The gap is the per-node fixed overhead (context handoff, thread pool dispatch) that doesn’t parallelise. Plus the rayon overhead for splitting work items at ~8 µs per node, which is a lot if your nodes are individually cheap.

For graphs with expensive per-node work (texture ops, shader dispatches), the scaling is closer to linear because the fixed overhead is dominated by per-node compute. That’s exactly the case where parallelism matters most.

For the linear-chain benchmark (1000 nodes in a strict chain, one level per node), parallel eval gives zero speedup because every level has exactly one node. That’s correct behaviour; a chain is inherently sequential. The 1000-chain P1 gate (< 1.0 ms) is hit by the serial improvements from SlotMap and PinId, not by rayon.

The correctness guarantee

The thing I most wanted to verify is that parallel evaluation produces the same results as sequential evaluation. The DAG invariant should guarantee this, but “should” isn’t the same as “does.”

The P-gate regression harness from the bit-identical-goldens post does exactly this check. Every pre-rewrite test patch renders identically on the post-rewrite engine, with parallel eval on. Pixel-identical. If a parallel dispatch reordered something in a way that leaked through, the goldens would diverge.

They don’t. Which is reassuring.

I also added a specific stress test: parallel_eval_matches_sequential_on_random_graphs. It builds 100 random DAGs with 50-200 nodes each, evaluates each one sequentially, evaluates each one in parallel, and asserts bit-for-bit output equality. Runs in CI. Caught exactly zero bugs so far, which is either evidence that the evaluator is correct or evidence that my random-DAG generator isn’t adversarial enough. I’m choosing to believe the former.

What’s in the headroom

The per-level barrier is a synchronisation point. For graphs with lots of short levels (like the 1000-chain case), the barrier cost dominates. A fancier evaluator would use task-graph scheduling — fire a node as soon as all its predecessors finish, regardless of level — which would let long-running branches overlap with short-running ones across level boundaries.

I’m not doing that yet. The level-based approach is simple, correct, and fast enough for the P2 gate. Task-graph scheduling has more moving parts and isn’t obviously better for the graphs Lux users are going to build. If a real patch shows up where per-level sync is the bottleneck, I’ll revisit. Otherwise, levels.

One more post in the algorithmic wave. Spread becomes Arc — the P4 gate, which is where the rewrite’s most dramatic number lives.

I have no idea what I’m doing or if any of this is right, but it’s fun. Follow along.