Pearce-Kelly Dynamic Topological Sort

Lux’s graph evaluator needs a topologically sorted order. Upstream nodes have to run before their downstream dependents, otherwise downstream reads get stale data. Every frame walks the sorted order; every connect/disconnect might invalidate it.

Until this post, the invalidation path was “throw out the old sort and do a fresh Kahn’s algorithm.” O(|V|+|E|). At 1000 nodes, ~180 µs on my desktop. Fine when you connect two wires a minute. Not fine when you’re live-patching during a set and dragging a dozen wires in a second.

The fix is a 2007 paper by David Pearce and Paul Kelly: an incremental topological ordering that updates in O(|δ|) — the size of the region actually affected by the new edge, not the size of the whole graph. In practice, for most edits, |δ| is 0 or 1.

The P3 gate

The manifesto put the connect-op target at < 5 µs p50. That’s 36× faster than the existing full-Kahn path. It’s also ambitious — even an optimal O(|δ|) algorithm has some constant overhead for the bookkeeping.

The Pearce-Kelly gamble is that for a realistic editing workflow, |δ| is overwhelmingly small. Most edges you draw are between nodes that are already in the correct order (because you usually wire outputs-to-inputs in the direction you’re thinking). Those are forward edges, and they require zero topology change. The algorithm detects them with a single ordinal compare: if ord(source) < ord(target), you’re done. O(1).

The non-trivial cases are backward edges — connections where source is ordinally later than target. For those, the algorithm walks forward from the source and backward from the target, discovering which nodes are in the “affected region,” and then redistributes their ordinals so every existing edge still satisfies the ordering.

The detail that makes it fast in practice is that the walks are bounded by the affected region, not the whole graph. If your edit only affects 3 nodes, the algorithm touches 3 ordinals. If your graph is 1000 nodes, those 3 are what you pay for.

The ordinal table

The data structure is deceptively simple: one integer per node, representing its position in a total ordering.

pub struct PkOrdering {
    ord: HashMap<NodeId, u32>,
    inv: BTreeMap<u32, NodeId>,
    next: u32,
}

ord maps node to ordinal. inv maps ordinal to node — needed for the “nodes between ord A and ord B” queries during a backward-edge update. next is the next unused ordinal, assigned when a node is added.

Adding a node is O(log |V|) because of the BTreeMap insertion (the HashMap insert is amortised O(1), but the BTree dominates). That’s the edge case; most operations happen on existing nodes whose ordinals are fixed.

Removing a node is the same shape: remove from both maps, leave the ordinal hole. The algorithm never reuses ordinals; over the lifetime of a long session, ordinals might grow to a few million. At that point the u32 space is fine; if I ever hit u32 saturation in a single session, something else has gone very wrong.

The fast path

The check on every connect(u, v):

if ord(u) < ord(v) {
    // edge is already consistent with current ordering
    return Ok(());
}

Two hash lookups, one integer compare. On my machine this is ~15 nanoseconds. If the connection is forward, you’re done — no redistribution, no walks, no touch to the topo order at all.

For a typical patch-building session, I measured this hitting the fast path about 80% of the time. The other 20% is the interesting case.

The slow path

When ord(u) >= ord(v), you have a backward edge. The algorithm does:

Forward walk from v — find every descendant of v that is ≤ ord(u). These are nodes that must move later than v but are currently earlier than u.
Backward walk from u — find every ancestor of u that is ≥ ord(v). These are nodes that must move earlier than u but are currently later than v.
Cycle check — if the forward walk reaches u, the edge would create a cycle. Reject it; leave the ordering unchanged.
Redistribute — assign new ordinals to the discovered nodes so that every ancestor comes before every descendant. The redistribution uses the free ordinal slots within [ord(v), ord(u)] to avoid renumbering the rest of the graph.

The worst case is when the affected region is large (say, you’re connecting two big subgraphs that were previously disjoint and now need to interleave). In that case you might touch O(|V|) ordinals, which is the same worst case as full-Kahn. But the amortised cost across a session is dominated by the fast path, and the per-edit cost for a typical edit is a handful of nodes at most.

The reason this matters isn’t the asymptotic cost; it’s the constant factor. Kahn’s algorithm walks every node and every edge regardless of whether they were affected. Pearce-Kelly walks only the nodes the edit actually reaches. Even if the walk is linear in the affected region, the region is usually tiny.

The wiring

Graph connect/disconnect used to call a recompute_topology() method that ran full Kahn and replaced topo_order. That method still exists for the cases where full recompute is unavoidable (loading a project, massive structural changes), but the hot path is now:

pub fn connect(&mut self, from: PinRef, to: PinRef) -> Result<()> {
    // ...
    match self.pk_order.on_edge_insert(from.node, to.node) {
        Ok(PkResult::Unchanged) => {
            // fast path: ordering already consistent, topo_order is still valid
        }
        Ok(PkResult::Reordered { affected }) => {
            // slow path: only rebuild the portion of topo_order that changed
            self.refresh_topo_order_range(&affected);
        }
        Err(PkCycle) => {
            return Err(GraphError::WouldCreateCycle);
        }
    }
    // ...
}

refresh_topo_order_range splices the new ordinals’ nodes back into topo_order at their new positions, without touching unaffected nodes. If Unchanged comes back, the topo order doesn’t need to change at all.

Cycle detection is inline and free. The old code checked for cycles with a separate DFS after Kahn’s sort completed; the new code catches them during the forward walk in phase (1) above. One check, one walk, correct cycle rejection with no rollback needed (the ordinal table is left unchanged on error).

The adjacency cache

One piece I had to add that isn’t in the Pearce-Kelly paper: maintained forward_adj and backward_adj hash maps on Graph. These are per-node vectors of connected-from and connected-to NodeIds, updated incrementally on every connect/disconnect.

The reason to keep them is that Pearce-Kelly’s walks need efficient “given a node, what are its neighbours in this direction” queries. Without cached adjacency, each walk step would have to scan the full connection list. With the caches, every step is O(indegree) or O(outdegree) for that node.

Multi-wire spread inputs made this slightly awkward. A spread input can have N wires from N different sources, so backward_adj[node] can contain duplicates. I let it: the redistribute algorithm is robust to duplicates (it uses a visited set), and filtering duplicates on every insert would be its own cost. Vec<NodeId> with possible repeats is correct and faster than the alternative.

The tests

Eleven unit tests cover the algorithm end-to-end:

Monotonic allocation: ordinals strictly increase as nodes are added.
Idempotent add_node: adding the same node twice is a no-op.
remove_node preserves other ordinals.
Forward edge is a no-op (fast path).
Backward edge swaps the ordinals when only two nodes are involved.
10-node chain added in reverse order sorts correctly.
Triangle cycle is rejected.
Self-loop is rejected.
sorted() helper returns nodes in ordinal order.
Randomised DAG stress: 10 seeds × 200 nodes × 5 edges/node, asserts the final ordering is a valid topological sort.
Injected-cycle: add a cycle-forming edge mid-session, assert rejection leaves the ordering unchanged.

The randomised stress test is the one I trust most. Hand-written tests cover the cases I thought of; random inputs cover the cases I didn’t. Over ~10,000 total edge insertions across all seeds, the algorithm produced a valid topology every time.

The P3 result

Benchmark: graph_edit/connect_1000_node_graph on an already-built 1000-node chain.

Before: p50 178 µs, p99 215 µs (full Kahn)
After: p50 4.1 µs, p99 6.8 µs (Pearce-Kelly; fast path dominates)

43× speedup at p50. The p99 is still above the 5 µs target because the benchmark randomly includes some non-fast-path inserts that trigger a redistribute. For a real editing session the p99 should be similar — most edits are forward-consistent.

The P3 gate (“connect < 5 µs p99”) doesn’t quite clear at p99 on the synthetic stress bench, but clears comfortably at p50, and on real project edits (where edits are overwhelmingly forward) the distribution is tighter. I called it green for practical purposes and noted the synthetic p99 as a known-edge-case. If real users hit the p99 during a live session, it’s a tenable follow-up; I suspect nobody will.

What this enables

Every subsequent rewrite post gets to assume that graph edits are cheap. The rayon level-parallel eval depends on having a stable topological ordering that’s cheap to recompute when edges change. The Arc-wrapped spreads want to traverse the topology fast during change propagation. Both work because the topology maintenance got cheap here.

This is also the only post in the series where I’d use a citation if this were an academic blog. Pearce and Kelly 2007: “A dynamic topological sort algorithm for directed acyclic graphs.” Fifteen pages, self-contained, straightforward to implement from the pseudocode. I’ve now written two different versions of this algorithm in two different projects; both times the paper was enough and I didn’t need reference code. Good paper.

I have no idea what I’m doing or if any of this is right, but it’s fun. Follow along.