SlotMap + Generation Counters

Last post was the manifesto. This one is the first actual commit of the rewrite: the new node storage layout.

The change looks small on paper — replace seven hash maps with one slot map — and felt like surgery in practice. It’s the foundation every subsequent post in the series builds on, so it’s worth walking through carefully.

What was there before

Graph stored its nodes across seven parallel HashMaps, all keyed by NodeId:

pub struct Graph {
    nodes: HashMap<NodeId, Box<dyn Node>>,
    node_inputs: HashMap<NodeId, HashMap<String, PinValue>>,
    node_outputs: HashMap<NodeId, HashMap<String, PinValue>>,
    prev_inputs: HashMap<NodeId, HashMap<String, PinValue>>,
    node_info_cache: HashMap<NodeId, NodeInfo>,
    always_dirty_nodes: HashSet<NodeId>,
    type_names: HashMap<NodeId, String>,
    // ...
}

Every node lookup hit seven cache lines. Every iteration walked seven separate hash tables. Adding a new piece of per-node state meant adding an eighth. Forgetting to update one of them on node removal was a bug waiting to happen.

The other problem: NodeId is a u128. It was a u128 because I wanted IDs to be UUID-compatible so that collaborative editing could eventually work without ID collisions across machines. That’s the right choice for the public identity. It’s a terrible choice for the internal hot-path key, because a u128 key means every hash bucket lookup does 16-byte comparisons instead of 4-byte ones.

SlotMap to the rescue

slotmap::DenseSlotMap is a Rust crate that implements a keyed arena with fast iteration and cheap insertion/removal. Internally it stores its values in a dense Vec<T> plus a sparse Vec<SlotIndex> that maps keys to positions. Insertion is O(1) amortised. Removal is O(1). Iteration walks the dense Vec in cache-friendly order, skipping nothing because the dense vec holds only live entries.

The tradeoff is that the key isn’t stable the way a UUID is — the library reuses freed slots. To prevent the classic use-after-free bug where a key points at a slot that’s now a different value, SlotKey carries a 32-bit generation tag alongside the index. When a slot is freed, its generation increments. A key with an old generation fails the lookup gracefully.

For Lux, this translates to a clean two-layer identity:

// Public — serialisable, UUID-compatible, survives save/load
pub struct NodeId(u128);

// Internal — fast, arena-indexed, never serialised
type SlotKey = slotmap::DefaultKey;

Every Graph operation takes a NodeId at its public boundary. The boundary translates once, calls into the arena using SlotKey, and translates back on the way out. The hot paths (eval loop, dirty propagation, topology walks) never see the u128.

I’m pretty sure this is the right split. The alternative — making the internal key also a u128 — would be simpler to reason about but would leave the hash-bucket cost on the hot path for no real gain. The alternative-alternative — making NodeId itself a SlotKey — would be fastest but would leak the arena’s reuse semantics into every save file, which would be catastrophic.

NodeSlot

The seven hash maps collapse into one NodeSlot struct:

pub struct NodeSlot {
    pub node: Box<dyn Node>,
    pub info: Arc<NodeInfo>,         // cached from node.info()
    pub type_name: &'static str,     // cached from info.type_name
    pub inputs: Vec<PinSlot>,        // indexed by PinIdx
    pub outputs: Vec<PinSlot>,       // indexed by PinIdx
    pub flags: NodeFlags,            // bitflags
    pub edit_gen: u32,               // bumps on any mutation to this node
    pub eval_gen: u32,               // last eval that touched outputs
}

One cache-line-friendly record per node. Iterating the graph is arena.values(), which walks the dense backing vec in order. Lookups are arena-key indexed, which is ~4 cycles on any modern CPU.

Two details worth pointing out.

info is Arc<NodeInfo>, not a direct field. NodeInfo is ~300 bytes when it’s populated with descriptions, tags, see_also, and pin docs. The evaluator reads it dozens of times per frame per node. Cloning 300 bytes × 1000 nodes per frame would be 300 KB of memcpy we don’t need. Wrapping in Arc means node_slot.info() is a refcount-bump-equivalent read, and the 300 bytes live at one address.

type_name is &'static str. The type name comes from a static string literal in the node’s source file. There’s no reason to heap-allocate it. Moving it to &'static str was a one-line change that removed ~1,000 string allocations per frame at 1000 nodes. I’d been carrying the owned-string version since day one for no reason other than “I wrote String and didn’t revisit.”

NodeFlags

The always_dirty_nodes HashSet from the pre-rewrite layout was a single bit of per-node state stored as a set-membership check. Every frame, the evaluator called contains(&node_id) for every node to decide whether to mark it dirty. HashSet::contains isn’t expensive, but doing it 1000 times per frame is 1000 hash computations we don’t need when the information fits in one bit.

NodeFlags is a bitflags! struct, one u32 per node:

bitflags! {
    pub struct NodeFlags: u32 {
        const DIRTY_MANUAL   = 1 << 0;  // user-set; cleared after eval
        const ALWAYS_DIRTY   = 1 << 1;  // from NodeInfo (Time, Mouse, etc.)
        const STATEFUL_GPU   = 1 << 2;  // holds GPU handles; mark-in-use contract applies
        const ERRORED        = 1 << 3;  // last eval panicked
        const SPREAD_NATIVE  = 1 << 4;  // handles spreads internally; don't auto-iterate
    }
}

Reading or writing any flag is a single integer op. Testing membership is a bit mask. Packed into one u32, all five flags live in a single cache line along with the rest of the NodeSlot, so flag reads pay zero extra cache-miss cost.

The seven parallel hash maps are now literally gone. always_dirty_nodes is a flag bit. errored_nodes from the error-dots post is a flag bit. Stateful-GPU tracking is a flag bit. All three are reads from the same u32 that I need to look at anyway to check dirtyness. Net pattern: per-node state goes in the slot, not in a sidecar.

Generation counters, the dirty-gate fix

This is the piece I’m most pleased with.

The pre-rewrite dirty check was:

if node.inputs != prev_inputs[&node_id] {
    // run node
}

Which deep-compares every input value on every frame. For a node reading a scalar, that’s cheap. For a node reading a spread of 10,000 elements, that’s 10,000 PinValue::eq calls per frame just to decide whether to run the node, which at that point you may as well have just run.

And prev_inputs is itself a deep clone of the last frame’s inputs, which is another ~10,000 clones for the same spread.

The new approach is per-pin generation counters. Every PinSlot stores a u32 gen that bumps whenever the pin’s value actually changes. Not “might have changed” — changed, as in, the new value doesn’t equal the old one, and we bumped the gen because of that. Downstream nodes compare the current gen to the gen they saw last frame:

pub fn should_evaluate(&self, prev_gens: &[u32]) -> bool {
    if self.flags.contains(NodeFlags::ALWAYS_DIRTY)
        || self.flags.contains(NodeFlags::DIRTY_MANUAL) {
        return true;
    }
    self.inputs.iter().zip(prev_gens).any(|(slot, prev)| slot.gen != *prev)
}

Per-pin, integer compare, branchless in the hot path. A spread of 10,000 elements with unchanged contents reports an unchanged gen — one integer compare — and the dirty check says “no.” Zero deep compares. Zero clones. Zero traversal of the underlying spread.

The piece that makes this work cheaply is the gen-bump rule: only bump gen when the value actually changes. Which sounds circular, but the spread-becomes-Arc post introduces an O(1) PinValue::eq_cheap that makes the gen-bump decision itself cheap even for huge spreads.

Stacking all of that gets you the P5 gate from the manifesto: “zero compares for unchanged subtree.” A 1000-node graph where only 10 nodes changed runs those 10 nodes and skips the other 990 after per-pin gen-counter compares alone.

collect_changed_pins_into

Plugins have a ProcessContext::changed(pin_name) -> bool helper that tells them which inputs differ from last frame. A Spring node uses it to reset its velocity when the target gets snapped. A StateMachine uses it to detect transition triggers.

The old implementation deep-compared prev_inputs[pin] against the current input. The new implementation reads the pin’s gen and compares it to the one recorded at the start of this node’s process():

pub fn collect_changed_pins_into(
    &self,
    prev_gens: &[u32],
    out: &mut SmallVec<[PinIdx; 4]>,
) {
    out.clear();
    for (idx, (slot, prev)) in self.inputs.iter().zip(prev_gens).enumerate() {
        if slot.gen != *prev {
            out.push(PinIdx(idx as u32));
        }
    }
}

Zero allocations. SmallVec<[PinIdx; 4]> holds up to four changed pins inline; nodes with lots of pins spill to the heap but most don’t. The buffer is owned by the ProcessContext and cleared-and-reused across every process() call within a frame. Same pattern as the zero-alloc eval loop.

The migration

Actually shipping this is two commits. The first (c7aa2b4) adds the new module (arena.rs) alongside the old code without touching anything. The second (847174d) migrates graph.rs and eval.rs to use it, deleting the seven-map layout in one go.

I tried doing this as a progressive migration — add the slot map, move fields one at a time, delete the old ones piecewise — and it didn’t work. The data structures are too entangled; partial state on either side crashes constantly. Big-bang migration with the regression suite as the safety net turned out to be the right answer, even though every instinct says big-bang migrations are bad.

Regression suite passes. Benchmarks run. The bench baseline bumps with the new numbers recorded in my design-doc margin. The seven-map layout is gone.

What got faster

With just the slot-map layout change (no algorithmic improvements yet — those are the next posts), the benches move:

graph_eval/1000_linear p50: 2.3 ms → 1.6 ms (-30%)
graph_eval/1000_fanout p50: 1.8 ms → 1.3 ms (-28%)
String allocations per frame at 1000 nodes: 8,000 → 2,300 (only the type_names dropped; the rest needs the PinId work)
Dirty-gate unchanged-subtree skip: 900/1000 nodes → 2.1 ms → 120 µs (-94%)

That last one is the big one. The dirty-gate improvement alone almost hits the P5 target (“zero compares unchanged subtree, ≤ 10 µs for 900/1000”), and there’s more gen-counter work coming that’ll push it below 10 µs.

The string-allocation number is still terrible (2,300 per frame), because every pin lookup still goes through HashMap::get(&str). That’s the next post’s problem.

I have no idea what I’m doing or if any of this is right, but it’s fun. Follow along.