PinId: The Death of HashMap<String, _>

The previous post cut ~70% of the graph engine’s per-frame string allocations by moving type_name to &'static str. The remaining 30% were hidden in every single pin lookup. This post kills them.

The problem, one more time

Every node’s process() calls ctx.input::<f64>("radius") or ctx.output("out", value). Under the hood, both of those go through HashMap::get(&str). Which means every call:

Hashes the pin name string.
Walks the hash bucket.
Compares bytes to find the right entry.
Returns a reference.

On frame N, hashes “radius”. On frame N+1, hashes “radius” again. The hash is deterministic. The string is the same literal in the source. The pin is in the same position in the node’s pin list. Nothing has changed, yet we keep rehashing “radius” as if it might have moved.

For the 1000-node benchmark, that’s about 6,000 hash computations per frame. At 120 FPS, about 720,000 per second. Every single one computing the hash of a string the compiler already knew about at compile time.

There’s no clever runtime cache you can apply here. The fix has to be at compile time: generate a stable integer index per pin, and use that index as the lookup key. The index has to be on the node’s struct as a const, so that every call site is ctx.input_pin::<f64>(CircleNode::PIN_RADIUS) and the compiler can inline the index into the instruction stream.

Which is what a procedural macro is for.
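To make the contrast concrete, here is a minimal sketch of the two lookup strategies side by side. `NodeInputs`, `get_by_name`, and `get_by_index` are hypothetical names for illustration, not the engine's types, and `f64` stands in for a real pin value.

```rust
use std::collections::HashMap;

/// Hypothetical input storage for one node, kept two ways for comparison.
pub struct NodeInputs {
    /// Old style: every lookup hashes the name and probes a bucket.
    pub by_name: HashMap<String, f64>,
    /// New style: pins live in declaration order; the index is the key.
    pub by_index: Vec<f64>,
}

impl NodeInputs {
    pub fn new(names: &[&str], values: &[f64]) -> Self {
        let by_name: HashMap<String, f64> = names
            .iter()
            .zip(values)
            .map(|(n, v)| (n.to_string(), *v))
            .collect();
        NodeInputs { by_name, by_index: values.to_vec() }
    }

    /// Hash the name, walk the bucket, compare bytes. Every call, every frame.
    pub fn get_by_name(&self, name: &str) -> Option<f64> {
        self.by_name.get(name).copied()
    }

    /// Bounds check plus an indexed load; with a const index the compiler
    /// can fold the offset straight into the instruction.
    pub fn get_by_index(&self, idx: usize) -> Option<f64> {
        self.by_index.get(idx).copied()
    }
}

/// What a generated pin constant boils down to: a number known at compile time.
pub const PIN_RADIUS: usize = 1;
```

Both calls return the same value; the difference is that the second never touches the bytes of "radius" at runtime.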

PinId

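#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct PinId(u16);

impl PinId {
    // Bit 15 set = output; bits 0-14 = position. Indices must fit in 15 bits.
    pub const fn input(idx: u16) -> Self { assert!(idx < 0x8000); Self(idx) }
    pub const fn output(idx: u16) -> Self { assert!(idx < 0x8000); Self(idx | 0x8000) }
    pub const fn is_input(self) -> bool { self.0 & 0x8000 == 0 }
    pub const fn position(self) -> u16 { self.0 & 0x7fff }
}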

A 16-bit integer. Bit 15 encodes direction (input/output). Bits 0-14 encode position within the respective list, so nodes can have up to 32,768 inputs and 32,768 outputs. Which, you know, should cover most reasonable cases.

The reason to pack direction into the high bit is so that a single u16 compare fully identifies a pin — same direction, same position, match. The alternative would be a tagged enum or a pair of u16s; neither pays its way for what’s essentially a lookup key.

PinId implements Copy, PartialEq, Eq, and Hash, and fits in a register. Reading one from the node slot is a single load; comparing two is a single integer compare.
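A quick way to see the bit layout is to round-trip a few values. This repeats the PinId definition so the snippet compiles on its own; the `raw` accessor and the `IN_1`/`OUT_1` constants are hypothetical additions for inspection only.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct PinId(u16);

impl PinId {
    pub const fn input(idx: u16) -> Self { assert!(idx < 0x8000); Self(idx) }
    pub const fn output(idx: u16) -> Self { assert!(idx < 0x8000); Self(idx | 0x8000) }
    pub const fn is_input(self) -> bool { self.0 & 0x8000 == 0 }
    pub const fn position(self) -> u16 { self.0 & 0x7fff }
    /// Hypothetical accessor, only here to expose the raw encoding.
    pub const fn raw(self) -> u16 { self.0 }
}

// Input 1 and output 1 share a position but differ in bit 15,
// so a single u16 compare tells them apart.
pub const IN_1: PinId = PinId::input(1);   // raw 0x0001
pub const OUT_1: PinId = PinId::output(1); // raw 0x8001
```

Because the constructors are `const fn` with an `assert!`, a constant like `PinId::input(40_000)` fails at compile time instead of silently setting the direction bit.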

The pin constants

The constants are emitted by an extension to the existing #[lux_node] attribute macro:

#[lux_node(
    type_name = "shape.Circle",
    inputs = ["center", "radius", "fill_color", "stroke_color", "stroke_width"],
    outputs = ["layer"],
)]
pub struct CircleNode {
    // ...
}

At expansion time, the macro emits:

impl CircleNode {
    pub const PIN_CENTER: PinId = PinId::input(0);
    pub const PIN_RADIUS: PinId = PinId::input(1);
    pub const PIN_FILL_COLOR: PinId = PinId::input(2);
    pub const PIN_STROKE_COLOR: PinId = PinId::input(3);
    pub const PIN_STROKE_WIDTH: PinId = PinId::input(4);
    pub const PIN_LAYER: PinId = PinId::output(0);
}

pub const items. Resolvable at compile time. Inlined at every call site. The macro does a small amount of name conversion (center → PIN_CENTER, stroke_color → PIN_STROKE_COLOR) and numbers each list in declaration order. It does not yet check those names against the node’s NodeInfo::info() pin list; for now that cross-check lives in a test (more on that in the migration section).
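The constant-generation step itself is ordinary string work. Here is a sketch of what the attribute computes, written as plain Rust rather than a proc macro so it runs standalone; `const_name`, `pin_constants`, and the string output are assumptions, and a real proc macro would emit tokens (typically via syn and quote) instead of strings.

```rust
/// Turn a pin name like "stroke_color" into a constant name like "PIN_STROKE_COLOR".
pub fn const_name(pin: &str) -> String {
    format!("PIN_{}", pin.to_ascii_uppercase())
}

/// Produce the `pub const` lines the attribute would expand to.
/// Inputs get PinId::input(i), outputs get PinId::output(i); each list
/// is numbered from 0 in declaration order.
pub fn pin_constants(inputs: &[&str], outputs: &[&str]) -> Vec<String> {
    let ins = inputs.iter().enumerate().map(|(i, name)| {
        format!("pub const {}: PinId = PinId::input({});", const_name(name), i)
    });
    let outs = outputs.iter().enumerate().map(|(i, name)| {
        format!("pub const {}: PinId = PinId::output({});", const_name(name), i)
    });
    ins.chain(outs).collect()
}
```

Numbering by declaration order is what ties the constants to slot positions, which is also why reordering the `inputs = [...]` list changes every index.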

The node author now writes:

fn process(&mut self, ctx: &mut ProcessContext) {
    let center: Vec2 = ctx.input_pin(Self::PIN_CENTER);
    let radius: f64 = ctx.input_pin(Self::PIN_RADIUS);
    let fill: Color = ctx.input_pin(Self::PIN_FILL_COLOR);
    // ...
    ctx.output_pin(Self::PIN_LAYER, layer);
}

Same shape as the old ctx.input("center") API. Compile-time checked. Zero runtime hashing.

If you rename the pin in the inputs = [...] list and forget to update the PIN_CENTER reference, you get a compile error at every call site. This was E2 in the manifesto: “pin rename is a compile error at every call site.” Delivered.

The ProcessContext overload

The implementation side lives in four new methods on ProcessContext:

pub fn input_pin<T: FromPinValue>(&mut self, pin: PinId) -> T { ... }
pub fn input_raw_pin(&mut self, pin: PinId) -> Option<&PinValue> { ... }
pub fn output_pin<T: IntoPinValue>(&mut self, pin: PinId, value: T) { ... }
pub fn output_raw_pin(&mut self, pin: PinId, value: PinValue) { ... }

Each one takes a PinId, extracts the direction bit, extracts the position index, and indexes straight into the node’s inputs: Vec<PinSlot> or outputs: Vec<PinSlot> from the slot-map post. Array index. No hash. No bucket walk. No string compare.
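Here is a minimal sketch of that index path for input_pin and output_raw_pin. It assumes simplified PinValue and PinSlot types and adds a Default bound on T to handle missing or mistyped inputs; none of that comes from the post, and the real types from the slot-map post carry more.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct PinId(u16);

impl PinId {
    pub const fn input(idx: u16) -> Self { assert!(idx < 0x8000); Self(idx) }
    pub const fn output(idx: u16) -> Self { assert!(idx < 0x8000); Self(idx | 0x8000) }
    pub const fn is_input(self) -> bool { self.0 & 0x8000 == 0 }
    pub const fn position(self) -> u16 { self.0 & 0x7fff }
}

/// Simplified stand-ins for the engine's value and slot types (assumptions).
#[derive(Clone, Debug, PartialEq)]
pub enum PinValue { Float(f64), Bool(bool) }

#[derive(Default)]
pub struct PinSlot { pub value: Option<PinValue> }

pub trait FromPinValue: Sized {
    fn from_pin_value(v: &PinValue) -> Option<Self>;
}
impl FromPinValue for f64 {
    fn from_pin_value(v: &PinValue) -> Option<Self> {
        match v { PinValue::Float(f) => Some(*f), _ => None }
    }
}
impl FromPinValue for bool {
    fn from_pin_value(v: &PinValue) -> Option<Self> {
        match v { PinValue::Bool(b) => Some(*b), _ => None }
    }
}

pub struct ProcessContext {
    pub inputs: Vec<PinSlot>,
    pub outputs: Vec<PinSlot>,
}

impl ProcessContext {
    /// Decode the position and index the input vector directly.
    /// In this sketch, a missing or mistyped value falls back to T::default().
    pub fn input_pin<T: FromPinValue + Default>(&mut self, pin: PinId) -> T {
        debug_assert!(pin.is_input(), "output PinId passed to input_pin");
        self.inputs
            .get(pin.position() as usize)
            .and_then(|slot| slot.value.as_ref())
            .and_then(T::from_pin_value)
            .unwrap_or_default()
    }

    /// Write an output by index: no hashing, no string compares.
    pub fn output_raw_pin(&mut self, pin: PinId, value: PinValue) {
        debug_assert!(!pin.is_input(), "input PinId passed to output_raw_pin");
        if let Some(slot) = self.outputs.get_mut(pin.position() as usize) {
            slot.value = Some(value);
        }
    }
}
```

The direction bit is only used for a debug check here, since each method already knows which vector it reads or writes.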

A private PinKey trait unifies the two lookup paths (&str and PinId), so the body of ProcessContext::input can dispatch to either. The &str path still exists — every pre-rewrite plugin’s call sites still compile — but the hot path inside the engine uses PinId exclusively.
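A sketch of how one trait can let a single method accept either key type. The post only names PinKey; the `resolve_input` method, the `input_slot` helper, and NodeInfo's `inputs` field are assumptions. Rust won't let a truly private trait appear in a public bound without a lint, so this uses the common sealed-trait pattern: the trait is nameable, but nothing outside the module can implement it.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct PinId(u16);

impl PinId {
    pub const fn input(idx: u16) -> Self { assert!(idx < 0x8000); Self(idx) }
    pub const fn position(self) -> u16 { self.0 & 0x7fff }
}

/// Hypothetical node metadata: input pin names in declaration order.
pub struct NodeInfo { pub inputs: Vec<&'static str> }

impl NodeInfo {
    /// Linear scan over the (short) pin list; None if the name is unknown.
    pub fn pin_id_input(&self, name: &str) -> Option<PinId> {
        self.inputs.iter().position(|n| *n == name).map(|i| PinId::input(i as u16))
    }
}

mod private {
    /// Sealed: only the two key types below can implement PinKey.
    pub trait Sealed {}
    impl Sealed for super::PinId {}
    impl Sealed for &str {}
}

/// Anything that can be turned into an input PinId for a given node.
pub trait PinKey: private::Sealed {
    fn resolve_input(self, info: &NodeInfo) -> Option<PinId>;
}

impl PinKey for PinId {
    // Fast path: already an index, nothing to look up.
    fn resolve_input(self, _info: &NodeInfo) -> Option<PinId> { Some(self) }
}

impl PinKey for &str {
    // Legacy path: resolve the name, then continue on the same index path.
    fn resolve_input(self, info: &NodeInfo) -> Option<PinId> { info.pin_id_input(self) }
}

/// One entry point for both call styles: returns the slot index to read.
pub fn input_slot<K: PinKey>(info: &NodeInfo, key: K) -> Option<usize> {
    key.resolve_input(info).map(|p| p.position() as usize)
}
```

Since `input_slot` is generic, each call site is monomorphized: the PinId version compiles down to the index path with no name handling at all.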

One subtlety: the &str path goes through NodeInfo::pin_id_input(name), which does a linear scan over the pin list and returns a PinId. The scan is O(#pins), which is small (typical nodes have ~5 pins), and the result can be cached on first call and reused. That’s the migration bridge: a legacy lookup costs a short scan over a handful of names (and only one scan per pin if the result is cached) instead of a hash on every call. Still not ideal, but dramatically cheaper than the old HashMap::get(&str) path, and it lets plugins migrate on their own schedule rather than in one big-bang change.
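The cache-on-first-call idea can be as small as a Cell next to the legacy call site. A sketch, assuming a hypothetical `CachedPin` helper; the post doesn't say where the engine actually stores the cached id, and a cached PinId is only valid while the node's pin list stays the same.

```rust
use std::cell::Cell;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PinId(u16);

pub struct NodeInfo { pub inputs: Vec<&'static str> }

impl NodeInfo {
    /// O(#pins) linear scan; None if the name is unknown.
    pub fn pin_id_input(&self, name: &str) -> Option<PinId> {
        self.inputs.iter().position(|n| *n == name).map(|i| PinId(i as u16))
    }
}

/// Hypothetical helper: remembers the PinId for one pin name after the first scan.
pub struct CachedPin {
    name: &'static str,
    id: Cell<Option<PinId>>,
    scans: Cell<u32>, // number of linear scans performed, to show the cache working
}

impl CachedPin {
    pub const fn new(name: &'static str) -> Self {
        CachedPin { name, id: Cell::new(None), scans: Cell::new(0) }
    }

    /// First call scans; later calls return the stored id without scanning.
    /// Unknown names are not cached, so they rescan each time.
    pub fn get(&self, info: &NodeInfo) -> Option<PinId> {
        if let Some(id) = self.id.get() {
            return Some(id);
        }
        self.scans.set(self.scans.get() + 1);
        let id = info.pin_id_input(self.name)?;
        self.id.set(Some(id));
        Some(id)
    }

    pub fn scans(&self) -> u32 { self.scans.get() }
}
```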

The ClinGr migration

Migrating in-tree plugins took about 2,000 lines of change across 50 files, almost entirely mechanical. For each node:

Add inputs = [...] and outputs = [...] to the #[lux_node] macro.
Replace ctx.input::<T>("name") with ctx.input_pin::<T>(Self::PIN_NAME).
Replace ctx.output("name", value) with ctx.output_pin(Self::PIN_NAME, value).
Verify the pin names in the macro match the pin names in info().

The verification step was important. Two nodes had a typo in their macro args (“stoke_color” where info() said “stroke_color”). Both compiled, because the macro is purely additive: it generates constants from its own argument list and never consults info(), so nothing checks that the two lists agree. Having the macro cross-check its inputs = [...] list against the node’s info() at expansion time is a future change; for now, a test iterates every registered node and asserts that the two lists match.
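A sketch of that stopgap test, assuming a hypothetical registry that exposes, per node, the names the macro was given and the names info() reports. `RegisteredNode` and `mismatched_nodes` are illustrative, not the engine's API. Order is compared too, since declaration order determines each PinId's index.

```rust
/// Hypothetical per-node registration record.
pub struct RegisteredNode {
    pub type_name: &'static str,
    /// Names listed in #[lux_node(inputs = [...])].
    pub macro_inputs: &'static [&'static str],
    /// Names reported by NodeInfo::info() at runtime.
    pub info_inputs: Vec<String>,
}

/// Type names of nodes whose two input lists disagree, by name or by order.
pub fn mismatched_nodes(registry: &[RegisteredNode]) -> Vec<&'static str> {
    registry
        .iter()
        .filter(|node| {
            node.macro_inputs.len() != node.info_inputs.len()
                || node
                    .macro_inputs
                    .iter()
                    .zip(&node.info_inputs)
                    .any(|(m, i)| *m != i.as_str())
        })
        .map(|node| node.type_name)
        .collect()
}
```

In a real test the assertion would be that `mismatched_nodes` over every registered node comes back empty; returning the offending names makes the failure message point at the typo directly.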

The counted-alloc test

The P6 gate from the manifesto was: “zero string allocations per frame at 1000 nodes.” The way to prove that in a test is a counted allocator — a custom global allocator that tracks every alloc and dealloc call, and a test that runs the evaluator for one frame on a 1000-node graph and asserts the alloc count didn’t rise.
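A minimal counted allocator looks like the sketch below. Note that a global allocator sees every heap allocation, not just Strings, so "string allocations" in a test like this really means "allocations". Counting per thread is a choice of this sketch (it keeps other threads, such as a test harness, from adding noise), and `count_allocs` is a hypothetical helper name.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::cell::Cell;

thread_local! {
    // Per-thread counter. `const` init and a Drop-free type keep access
    // allocation-free on targets with native thread-local storage.
    static ALLOCS: Cell<usize> = const { Cell::new(0) };
}

/// Wraps the system allocator and counts allocation calls on the current thread.
pub struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // try_with: never panic inside the allocator, even during thread teardown.
        let _ = ALLOCS.try_with(|c| c.set(c.get() + 1));
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
    // alloc_zeroed and realloc keep their default impls, which route through
    // alloc/dealloc above, so they are counted as well.
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Number of heap allocations `f` performed on this thread.
pub fn count_allocs(f: impl FnOnce()) -> usize {
    let before = ALLOCS.with(|c| c.get());
    f();
    ALLOCS.with(|c| c.get()) - before
}
```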

tests/graph_eval_no_string_alloc.rs does exactly that. It builds a 1000-node linear-chain graph, warms up (first frame allocates — NodeInfo caches populate, Vec capacities grow), then runs ten frames and asserts that each post-warmup frame shows ≤ 1 String allocation (the 1 is slop for the log pipeline; I’d prefer 0 and I think there’s one left somewhere in the log formatting that I haven’t tracked down).
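The warm-up-then-measure structure can be factored out from the allocator itself. A sketch, assuming a hypothetical `check_steady_state` helper that takes a closure which runs one frame and returns how many allocations that frame made (for instance via a counting allocator):

```rust
/// Run `warmup` unchecked frames (caches populate, Vec capacities grow),
/// then `frames` checked frames. Each checked frame may allocate at most
/// `budget` times. On failure, returns (checked frame index, allocation count).
pub fn check_steady_state(
    warmup: usize,
    frames: usize,
    budget: usize,
    mut frame_allocs: impl FnMut() -> usize,
) -> Result<(), (usize, usize)> {
    for _ in 0..warmup {
        frame_allocs();
    }
    for i in 0..frames {
        let n = frame_allocs();
        if n > budget {
            return Err((i, n));
        }
    }
    Ok(())
}
```

Keeping the budget as a parameter makes the current slop of 1 explicit; dropping it to 0 is a one-character change once the log-formatting allocation is gone.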

Before the rewrite: ~8,000 allocations per frame at 1000 nodes, after warmup. After this post: 1 allocation per frame. The 1-to-0 gap is what the “last few allocations” post closes, but getting to 1 was the big structural win.

P7 (“zero pin-name hash ops per second”) is the companion assertion. It’s harder to measure directly (hashing doesn’t allocate, so the counted allocator can’t see it), so it’s enforced indirectly: the hot path is built with no HashMap<&str, _> lookups at all, and a separate test counts PinId::from_name calls per frame, so any regression that reintroduces a name lookup trips it.
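One way to make "no name lookups on the hot path" assertable is a counter bumped inside the name-resolution function. A sketch, assuming PinId::from_name is the single choke point for name lookups; its signature here, the `NAME_LOOKUPS` counter, and the `eval_frame` stand-in are all hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static NAME_LOOKUPS: AtomicUsize = AtomicUsize::new(0);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PinId(u16);

impl PinId {
    pub const fn input(idx: u16) -> Self { assert!(idx < 0x8000); Self(idx) }

    /// Legacy name resolution. Every call is counted so a test can assert
    /// that a frame of hot-path evaluation made none.
    pub fn from_name(names: &[&str], name: &str) -> Option<PinId> {
        NAME_LOOKUPS.fetch_add(1, Ordering::Relaxed);
        names.iter().position(|n| *n == name).map(|i| PinId(i as u16))
    }
}

pub fn name_lookups() -> usize {
    NAME_LOOKUPS.load(Ordering::Relaxed)
}

/// Stand-in for one frame of evaluation: reads inputs by PinId only.
pub fn eval_frame(values: &[f64], pins: &[PinId]) -> f64 {
    pins.iter().map(|p| values[p.0 as usize]).sum()
}
```

The per-frame assertion is then just that `name_lookups()` doesn't move across one evaluator frame.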

What I got wrong the first time

This was the second attempt at the macro. The first attempt used a function-like pins! macro that you called inside the struct’s impl block:

impl CircleNode {
    pins! {
        inputs: ["center", "radius", ...],
        outputs: ["layer"],
    }
}

That worked. It was also awkward to keep in sync with the #[lux_node] attribute — you’d have two separate macro invocations that both had to list the pin names, and rename bugs were easy. Merging into the existing #[lux_node] macro made the attribute the single source of truth for pin names, constants, and (eventually) the info() registration.

Might still be wrong. There’s a version of this where the pins get declared on the struct fields themselves (#[pin] center: Vec2), and the macro reads the struct definition to know about them. That would be even more DRY and would open up automatic FromPinValue/IntoPinValue wiring. I didn’t ship that because the struct-field version requires field ordering to match pin ordering, which is a constraint I’m not sure I want, and because the argument version was already working. Will probably revisit when the macro ecosystem gets more consistent.

The cumulative picture

After this post, the dirty-gate + slot-map + PinId stack gives us:

P6: ≤ 1 string allocation per frame, down from ~8,000 (the last one is log-pipeline slop) ✓
P7: 0 pin-name hash ops per second (down from ~1M) ✓
P5: unchanged-subtree skip: 8 µs for 900/1000 (well below the 10 µs gate) ✓
P1: 1000-linear eval: 1.1 ms (still above the 1.0 ms target, needs parallel eval from two posts ahead)

Three of the seven P-gates are green already. The remaining four are the more algorithmic ones — Pearce-Kelly for connect ops, level-parallel eval for fan-out, Arc-wrapped spreads for fan-out-of-spreads, and the value-unchanged gen-bump for change detection.

That’s the next four posts, but not in the order you might expect. The app decomposition comes first because the algorithmic work needed a cleaner app structure to land against.

I have no idea what I’m doing or if any of this is right, but it’s fun. Follow along.