PerfGuard, Hitch Capture, Crash Sandbox

Live performance has one hard rule: the show cannot stop. If a node panics mid-show, the graph keeps running. If the GPU device goes away, the app recovers. If a WGSL shader edit has a typo, the other shaders keep working. If the GPU spends two seconds on one pass, the rest of the frame still finishes. Every single one of those failure modes has to degrade gracefully instead of crashing.

This post covers the three subsystems that make that possible. None of them is individually exciting; collectively they’re the difference between “Lux is fun to demo” and “Lux is something I’d trust with a show.”

PerfGuard

The most ambitious of the three. PerfGuard is an eight-step fallback ladder that degrades render quality progressively when the frame budget is exceeded. The ladder is:

GiQualityLow — ReSTIR GI drops to its low-quality setting.
ReflectRtToScreen — ray-traced reflections fall back to screen-space reflections.
AoRtToHorizon — ray-traced AO falls back to horizon-based AO.
CloudsQuarterRes — volumetric clouds render at quarter resolution.
ShadowsVsmPcssToVsmPcf — shadow softness drops from PCSS to PCF.
RenderScale1To08 — render at 80% resolution and upscale.
DisableSplat — disable the splat-based GI gathering.
EmergencyNativeTonemapPresent — bypass the full post-FX stack; emergency fallback to a native sRGB present.

The ladder is ordered roughly by “least visible degradation” to “most visible degradation.” Dropping GI quality is subtle. Bypassing the post-FX stack is extremely visible. You climb the ladder one step at a time when the frame budget is exceeded for 3 consecutive frames, and you release back down one step at a time when the budget is comfortable for 60 consecutive frames.

The 3-over-budget / 60-under-budget asymmetry is deliberate. Climbing fast protects against real performance cliffs. Releasing slow prevents oscillation — a patch that’s right on the edge of the budget would otherwise bounce between “step 2” and “step 3” forever, and the visual effect of that toggling is much worse than just staying one step higher.

A 30-frame hysteresis lockout prevents a single-spike frame (a stuttered 30ms frame on an otherwise healthy pipeline) from triggering an unnecessary climb. The frame-over-budget counter only increments on consecutive overages; any under-budget frame resets it. Real performance cliffs stay over budget for many frames; transient spikes don’t.
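A minimal sketch of the climb/release counters described above. The `Ladder` and `LadderChange` names are hypothetical, "comfortable" is simplified to "at or under budget", and the 30-frame lockout is left out. This illustrates the asymmetry only and is not Lux's actual implementation.

```rust
/// Hypothetical event type for this sketch; it mirrors the idea of
/// Degraded(step) / Recovered(step) but is not the real PerfGuardEvent.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LadderChange {
    Degraded(usize),
    Recovered(usize),
}

/// Tracks the current ladder step plus the two consecutive-frame counters.
pub struct Ladder {
    step: usize,       // 0 = full quality, MAX_STEP = emergency present
    over_streak: u32,  // consecutive over-budget frames
    under_streak: u32, // consecutive at-or-under-budget frames
}

impl Ladder {
    pub const MAX_STEP: usize = 8;
    pub const CLIMB_AFTER: u32 = 3;
    pub const RELEASE_AFTER: u32 = 60;

    pub fn new() -> Self {
        Self { step: 0, over_streak: 0, under_streak: 0 }
    }

    pub fn step(&self) -> usize {
        self.step
    }

    /// Feed one frame time; returns a change only when the step moves.
    pub fn on_frame(&mut self, frame_ms: f32, budget_ms: f32) -> Option<LadderChange> {
        if frame_ms > budget_ms {
            // Any over-budget frame resets the release counter.
            self.under_streak = 0;
            self.over_streak = self.over_streak.saturating_add(1);
            if self.over_streak >= Self::CLIMB_AFTER && self.step < Self::MAX_STEP {
                self.over_streak = 0;
                self.step += 1;
                return Some(LadderChange::Degraded(self.step));
            }
        } else {
            // Any under-budget frame resets the climb counter.
            self.over_streak = 0;
            self.under_streak = self.under_streak.saturating_add(1);
            if self.under_streak >= Self::RELEASE_AFTER && self.step > 0 {
                self.under_streak = 0;
                let released = self.step;
                self.step -= 1;
                return Some(LadderChange::Recovered(released));
            }
        }
        None
    }
}
```

Because each counter resets whenever a frame lands on the other side of the budget, a single spike never accumulates toward a climb, and a patch hovering at the edge has to stay under budget for a full 60 frames before it steps back down.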

The subsystem emits a PerfGuardEvent stream: Degraded(step), Recovered(step), PresetRefused(reason), TripwireAtCeiling. The status bar shows a PerfGuardLevel chip — Full (no degradation), Reduced (some degradation active), Survival (at the end of the ladder). The user sees the current level and can make a decision: drop a node, simplify a shader, lower a resolution pin.

The refuse_preset path is the other half. When the user tries to apply a render preset (via the preset system landing in a future post) that the current GPU can’t handle (ceiling of 1.3× adapter budget), the preset is refused before it activates and the status bar shows a PresetRefused chip. Better than applying it and immediately degrading.

I’m pretty sure the eight-step ordering is right (I ranked the steps by subjective perceptual impact on a handful of test scenes), but the ordering is inherently fuzzy. “GI quality low” vs. “ray-traced AO to horizon AO” isn’t a clear winner in every scene. If real users find a specific ordering that feels better, the ladder is easy to reorder.

Hitch capture

A hitch is a frame that exceeded the budget by enough to be visible. Not a frame that’s 1ms over — those are part of life. A frame that spiked to 2-3× budget, stalled for a quarter second, caused the visible judder that the audience noticed. Those.

Post-mortem analysis of hitches is essential. You can’t debug “it felt weird during the second chorus” in real time. You need a record of what was over budget, when, and how. The hitch-capture ring holds the last 32 hitches — budget, timestamp, per-subsystem CPU breakdown, GPU-side markers if available, the active PerfGuard step. Push is O(1) atomic. Ring reuses its slots; the 33rd hitch overwrites the 1st.

Two keybindings handle the workflow:

F11 pins the most recent hitch. A pinned hitch doesn’t get overwritten.
F12 exports every pinned hitch to ~/.lux/hitches/hitch_<iso>.json.

During a rehearsal or tech check, you run the set. When something feels wrong, you hit F11 immediately (it pins the most recent hitch, which is what you just felt). At the end of the set, F12 exports all pins to a JSON file you can analyse later.
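A rough sketch of the ring-plus-pin behaviour, under some loud assumptions: the `Hitch` fields are a trimmed-down stand-in for the real record, the ring here is single-threaded rather than atomic, and pinning is modelled as copying the most recent hitch into a separate list so later pushes can't overwrite it. The real mechanism may well differ.

```rust
/// Trimmed-down, hypothetical hitch record for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Hitch {
    pub frame_ms: f32,
    pub budget_ms: f32,
    pub perfguard_step: usize,
}

/// Fixed-capacity ring: the (capacity + 1)th push overwrites the oldest slot.
pub struct HitchRing {
    slots: Vec<Option<Hitch>>,
    next: usize,        // slot the next push writes to
    pinned: Vec<Hitch>, // copies that survive overwrites (assumed design)
}

impl HitchRing {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "ring needs at least one slot");
        Self { slots: vec![None; capacity], next: 0, pinned: Vec::new() }
    }

    /// O(1): write into the next slot and advance, wrapping at capacity.
    pub fn push(&mut self, hitch: Hitch) {
        self.slots[self.next] = Some(hitch);
        self.next = (self.next + 1) % self.slots.len();
    }

    /// The slot written by the latest push, if any.
    pub fn most_recent(&self) -> Option<&Hitch> {
        let len = self.slots.len();
        self.slots[(self.next + len - 1) % len].as_ref()
    }

    /// What an F11-style handler might call; returns false on an empty ring.
    pub fn pin_most_recent(&mut self) -> bool {
        match self.most_recent().cloned() {
            Some(h) => {
                self.pinned.push(h);
                true
            }
            None => false,
        }
    }

    /// What an F12-style export would iterate over.
    pub fn pinned(&self) -> &[Hitch] {
        &self.pinned
    }

    /// Number of occupied slots.
    pub fn len(&self) -> usize {
        self.slots.iter().filter(|s| s.is_some()).count()
    }
}
```

With `HitchRing::new(32)`, pushing a 33rd hitch leaves 32 entries in the ring, and anything pinned before that stays available for export no matter how many hitches follow.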

Optional tracy integration: if you build with the tracy feature, every hitch plots as a lux-hitch-ms event in tracy’s timeline. Tracy is an in-depth profiler that’s overkill for most development but the standard tool for live-performance profiling. When you’re trying to answer “why did that single frame spike to 50ms?”, tracy’s deep-dive view is the right place, and having hitches pre-plotted saves you time.

I iterated on the ring size a few times. 8 slots lost hitches too fast during a long show. 128 slots used memory for nothing — the ones you care about are recent. 32 is the experimental sweet spot for a 2-hour set: enough slack to find the one you missed, small enough to scroll through. I could be wrong about 32.

Crash sandbox

Three separate crash-recovery mechanisms, each for a different failure mode.

EvalPanicCatcher

A node’s process() can panic. Bad input, out-of-bounds access, divide-by-zero in a shader that doesn’t have the guards it should. The evaluator has been catching these with catch_unwind since zero-alloc eval; the new piece is that the evaluator now retains the last-good output of a panicking node and emits it downstream.

So if a node panics on frame 5000 but worked fine on frame 4999, downstream nodes see frame 4999’s output on frame 5000. The graph doesn’t stall. The visual doesn’t blink. The only observable effect is that the panicking node stops animating — its output is frozen at its last known good state — and the red error dot from the polish post lights up on the canvas so the user can see which node failed.

If the node recovers on a later frame (panic was transient), it resumes normally and the error dot clears. If it keeps panicking, the dot stays lit and the user can identify and fix the problem. Either way the show keeps running.
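The last-good pattern can be sketched with `std::panic::catch_unwind`. `GuardedNode` and its `errored` flag are hypothetical names for illustration; the real evaluator presumably does this without allocating or cloning per frame, which this sketch doesn't attempt.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Hypothetical wrapper that remembers a node's last successful output.
pub struct GuardedNode<T: Clone> {
    last_good: Option<T>,
    /// Stand-in for the red error dot: true while the node is panicking.
    pub errored: bool,
}

impl<T: Clone> GuardedNode<T> {
    pub fn new() -> Self {
        Self { last_good: None, errored: false }
    }

    /// Run one evaluation. On panic, return the previous good output
    /// (None only if the node has never succeeded).
    pub fn eval<F: FnOnce() -> T>(&mut self, process: F) -> Option<T> {
        // AssertUnwindSafe: we promise not to rely on state the closure
        // may have left half-updated when it panicked.
        match catch_unwind(AssertUnwindSafe(process)) {
            Ok(out) => {
                self.last_good = Some(out.clone());
                self.errored = false; // transient panic: dot clears
                Some(out)
            }
            Err(_) => {
                self.errored = true; // dot lights up
                self.last_good.clone()
            }
        }
    }
}
```

Note that `catch_unwind` only works when the binary is built with unwinding panics; with `panic = "abort"` the process still terminates.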

DeviceLostRecovery

wgpu’s DeviceLost is a real error condition. A driver crash, a TDR on Windows, a GPU hot-plug, any of a hundred reasons the underlying device can disappear out from under the app. The default behaviour is a panic; the production behaviour has to be recovery.

DeviceLostRecovery does a 3-attempt / 2-second-budget retry: create a new device, re-create all pool textures by name from the previous pool’s snapshot, re-register every shader pipeline, re-register every bind group. If all three attempts fail, the app shows a full-viewport red banner (“GPU recovery failed — please restart”) rather than crashing.

Three attempts because device loss is often a transient driver glitch that resolves if you wait and retry. Two seconds because any longer and the audience notices. Either the device comes back in under 2 seconds and the show resumes with maybe a blank second, or it doesn’t and no amount of further waiting helps.
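The attempt-count-plus-deadline shape looks roughly like this. It's a sketch with the device creation abstracted into a closure, so nothing here touches wgpu; `recover_device` and `Recovery` are made-up names, and a real loop would probably pause briefly between attempts.

```rust
use std::time::{Duration, Instant};

/// Hypothetical outcome type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum Recovery<D> {
    Recovered { device: D, attempts: u32 },
    /// Caller shows the "please restart" banner instead of crashing.
    Failed { attempts: u32 },
}

/// Retry `try_create` until it succeeds, the attempt cap is hit,
/// or the wall-clock budget runs out, whichever comes first.
pub fn recover_device<D, E>(
    max_attempts: u32,
    budget: Duration,
    mut try_create: impl FnMut(u32) -> Result<D, E>,
) -> Recovery<D> {
    let start = Instant::now();
    let mut attempts = 0;
    while attempts < max_attempts && start.elapsed() < budget {
        attempts += 1;
        // The closure receives the 1-based attempt number.
        if let Ok(device) = try_create(attempts) {
            return Recovery::Recovered { device, attempts };
        }
    }
    Recovery::Failed { attempts }
}
```

Called as `recover_device(3, Duration::from_secs(2), ...)`, this gives up after three failures or two seconds, and the `Failed` arm is where the red banner would be raised.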

The pool-snapshot piece is where most of the code lives. On device loss, the TexturePool, MeshPool, and BufferPool each serialize their live entries’ descriptors (not the GPU data, which is gone) keyed by handle. On recovery, the new device re-allocates each entry by replaying the descriptors. Handles stay stable across the recovery; nodes that cached handles don’t have to re-upload. The GPU data is gone, but the first frame after recovery triggers the stateful-GPU nodes’ mark_in_use-style re-upload pattern, so by the second or third frame the state is back to normal.
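To make the snapshot-and-replay idea concrete, here is a toy version for one pool. Everything in it is an assumption for illustration: `Handle`, `TextureDesc`, and the `generation` field (standing in for "which device allocated this") are invented, and no GPU calls are involved.

```rust
use std::collections::BTreeMap;

/// Hypothetical stable handle that nodes cache across frames.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Handle(pub u32);

/// Everything needed to re-create an entry, but none of its contents.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TextureDesc {
    pub name: String,
    pub width: u32,
    pub height: u32,
}

/// Stand-in for a GPU allocation; `generation` marks the owning device.
#[derive(Debug, PartialEq, Eq)]
pub struct Texture {
    pub desc: TextureDesc,
    pub generation: u32,
}

pub struct TexturePool {
    entries: BTreeMap<Handle, Texture>,
}

impl TexturePool {
    pub fn new() -> Self {
        Self { entries: BTreeMap::new() }
    }

    pub fn insert(&mut self, handle: Handle, desc: TextureDesc, generation: u32) {
        self.entries.insert(handle, Texture { desc, generation });
    }

    /// On device loss: keep descriptors keyed by handle, drop the data.
    pub fn snapshot(&self) -> BTreeMap<Handle, TextureDesc> {
        self.entries.iter().map(|(h, t)| (*h, t.desc.clone())).collect()
    }

    /// On recovery: replay each descriptor against the new device,
    /// reusing the same handle so cached handles stay valid.
    pub fn restore(snapshot: &BTreeMap<Handle, TextureDesc>, generation: u32) -> Self {
        let entries = snapshot
            .iter()
            .map(|(h, d)| (*h, Texture { desc: d.clone(), generation }))
            .collect();
        Self { entries }
    }

    pub fn get(&self, handle: Handle) -> Option<&Texture> {
        self.entries.get(&handle)
    }
}
```

The point of keying by handle is visible in `restore`: a node holding `Handle(7)` before the loss can look it up afterwards and get a fresh, empty allocation with the same shape, ready for re-upload.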

validate_wgsl_async

When a user edits a WGSL shader in the PixelShader or ComputeShader node, the edit might not parse. Or it might parse but fail to compile under wgpu’s constraints. Either way, validating synchronously on every edit blocks the UI thread for the duration of the validation, potentially tens of milliseconds for a complex shader.

validate_wgsl_async offloads shader validation to a background thread. naga (the shader translator and validator wgpu uses) is thread-safe for parsing, so the parse and validation run on a background thread and the result is posted back to the UI. The UI stays responsive; the validation error (if any) shows up a few frames later with a red error marker on the node. If the shader is valid, the PipelineCache compiles it on the next frame and the node starts using the new shader.
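The offload itself is a small amount of plumbing. This sketch uses only the standard library: `validate_async`, `ValidationResult`, and `toy_validate` are hypothetical names, and the toy validator just checks brace balance. In a real build the validator passed in would presumably run naga's WGSL front end and validator instead.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

/// Hypothetical result message posted back to the UI thread.
#[derive(Debug, PartialEq, Eq)]
pub struct ValidationResult {
    pub node_id: u32,
    pub outcome: Result<(), String>,
}

/// Run `validate` on a background thread and hand back a receiver.
/// The UI would poll it with `try_recv()` once per frame.
pub fn validate_async<V>(node_id: u32, source: String, validate: V) -> Receiver<ValidationResult>
where
    V: FnOnce(&str) -> Result<(), String> + Send + 'static,
{
    let (tx, rx) = channel();
    thread::spawn(move || {
        let outcome = validate(&source);
        // If the UI dropped the receiver (node deleted), ignore the error.
        let _ = tx.send(ValidationResult { node_id, outcome });
    });
    rx
}

/// Stand-in validator for the sketch: only checks that braces balance.
pub fn toy_validate(src: &str) -> Result<(), String> {
    let open = src.matches('{').count();
    let close = src.matches('}').count();
    if open == close {
        Ok(())
    } else {
        Err(format!("unbalanced braces: {} open, {} close", open, close))
    }
}
```

Until a result arrives, the node simply keeps its existing pipeline, which is what keeps the output from going blank mid-edit.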

For PixelShader / ComputeShader nodes that users edit interactively, this is the difference between a snappy “type, see result” loop and a juddery “type, wait for the app to respond, see result” loop.

BeatContext

One other piece landed in this session that’s mostly a plumbing change but will matter for live-performance beat-sync features going forward.

A new BeatContext type on FrameContext carries beat-clock state: current tempo, current beat phase, clock source (LTC, MIDI Timecode, Ableton Link, tap tempo, or system clock). Every ProcessContext now exposes ctx.beat() returning the current BeatContext. Every node that wants to sync to a beat — metros, LFOs, pulses, timelines — can now do so by reading ctx.beat().phase instead of rolling its own timing.

The lux-live::beat_context module keeps a MasterClock that manages the actual clock sources: it prefers LTC (tightest sync), falls back to MIDI Timecode, then Ableton Link, then tap tempo, then system clock. An IIR phase filter smooths out jitter from noisy sources. Clock source changes (say, losing LTC signal mid-show) degrade gracefully — the next tick just comes from the next-priority source, and the IIR filter prevents visible jumps.
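A sketch of the two ideas in that paragraph: priority-ordered source selection and a one-pole IIR filter on beat phase. The names and the filter form are my assumptions, not the contents of the real MasterClock. The one subtlety worth showing is that phase wraps at 1.0, so the filter has to correct along the shortest way around the circle.

```rust
/// Clock sources declared in priority order, so the derived `Ord`
/// ranks LTC first and the system clock last.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ClockSource {
    Ltc,
    MidiTimecode,
    AbletonLink,
    TapTempo,
    System,
}

/// Pick the highest-priority source that currently has signal;
/// the system clock is always the last resort.
pub fn pick_source(available: &[ClockSource]) -> ClockSource {
    available.iter().copied().min().unwrap_or(ClockSource::System)
}

/// One-pole IIR smoother for a beat phase in [0, 1).
pub struct PhaseFilter {
    phase: f32,
    alpha: f32, // 0 = ignore measurements, 1 = follow them exactly
}

impl PhaseFilter {
    pub fn new(alpha: f32) -> Self {
        Self { phase: 0.0, alpha: alpha.clamp(0.0, 1.0) }
    }

    /// Nudge the filtered phase toward `measured` and return it.
    pub fn update(&mut self, measured: f32) -> f32 {
        // Signed error wrapped into [-0.5, 0.5] so that 0.9 seen from
        // 0.0 is treated as "0.1 behind", not "0.9 ahead".
        let mut err = measured - self.phase;
        err -= err.round();
        self.phase = (self.phase + self.alpha * err).rem_euclid(1.0);
        self.phase
    }
}
```

When a source drops out, `pick_source` just returns the next one down the list, and because the filter only moves a fraction of the error per tick, the phase reported to nodes slides toward the new source instead of jumping.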

Zero plugin call-site changes required. Nodes that want beat-aware behaviour opt in by reading ctx.beat(); nodes that don’t want beat awareness ignore the new field. The FrameContext literal-construction sites (five of them, in lux-app and lux-mcp) needed a ..Default::default() spread to absorb the new field, which is a mechanical migration the compiler flags for you.

The combined contract

With all three subsystems in place, Lux’s live-performance contract is:

If a node panics, downstream freezes at last-good state. The graph keeps running.
If a shader fails to compile, the old pipeline keeps running until the new shader validates. No blank output during edits.
If the GPU device is lost, a 3-attempt / 2-second recovery runs. If it succeeds: at most 2 seconds of blank output followed by resumption. If it fails: a restart banner instead of a crash.
If the frame budget is exceeded, PerfGuard progressively degrades quality rather than dropping frames. The show loses visual fidelity but keeps running.
Every hitch is captured to the ring. F11 pins it. F12 exports it. Post-show analysis is a JSON file away.
Every node time-synced to an external clock (LTC, MTC, Link) stays synced through source changes.

None of these subsystems fires in a healthy patch. All of them exist for the one time in a six-hour show where something goes wrong, and the difference between “the audience noticed something weird for 2 seconds” and “the app crashed on stage” is whether these systems were there.

The last live-performance post follows this one: ReSTIR and denoise. It’s about the rendering side of live work: real-time global illumination, not crash recovery. If you’ve read all the way here, thank you. The next post is shinier.

I have no idea what I’m doing or if any of this is right, but it’s fun. Follow along.