ResourceKind Becomes a Closed Set

The framegraph shipped a year ago with exactly one resource variant: Texture2D. There was a comment in the source that said the others were coming. The comment was load-bearing. It was the only thing standing between the framegraph and a future where someone would try to allocate a depth target through it and discover that the only way to do that was to lie to the compiler about what they were doing.

That future arrived a few months ago, when the shadow work needed real depth targets. The workaround at the time was a 1×1 Texture2D sentinel that the consumer pretended was something else. It worked. It also made me feel weird every time I read past it.

This post is the variants finally arriving, plus a pile of related cleanup that became possible once they did.

ResourceKind

Closed enum, seven variants:

pub enum ResourceKind {
    Texture2D { width, height, format, ... },
    Depth { width, height, format, samples },
    Cube { face_size, format, mip_count },
    Texture3D { width, height, depth, format },
    TextureArray { width, height, layers, format },
    Msaa { width, height, format, samples },
    Buffer { byte_size, usage },
}

The aliaser keys on a ResourceKindTag derived from the discriminant. Cross-tag aliasing is statically impossible. A Texture2D slot will never get reused for a Buffer, a Cube for a Depth. Within a tag, the existing free-list allocator handles size and format matching the same way the texture pool always did.
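A minimal sketch of what per-tag aliasing could look like, under stated assumptions: the enum is trimmed to three variants, the field types are guesses, and the Aliaser type with its acquire and release methods is hypothetical. Matching inside a tag is reduced to exact equality here, which is simpler than the size and format matching the real pool does.

```rust
use std::collections::HashMap;

/// Trimmed stand-in for ResourceKind; variants and field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum ResourceKind {
    Texture2D { width: u32, height: u32 },
    Depth { width: u32, height: u32, samples: u32 },
    Buffer { byte_size: u64 },
}

/// Field-less mirror of the discriminant, used as the aliasing key.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum ResourceKindTag {
    Texture2D,
    Depth,
    Buffer,
}

impl ResourceKind {
    pub fn tag(&self) -> ResourceKindTag {
        match self {
            ResourceKind::Texture2D { .. } => ResourceKindTag::Texture2D,
            ResourceKind::Depth { .. } => ResourceKindTag::Depth,
            ResourceKind::Buffer { .. } => ResourceKindTag::Buffer,
        }
    }
}

/// Hypothetical aliaser: one free list per tag, so a slot released under
/// one tag is never visible to a request made under another.
#[derive(Default)]
pub struct Aliaser {
    free: HashMap<ResourceKindTag, Vec<(usize, ResourceKind)>>,
    next_slot: usize,
}

impl Aliaser {
    /// Reuse a free slot with an identical description, or mint a new slot id.
    pub fn acquire(&mut self, kind: &ResourceKind) -> usize {
        let list = self.free.entry(kind.tag()).or_default();
        if let Some(pos) = list.iter().position(|(_, k)| k == kind) {
            return list.swap_remove(pos).0;
        }
        let slot = self.next_slot;
        self.next_slot += 1;
        slot
    }

    /// Return a slot to the free list that belongs to its own tag.
    pub fn release(&mut self, slot: usize, kind: ResourceKind) {
        self.free.entry(kind.tag()).or_default().push((slot, kind));
    }
}
```

Because the free lists live behind the tag key, a released Texture2D slot can only ever be handed back to another Texture2D request; a Buffer request never sees it.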

The wildcard_enum_match_arm clippy lint is now denied at the framegraph crate root. When a new variant gets added (and a new variant will get added, because I already know I want a sparse-volume tag), every match site in the codebase has to add the explicit arm. No silent defaulting. No “oh I forgot to handle that case and the compiler didn’t tell me” stories at 11 PM.
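For illustration, here is roughly how that lint interacts with an exhaustive match. The Kind enum and the is_texture_like function are made up for this sketch, and the lint is attached to one function instead of the crate root so the example stands alone.

```rust
// At the crate root the inner-attribute form would be:
//     #![deny(clippy::wildcard_enum_match_arm)]
// The outer form below scopes the same lint to one function for this sketch.
// The lint only fires under `cargo clippy`; plain rustc ignores it.

/// Hypothetical three-variant enum standing in for ResourceKind.
#[derive(Debug, Clone, Copy)]
pub enum Kind {
    Texture2D,
    Depth,
    Buffer,
}

#[deny(clippy::wildcard_enum_match_arm)]
pub fn is_texture_like(kind: Kind) -> bool {
    match kind {
        Kind::Texture2D => true,
        Kind::Depth => true,
        Kind::Buffer => false,
        // A `_ => false` arm here is what clippy would reject. Without it,
        // adding a variant to Kind makes this match non-exhaustive, which
        // rustc reports as a hard error at this exact site.
    }
}
```

The two checks work together: rustc enforces exhaustiveness, and the lint removes the escape hatch that would let a new variant fall into a default.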

QueueAffinity and EncoderDispatch

Once you have multiple resource kinds, the next obvious thing is multiple queues. wgpu exposes Graphics and Compute as logically distinct queues. On hardware that supports it, dispatching compute work to the compute queue lets it run alongside graphics work on the graphics queue. On hardware that doesn’t (mobile GPUs, llvmpipe, some integrated chips), they’re the same queue and the affinity tag is just metadata.

QueueAffinity::{Graphics, Compute} tags every pass and defaults to Graphics. Eight existing helpers got the Compute tag where appropriate: bloom downsample/upsample, IBL bake, Hi-Z build, denoise, meshlet cull, cluster-light bin, skin compute, and the VSM page-mark.

EncoderDispatch is the trait that routes a tagged pass to the right encoder. Single-queue adapters return the same encoder for both affinity values. The trait’s has_distinct_encoders() predicate tells the executor whether to bother with the routing logic at all. Behaviour on llvmpipe is identical to the pre-routing world.
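A rough sketch of that shape, assuming a stand-in Encoder type in place of a real wgpu::CommandEncoder. Only QueueAffinity, EncoderDispatch, and has_distinct_encoders() are names from the framegraph; the encoder_for signature, SingleQueue, DualQueue, and record_pass are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum QueueAffinity {
    #[default]
    Graphics,
    Compute,
}

/// Stand-in for a command encoder; it just remembers which passes it recorded.
#[derive(Debug, Default)]
pub struct Encoder {
    pub recorded: Vec<&'static str>,
}

pub trait EncoderDispatch {
    /// Encoder that should record passes tagged with `affinity`.
    fn encoder_for(&mut self, affinity: QueueAffinity) -> &mut Encoder;
    /// False when both affinities map to the same encoder.
    fn has_distinct_encoders(&self) -> bool;
}

/// Single-queue adapter (the llvmpipe case): one encoder for everything.
#[derive(Default)]
pub struct SingleQueue {
    pub encoder: Encoder,
}

impl EncoderDispatch for SingleQueue {
    fn encoder_for(&mut self, _affinity: QueueAffinity) -> &mut Encoder {
        &mut self.encoder
    }
    fn has_distinct_encoders(&self) -> bool {
        false
    }
}

/// Adapter with separate graphics and compute encoders.
#[derive(Default)]
pub struct DualQueue {
    pub graphics: Encoder,
    pub compute: Encoder,
}

impl EncoderDispatch for DualQueue {
    fn encoder_for(&mut self, affinity: QueueAffinity) -> &mut Encoder {
        match affinity {
            QueueAffinity::Graphics => &mut self.graphics,
            QueueAffinity::Compute => &mut self.compute,
        }
    }
    fn has_distinct_encoders(&self) -> bool {
        true
    }
}

/// Executor-side helper: skip the routing decision when there is one encoder.
pub fn record_pass(
    dispatch: &mut dyn EncoderDispatch,
    name: &'static str,
    affinity: QueueAffinity,
) {
    let target = if dispatch.has_distinct_encoders() {
        affinity
    } else {
        QueueAffinity::default()
    };
    dispatch.encoder_for(target).recorded.push(name);
}
```

On the single-queue adapter every pass lands on the same encoder in submission order, which is the sense in which behaviour matches the pre-routing world.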

Cross-queue barriers

The interesting part. When a pass on the compute queue produces a resource that a pass on the graphics queue consumes, you need a barrier. wgpu’s auto-tracker can’t see across queues. The framegraph’s BarrierPlan has known about these edges for a while. The new question was which encoder records the barrier.

Two choices. Either the source encoder records “I’m releasing this to the dest queue,” and the dest encoder picks it up via a queue signal/wait. Or the dest encoder records “I’m acquiring this from the source queue” before its first use. Both are valid. They produce identical results on hardware that supports timeline semaphores. They produce different results on llvmpipe, where the queue is the same and the test harness needs to verify the behaviour.

The framegraph routes to the destination encoder. Reason: it’s the encoder whose pass actually needs the resource ready. There’s a test that proves this with pointer identity:

let dest_encoder = dispatch.encoder_for(QueueAffinity::Graphics);
// ... record cross-queue barrier ...
assert_eq!(barrier_recorded_on as *const _, dest_encoder as *const _);

If you change the routing direction by accident, the pointer compare fails. Which it should.

The shape on the books today is cross_queue: bool on PassEdge, which is enough for llvmpipe single-queue testing where both encoders are the same encoder. Real timeline-semaphore widening (separate QueueSignal/QueueWait barrier kinds) lands when the test harness has a real multi-queue GPU under it. The framegraph is shaped for that widening; it just doesn’t need it yet.
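To make the destination-side routing concrete, here is a small sketch under assumptions: PassEdge and its cross_queue flag come from the framegraph, but the src and dst fields, the Encoder and Encoders types, and record_barrier are invented for the example, and the barrier itself is reduced to a counter.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QueueAffinity {
    Graphics,
    Compute,
}

/// Edge between a producing pass and a consuming pass.
#[derive(Debug, Clone, Copy)]
pub struct PassEdge {
    pub src: QueueAffinity,
    pub dst: QueueAffinity,
    pub cross_queue: bool,
}

impl PassEdge {
    pub fn new(src: QueueAffinity, dst: QueueAffinity) -> Self {
        Self { src, dst, cross_queue: src != dst }
    }
}

/// Stand-in encoder that only counts the barriers recorded on it.
#[derive(Debug, Default)]
pub struct Encoder {
    pub barriers: usize,
}

#[derive(Debug, Default)]
pub struct Encoders {
    pub graphics: Encoder,
    pub compute: Encoder,
}

impl Encoders {
    pub fn encoder_for(&mut self, affinity: QueueAffinity) -> &mut Encoder {
        match affinity {
            QueueAffinity::Graphics => &mut self.graphics,
            QueueAffinity::Compute => &mut self.compute,
        }
    }
}

/// Records the barrier for a cross-queue edge on the destination encoder
/// and returns the encoder that recorded it. Same-queue edges return None.
pub fn record_barrier<'a>(encoders: &'a mut Encoders, edge: &PassEdge) -> Option<&'a Encoder> {
    if !edge.cross_queue {
        return None;
    }
    let dest = encoders.encoder_for(edge.dst);
    dest.barriers += 1;
    Some(&*dest)
}
```

A pointer-identity check in the spirit of the test above would compare the returned reference against the graphics encoder with std::ptr::eq for a Compute to Graphics edge. Note that the comparison only discriminates between routing directions when the two encoders are distinct objects.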

texture_engine.rs

If you’ve never met a god-class file, here is one: texture_engine.rs was 2,688 lines of impl TextureEngine. It executed every texture op, dispatched every shader, owned every pool reference, ran every uniform write, brokered every readback, cached every descriptor. The class started as one method and grew. By the time it hit 2,688 lines I had stopped opening it for any reason that wasn’t load-bearing, because the file took noticeable wall-clock time to scroll through.

Carved it up. Free functions extracted first (textops_run.rs, textops_alloc.rs, textops_readback.rs, and so on), then the impl block split into 14 focused modules under engine/, each under 800 LOC. The largest is execute.rs at 776 lines, which is still big but coherent.

The rule going forward: no file in engine/*.rs may exceed 800 LOC. CI checks this with a script. When a module starts to feel god-shaped again, it gets carved before merge.
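The actual CI script isn't shown here, so the following is only a sketch of the same check written in Rust. It assumes "LOC" means raw line count (blank lines and comments included) and that only files directly inside the directory are checked; files_over_budget and MAX_LOC are hypothetical names.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Budget from the rule above.
pub const MAX_LOC: usize = 800;

/// Raw line count; blank lines and comments are counted too (an assumption).
pub fn line_count(source: &str) -> usize {
    source.lines().count()
}

/// Returns every `.rs` file directly inside `dir` whose line count exceeds
/// `limit`, sorted by path, together with its line count.
pub fn files_over_budget(dir: &Path, limit: usize) -> io::Result<Vec<(PathBuf, usize)>> {
    let mut offenders = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_rust = path.extension().and_then(|e| e.to_str()) == Some("rs");
        if !path.is_file() || !is_rust {
            continue;
        }
        let loc = line_count(&fs::read_to_string(&path)?);
        if loc > limit {
            offenders.push((path, loc));
        }
    }
    offenders.sort();
    Ok(offenders)
}
```

A CI step would call files_over_budget on the engine/ directory and fail the build when the returned list is non-empty, printing each offending path and its count.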

The carve itself is structural. Zero behaviour change. Benchmark numbers byte-identical pre and post. The reason to do this work is so the next person who has to touch the texture engine (often me, six months from now) doesn’t have to scroll through 2,688 lines to figure out what an Alloc { handle, desc } op does.

scene_bloom hits the cache 100% of the time

The previous post covered how five post-process passes started hitting BindGroupCache. scene_bloom is the headline. After the migration, it runs at 100% steady-state hit rate on the 15-frame fixed-size benchmark. Resize storms (where the output dimensions change every frame) drop it briefly, which is correct: when the dimensions change, the cache key is different, so the previous entry isn’t a hit.

That’s the gate I want every future hot-path commit to pass: at least 97% steady-state BindGroupCache hit rate on the patches that exercise it. scene_bloom is the existence proof that the threshold is achievable.
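One way to picture the measurement, as a toy model only: the cache below stores integer ids in place of real bind groups, the key fields are assumptions, and treating the first frame as warm-up is a guess at what "steady-state" means. BindGroupCache is the real name; everything else is hypothetical.

```rust
use std::collections::HashMap;

/// Assumed key shape: output dimensions plus a pass identifier.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct BindGroupKey {
    pub width: u32,
    pub height: u32,
    pub pass_id: u32,
}

/// Toy cache that hands out ids and counts hits and misses.
#[derive(Default)]
pub struct BindGroupCache {
    entries: HashMap<BindGroupKey, u64>,
    hits: u64,
    misses: u64,
}

impl BindGroupCache {
    pub fn get_or_create(&mut self, key: BindGroupKey) -> u64 {
        if let Some(&id) = self.entries.get(&key) {
            self.hits += 1;
            return id;
        }
        self.misses += 1;
        let id = self.entries.len() as u64;
        self.entries.insert(key, id);
        id
    }

    pub fn reset_stats(&mut self) {
        self.hits = 0;
        self.misses = 0;
    }

    /// Hit rate since the last reset; 0.0 when nothing was looked up, so an
    /// empty measurement window cannot pass a gate by accident.
    pub fn hit_rate(&self) -> f64 {
        let total = self.hits + self.misses;
        if total == 0 {
            0.0
        } else {
            self.hits as f64 / total as f64
        }
    }
}

/// Runs one lookup per frame and reports the hit rate after `warmup` frames.
pub fn steady_state_hit_rate(frames: &[(u32, u32)], warmup: usize) -> f64 {
    let mut cache = BindGroupCache::default();
    for (i, &(width, height)) in frames.iter().enumerate() {
        if i == warmup {
            cache.reset_stats();
        }
        cache.get_or_create(BindGroupKey { width, height, pass_id: 0 });
    }
    cache.hit_rate()
}
```

With 15 identical frames and one warm-up frame this model reports 1.0; with dimensions that change every frame it reports 0.0, which is the resize-storm behaviour described above.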

The two SSBO sentinels

When the variants landed, the bridge that turns a ResourceKind into a real wgpu resource (the framegraph_bridge.rs adapter) needed wiring for each one. Six of seven are wired. The seventh (Buffer) currently rejects at the bridge with a “deferred” message, because the only consumers that need it are the meshlet-cull completion handle and the skin-pass completion handle. Both of those are part of a wider mesh-path rewrite I’d rather land coherently.
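As a sketch of what "rejects at the bridge" might look like in code: the tag enum mirrors the seven variants, while the bridge function, the Wired placeholder, BridgeError, and the message text are all hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResourceKindTag {
    Texture2D,
    Depth,
    Cube,
    Texture3D,
    TextureArray,
    Msaa,
    Buffer,
}

/// Hypothetical error type; the message wording is invented for the sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum BridgeError {
    Deferred(&'static str),
}

/// Placeholder for whatever real wgpu resource the bridge would hand back.
#[derive(Debug, PartialEq, Eq)]
pub struct Wired(pub ResourceKindTag);

/// Six tags are wired; Buffer is rejected explicitly, with no wildcard arm,
/// so a future eighth variant forces a decision at this site.
pub fn bridge(tag: ResourceKindTag) -> Result<Wired, BridgeError> {
    match tag {
        ResourceKindTag::Texture2D
        | ResourceKindTag::Depth
        | ResourceKindTag::Cube
        | ResourceKindTag::Texture3D
        | ResourceKindTag::TextureArray
        | ResourceKindTag::Msaa => Ok(Wired(tag)),
        ResourceKindTag::Buffer => Err(BridgeError::Deferred(
            "Buffer wiring is deferred until the mesh-path rewrite",
        )),
    }
}
```

Returning an explicit error keeps the unwired case visible at the call site, which is preferable to a silent fallback that would hide it.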

Until then, the meshlet-cull and skin passes keep minting a 1×1 Texture2D sentinel for their SSBO outputs. This is the pattern the framegraph lived with for the last year because there was no real Buffer variant to replace it, and it’s the pattern that disappears when the bridge wiring lands. Two surviving sentinels in the whole codebase. Not great. Better than the alternative of landing the bridge wiring without the consumers being ready, which would invert into “Buffer exists but nothing routes through it.”

The rule for this is “no third sentinel.” If anyone tries to add a third, the review blocks it. Two is the max. Two are budgeted. Two go away when the meshlet path lands.

Rails, not the train

There’s an honest framing I want to use here, because the alternative is making a performance claim that isn’t true yet.

Every QueueAffinity::Compute tag is currently advisory. In production, the live mesh-path orchestrator submits everything through a single wgpu::CommandEncoder regardless of the tag. The async-compute test suite proves the routing mechanism is correct end-to-end (cross-queue barriers, parity, dispatch) under the test harness’s run_with_dispatch path. The live orchestrator still calls run_with_hook with one encoder.

The flip from run_with_hook to run_with_dispatch lands when the live orchestrator’s first compute-affinity pass has a reason to live on its own queue. That’s a later post. The graphics-queue bubble-reduction you’d expect from concurrent compute (typically 8 to 18 percent on bloom-heavy or IBL-bake-heavy frames) is not measurable today, because the rails exist and nothing rides them.

This post shipped the rails. The train rides them later.

What this unblocks

Six of seven resource kinds are routable. Closed sum types. Per-tag aliasing. Cross-queue barriers verified by pointer identity. A 2,688-line god-file no longer exists. scene_bloom hits the cache. The wildcard match lint is denied so the next variant won’t sneak in via a defaulting branch.

The next post is about the live mesh path finally importing the PBR module that’s been sitting next to it for months, and the white-furnace energy preservation gate that fell out of doing so. There’s a chrome sphere at roughness 0.5 that has been losing 12 to 18 percent of its incident energy, and I haven’t been noticing because the rest of the scene was bright enough to cover for it. That stops next post.


I have no idea what I’m doing or if any of this is right, but it’s fun. Follow along.
