The Train Rides the Rails

“ResourceKind Becomes a Closed Set” ended with a section called “Rails, not the train.” The framegraph had learned queue affinity. Every pass could be tagged Graphics or Compute, and the executor knew how to route a tagged pass to the right encoder. But the live render orchestrator still recorded everything into a single command encoder. The routing worked in the tests. In production it was a tag that described an intention and changed nothing.

That post said: “This post shipped the rails. The train rides them later.”

This is later.

FrameEncoderSplit

The change is small at the call site and load-bearing underneath it. Per-frame command recording now allocates a pair of encoders instead of one: a graphics encoder and a compute encoder, bundled into a FrameEncoderSplit. Every framegraph orchestrator on the live mesh path (bloom, the framegraph chain, the Render3D dispatch, post, environment export) stopped calling GraphExecutor::run_with_hook, which takes one encoder and does no routing, and started calling run_with_dispatch, which does.

run_with_dispatch reads each pass’s QueueAffinity tag and records it onto the matching encoder. At submit time the frame issues a single queue.submit([compute_cb, graphics_cb]): two command buffers in one call, which is the hint a driver needs to overlap the compute work with the graphics work that follows it.
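To make the routing concrete, here is a minimal sketch of the idea, not the engine's actual code. QueueAffinity, FrameEncoderSplit and run_with_dispatch are the names used above, but their shapes here are my assumptions; MockEncoder, Pass and into_submit_list are hypothetical stand-ins that only remember pass names, so the sketch runs without a GPU.

```rust
// Hypothetical stand-ins: none of these are the engine's real definitions.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum QueueAffinity {
    Graphics,
    Compute,
}

pub struct Pass {
    pub name: &'static str,
    pub affinity: QueueAffinity,
}

/// Stand-in for a command encoder: it only remembers which passes it recorded.
#[derive(Default)]
pub struct MockEncoder {
    pub recorded: Vec<&'static str>,
}

/// Assumed shape of the per-frame encoder pair.
#[derive(Default)]
pub struct FrameEncoderSplit {
    pub graphics: MockEncoder,
    pub compute: MockEncoder,
}

/// Record each pass onto the encoder that matches its affinity tag.
pub fn run_with_dispatch(passes: &[Pass], split: &mut FrameEncoderSplit) {
    for pass in passes {
        let encoder = match pass.affinity {
            QueueAffinity::Graphics => &mut split.graphics,
            QueueAffinity::Compute => &mut split.compute,
        };
        encoder.recorded.push(pass.name);
    }
}

impl FrameEncoderSplit {
    /// Hypothetical helper: hand back the two "command buffers" in the
    /// order the single submit uses, compute first, then graphics.
    pub fn into_submit_list(self) -> [Vec<&'static str>; 2] {
        [self.compute.recorded, self.graphics.recorded]
    }
}
```

Feeding it a Compute-tagged bloom downsample followed by two Graphics-tagged passes leaves one name on the compute side and two on the graphics side, and into_submit_list returns them compute-first.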

The Caps entry for this, meaning whether the adapter actually exposes a distinct compute queue, is plumbed through TextureEngine, so the engine knows at construction time which world it lives in.

Single-queue adapters change nothing

This is the part that made the flip safe to land.

llvmpipe, the software adapter every CI test runs on, has one queue. So do Metal, OpenGL, and WebGPU 1.0. On all of them FrameEncoderSplit::has_distinct_encoders() returns false, both affinity variants resolve to the same graphics encoder, and the submit collapses back to a single [graphics_cb] call. Behaviour is bit-for-bit identical to the pre-flip single-encoder path.
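The collapse can be sketched like this. It is an illustration under assumptions, not the real type: has_distinct_encoders is the method named above, while new, encoder_for, finish and MockEncoder are hypothetical, and the real struct wraps actual command encoders rather than lists of names.

```rust
// Hypothetical stand-ins: a sketch of the single-queue fallback only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum QueueAffinity {
    Graphics,
    Compute,
}

/// Stand-in for a command encoder: it only remembers pass names.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct MockEncoder {
    pub recorded: Vec<&'static str>,
}

pub struct FrameEncoderSplit {
    graphics: MockEncoder,
    /// None on single-queue adapters (assumed representation).
    compute: Option<MockEncoder>,
}

impl FrameEncoderSplit {
    /// `has_compute_queue` plays the role of the capability flag.
    pub fn new(has_compute_queue: bool) -> Self {
        Self {
            graphics: MockEncoder::default(),
            compute: has_compute_queue.then(MockEncoder::default),
        }
    }

    pub fn has_distinct_encoders(&self) -> bool {
        self.compute.is_some()
    }

    /// Both affinities resolve to the graphics encoder when there is no
    /// second encoder to route to.
    pub fn encoder_for(&mut self, affinity: QueueAffinity) -> &mut MockEncoder {
        match (affinity, self.compute.as_mut()) {
            (QueueAffinity::Compute, Some(compute)) => compute,
            _ => &mut self.graphics,
        }
    }

    /// Submit list: [compute, graphics] when split, [graphics] otherwise.
    pub fn finish(self) -> Vec<MockEncoder> {
        let mut command_buffers = Vec::new();
        if let Some(compute) = self.compute {
            command_buffers.push(compute);
        }
        command_buffers.push(self.graphics);
        command_buffers
    }
}
```

With new(false), a Compute-tagged pass and a Graphics-tagged pass land on the same encoder in recording order and finish returns one entry. With new(true) they land on separate encoders and finish returns two, compute first.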

So the live flip is a no-op everywhere CI can see it, and a real change only on dual-queue hardware: Vulkan and D3D12 adapters with a dedicated compute queue. QueueAffinity::Compute went from a label that described a hope to a tag that actually moves work onto a second queue. The async-compute parity test proves the two paths produce byte-equal output: same compute pass, run with the split on and with it off, pixels identical.

Buffer, end to end

The other half of this post is a deletion, and it closes something the ResourceKind post left open and a little embarrassing.

That post shipped a closed ResourceKind sum type (Texture2D, Depth, Cube, Texture3D, TextureArray, Msaa, Buffer) but admitted the Buffer variant wasn’t actually wired through the bridge that turns a resource kind into a real wgpu allocation. Two passes needed a real GPU buffer for their output: the meshlet cull pass, for a draw_count, and the skin compute pass, for a deformed_pos array. With no Buffer route, both were minting a 1x1 Texture2D sentinel: a fake one-pixel texture whose entire job was to be a handle the framegraph could track. It was the rendering equivalent of holding a place in a queue with a traffic cone. The ResourceKind post called this out and said adding a third one would be a review-block.

The fix is import_buffer, the Buffer companion to the framegraph’s existing import for textures. A caller-owned, long-lived SSBO can now be handed to the framegraph as an ImportedBuffer, the same way a long-lived texture is imported. BarrierPlan learned to tell a buffer transition from a texture transition and emits real buffer_transitions for Buffer-kind resources, where before it shoved everything onto the texture side and left the buffer resolver empty.
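A rough sketch of the BarrierPlan side, assuming a much simpler shape than the real one. The ResourceKind variants and the buffer_transitions name come from the text above; ResourceId, texture_transitions and from_accesses are hypothetical, and real transitions carry usage state rather than a bare id.

```rust
/// The closed set of resource kinds, with the variants listed above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ResourceKind {
    Texture2D,
    Depth,
    Cube,
    Texture3D,
    TextureArray,
    Msaa,
    Buffer,
}

/// Hypothetical handle type for a framegraph resource.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ResourceId(pub u32);

/// Simplified plan: only records which resources need a transition.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct BarrierPlan {
    pub texture_transitions: Vec<ResourceId>,
    pub buffer_transitions: Vec<ResourceId>,
}

impl BarrierPlan {
    /// Sort each accessed resource onto the texture or buffer side.
    pub fn from_accesses(accesses: &[(ResourceId, ResourceKind)]) -> Self {
        let mut plan = BarrierPlan::default();
        for &(id, kind) in accesses {
            // No wildcard arm: adding a variant to the closed set forces
            // a decision here instead of silently defaulting to "texture".
            match kind {
                ResourceKind::Buffer => plan.buffer_transitions.push(id),
                ResourceKind::Texture2D
                | ResourceKind::Depth
                | ResourceKind::Cube
                | ResourceKind::Texture3D
                | ResourceKind::TextureArray
                | ResourceKind::Msaa => plan.texture_transitions.push(id),
            }
        }
        plan
    }
}
```

The exhaustive match is the design choice worth noting: the old behaviour amounts to a catch-all arm that puts everything on the texture side, which is exactly what a closed sum type lets the compiler refuse.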

Both sentinels are gone. The meshlet-cull and skin-compute passes import_buffer their real SSBO, write it, and export it so the framegraph’s reachability pass keeps the writer alive. Two fake textures deleted, one grep gate added so they can’t sneak back.
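Why the export matters: a reachability pass typically walks backwards from the exported resources and keeps only the passes that feed them. The sketch below is a generic version of that idea, not the framegraph's real algorithm. Pass, live_passes and the numeric resource ids are all hypothetical, and it assumes passes are already listed in execution order.

```rust
use std::collections::HashSet;

/// Hypothetical pass description: resources are plain numeric ids.
pub struct Pass {
    pub name: &'static str,
    pub reads: Vec<u32>,
    pub writes: Vec<u32>,
}

/// Keep a pass only if something it writes is exported or is read by a
/// pass that is itself kept. Assumes `passes` is in execution order.
pub fn live_passes(passes: &[Pass], exported: &[u32]) -> Vec<&'static str> {
    let mut needed: HashSet<u32> = exported.iter().copied().collect();
    let mut live = Vec::new();
    // Walk backwards so consumers are visited before their producers.
    for pass in passes.iter().rev() {
        if pass.writes.iter().any(|r| needed.contains(r)) {
            needed.extend(pass.reads.iter().copied());
            live.push(pass.name);
        }
    }
    live.reverse();
    live
}
```

In this model a skin-compute pass whose output buffer nobody reads inside the graph is culled unless that buffer is in the exported list, which is why the real passes export what they write.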

What’s measurable, honestly

The flip is live. The train is on the rails. What I can’t show you yet is the speedometer.

The payoff of async compute, overlapping a bloom downsample or an IBL bake with the graphics work of the next frame, is a graphics-queue bubble reduction, and it only happens on hardware with a real second queue. llvmpipe doesn’t have one, so CI measures exactly nothing here. The honest claim is the one the ResourceKind post set up: the mechanism is correct, proven byte-equal under llvmpipe, and load-bearing on dual-queue silicon. The number, how much frame time it actually buys on a real GPU, gets measured when the real-hardware bench runner lands. I would love to tell you it’s 14%. I have no idea if it’s 14%.

But the structural debt is closed. Every QueueAffinity::Compute tag in the framegraph now does something. The sentinels are gone. The framegraph’s resource algebra has no asterisks left on it.

Next post starts closing the meshlet contract, and it opens with the ugliest confession in it. The skinning fraud, closed.


I have no idea what I’m doing or if any of this is right, but it’s fun. Follow along.
