Async Readback and the StagingBelt

This is the third time I’ve written a post called “why your frame times aren’t what you think they are”. The first one was on the CPU side. The second one was on the GPU-texture-engine side. This one is on the render-submission side, which turns out to have had two structural issues I’d been politely ignoring for months.

The symptom was that frame times wouldn’t go below about 14 ms on the 10k-cubes demo even on hardware that should do it in 6-7 ms. The profiler blamed “GPU wait.” That’s not a real answer. This post is me finding out what “GPU wait” actually meant.

The sync readback problem

Readback in wgpu is the operation of getting pixels back from a GPU texture to the CPU. Any node that needs to see GPU results — ExportImage, --dump-frame, and, crucially, the TextureToLayer bridge that lets the 2D Vello backend draw textures — has to do one.

Readback is async by nature. You issue a copy from the texture to a staging buffer, submit it, and at some later point (not necessarily this frame) the staging buffer becomes mappable. To read it on the CPU, you map it, which requires the device to have finished the copy. The standard way to wait for it is device.poll(PollType::Wait) — block the current thread until all outstanding work has completed.

Blocking sounds fine. It’s a waiter. But here’s what device.poll(Wait) actually does: it waits for every outstanding GPU operation, not just your copy. That includes the rendering of the current frame, the prefilter passes from your IBL bake, the mesh uploads from every primitive that touched its inputs this frame, and whatever’s been queued up since the last submit. If you call poll(Wait) mid-frame, the CPU sits and waits for the GPU to catch up on everything.

Worse, poll(Wait) torpedoes triple-buffering. The whole point of triple-buffering is to let frame N+1 start submitting while frame N is still on the GPU. poll(Wait) blocks until the GPU is quiet, which means you’re not pipelining. You’re effectively single-buffered.

I had readback_texture_sync being called every frame that emitted a DrawTexture command. Every single frame, the CPU was pausing for the GPU to finish the frame, mapping the buffer, handing it to Vello for the image cache, and then kicking off the next frame. That’s not 120 fps. That’s 60 fps at best, and only if the GPU happens to be fast enough to finish a frame in 16 ms. I think this is the right diagnosis. I was also getting 60 fps in the profiler before I found this code, and the number didn’t move afterward until I actually fixed it, so at minimum “something was stalling” was true.

The readback ring

The fix is a proper async readback pipeline. ReadbackRing is a new type in lux-render that tracks in-flight readback requests through a state machine:

Submitted → Mapping → Done

When a node requests a readback, the ring:

1. Gets a staging buffer from its pool (bucketed by power-of-two size, so the same buffer is reused across frames for same-size readbacks).
2. Encodes a copy_texture_to_buffer onto the current frame’s encoder.
3. Records the request as Submitted.
4. After submit, calls buffer.map_async(Read) with a callback that updates a shared Arc<Mutex<Option<Result<...>>>> slot when the mapping completes.
5. Transitions to Mapping.
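The steps above can be sketched without a GPU. This is a minimal model of the bookkeeping only, not the actual lux-render code: the names Request, State, begin_mapping and advance are hypothetical, and the boxed closure stands in for whatever callback the real ring hands to map_async.

```rust
use std::sync::{Arc, Mutex};

/// Result slot shared between the ring and the mapping callback.
type Slot = Arc<Mutex<Option<Result<Vec<u8>, String>>>>;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Submitted,
    Mapping,
    Done,
}

/// One in-flight readback (hypothetical shape, for illustration only).
pub struct Request {
    pub state: State,
    slot: Slot,
}

impl Request {
    /// Created once the copy has been encoded on the frame's encoder.
    pub fn new() -> Self {
        Request { state: State::Submitted, slot: Arc::new(Mutex::new(None)) }
    }

    /// Called right after submit. Returns the callback that the mapping
    /// machinery would invoke when the buffer becomes readable.
    pub fn begin_mapping(&mut self) -> Box<dyn FnOnce(Result<Vec<u8>, String>) + Send> {
        self.state = State::Mapping;
        let slot = Arc::clone(&self.slot);
        Box::new(move |result| {
            *slot.lock().unwrap() = Some(result);
        })
    }

    /// Called once per frame after the non-blocking poll. Never waits:
    /// if the slot is still empty, the request simply stays in Mapping.
    pub fn advance(&mut self) -> Option<Result<Vec<u8>, String>> {
        if self.state != State::Mapping {
            return None;
        }
        let taken = self.slot.lock().unwrap().take();
        if taken.is_some() {
            self.state = State::Done;
        }
        taken
    }
}
```

The point of the shared slot is that the callback can fire at any time without needing a reference to the ring itself; the ring only looks at the slot when it gets around to it on the next frame.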

On the next frame, the top of render_frame calls device.poll(PollType::Poll) — the non-blocking variant — which drives the callbacks forward without waiting. Any mapping that completed since last frame runs its callback, which deposits the bytes into the ring’s completed queue. The CPU picks them up and hands them to Vello’s image cache.

The key difference: Poll never blocks. It’s a pump. It drives async work forward, returns immediately, and if no work is ready, it just returns. The CPU is free the instant the pump returns. Triple-buffering works again.
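If the pump idea feels abstract, it is the same distinction std's channels make between recv and try_recv. The sketch below is only an analogy using std::sync::mpsc in place of a GPU device; pump is a hypothetical name and nothing here is wgpu API.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

/// Drain whatever has completed so far and return immediately.
/// Analogous in spirit to a non-blocking poll: an empty queue is not an
/// error, it just means nothing is ready this frame.
pub fn pump(completed: &Receiver<Vec<u8>>) -> Vec<Vec<u8>> {
    let mut ready = Vec::new();
    loop {
        match completed.try_recv() {
            Ok(bytes) => ready.push(bytes),
            // Nothing ready yet, or the producer is gone: stop without waiting.
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
        }
    }
    ready
}
```

Swapping try_recv for recv here would be the channel equivalent of poll(Wait): correct results, but the caller sits idle until something shows up.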

The cost is that snapshots lag the GPU by 1-2 frames. A TextureToLayer that samples a texture written this frame will see the texture as it was 1-2 frames ago. For most creative coding patches this is fine — the pool’s last_written_frame gate already prevents redundant reads, and the previous snapshot is always drawable, so nothing stalls waiting for a fresh one.

For export paths (ExportImage, --dump-frame), synchronous behaviour is actually what you want: the export node is expected to finish a single frame completely and write the result. The old blocking path was renamed to readback_texture_blocking and is retained exclusively for those flows.

Six new unit tests verify the ring: bucket math, empty-ring state, zero-dim rejection, single round-trip, bucket reuse across 4 frames (exactly 1 staging buffer allocated), and two simultaneous readbacks completing from one submission.
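Two of those checks, bucket math and buffer reuse, are easy to picture in isolation. Here is a toy version, assuming the bucket is simply the request size rounded up to the next power of two; bucket_size and StagingPool are hypothetical names, and the "buffers" are just ids so the allocation count can be asserted without a GPU.

```rust
use std::collections::HashMap;

/// Round a byte size up to its power-of-two bucket. Returns None for
/// zero-sized requests (and for sizes too large to round up).
pub fn bucket_size(bytes: u64) -> Option<u64> {
    if bytes == 0 {
        None
    } else {
        bytes.checked_next_power_of_two()
    }
}

/// Toy pool that counts allocations instead of creating GPU buffers.
#[derive(Default)]
pub struct StagingPool {
    free: HashMap<u64, Vec<u64>>, // bucket -> ids of idle buffers
    pub allocated: u64,
}

impl StagingPool {
    /// Take an idle buffer from the matching bucket, or "allocate" one.
    /// Returns (bucket, buffer id).
    pub fn acquire(&mut self, bytes: u64) -> Option<(u64, u64)> {
        let bucket = bucket_size(bytes)?;
        let id = match self.free.get_mut(&bucket).and_then(|v| v.pop()) {
            Some(id) => id,
            None => {
                self.allocated += 1;
                self.allocated
            }
        };
        Some((bucket, id))
    }

    /// Return a buffer to its bucket once its readback has been consumed.
    pub fn release(&mut self, bucket: u64, id: u64) {
        self.free.entry(bucket).or_default().push(id);
    }
}
```

Acquiring and releasing a same-size buffer on four consecutive frames leaves allocated at 1, which is the property the reuse test is after.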

After this change, the 10k-cubes demo ran at its full GPU-limited rate with no CPU stall. Frame times dropped from 14 ms to 8 ms on the same hardware. That is about 6 ms of “GPU wait” that had nothing to do with the GPU and everything to do with poll(Wait).

The StagingBelt

The second fix is the uniform upload path. Every shader pass in Lux carries a small uniform block — a few floats, a vec4 or two, maybe a mat4. The old path was:

let buffer = pool.get_or_create(size);
queue.write_buffer(&buffer, 0, data);
// ... record pass referencing buffer ...

queue.write_buffer looks innocent. It actually copies the data into staging memory and schedules a transfer that runs at the next submit. Every call is a separate queued operation. Ten shader passes per frame means ten queued copies, each with its own book-keeping and synchronisation overhead. On integrated GPUs with slower memory paths, this becomes a measurable fraction of frame time.

The right answer is wgpu::util::StagingBelt. The belt is a CPU-side ring of staging chunks that the application writes into and then flushes to the GPU as a batched set of copies inside the frame’s command encoder. The shape is:

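1. Passes record belt writes. Each write copies the bytes into a mapped CPU-side chunk and records a matching copy_buffer_to_buffer on the frame encoder.
2. belt.finish(): close (unmap) the chunks written this frame so the recorded copies can be submitted.
3. queue.submit(encoder): everything goes in one go.
4. belt.recall(): after the submit, hand the closed chunks back to the belt; each becomes reusable once the GPU has finished copying from it.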

Crucially, the belt doesn’t do one write_buffer per pass. Each belt write lands in a shared staging chunk and records a buffer-to-buffer copy on the frame’s own encoder, so all the uploads go to the GPU as one batch inside the frame’s command buffer. The driver sees one block of copies instead of N, and the overhead drops accordingly.
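To make the write / finish / submit / recall cycle concrete, here is a toy model of a belt in plain Rust. It is not wgpu::util::StagingBelt and does not touch the GPU: ToyBelt and CopyCmd are hypothetical, the "encoder" is just a Vec of recorded copies, and chunks are tracked by id and fill level only. What it tries to show is that several small writes share one chunk and that a recalled chunk is reused on the next frame.

```rust
/// One recorded buffer-to-buffer copy (stand-in for an encoder command).
#[derive(Debug, PartialEq, Eq)]
pub struct CopyCmd {
    pub chunk: usize,
    pub src_offset: usize,
    pub dst: &'static str,
    pub len: usize,
}

pub struct ToyBelt {
    chunk_size: usize,
    active: Vec<(usize, usize)>, // (chunk id, bytes used so far)
    closed: Vec<usize>,          // chunk ids waiting for recall
    free: Vec<usize>,            // chunk ids ready for reuse
    pub allocated: usize,        // how many chunks were ever created
}

impl ToyBelt {
    pub fn new(chunk_size: usize) -> Self {
        ToyBelt { chunk_size, active: Vec::new(), closed: Vec::new(), free: Vec::new(), allocated: 0 }
    }

    /// Reserve space in the active chunk and record the copy on `encoder`.
    pub fn write(&mut self, encoder: &mut Vec<CopyCmd>, dst: &'static str, data: &[u8]) {
        assert!(data.len() <= self.chunk_size, "write larger than a chunk");
        let needs_new = match self.active.last() {
            Some((_, used)) => used + data.len() > self.chunk_size,
            None => true,
        };
        if needs_new {
            // Prefer a recalled chunk; only allocate when none is free.
            let id = match self.free.pop() {
                Some(id) => id,
                None => {
                    self.allocated += 1;
                    self.allocated - 1
                }
            };
            self.active.push((id, 0));
        }
        let (chunk, used) = self.active.last_mut().unwrap();
        encoder.push(CopyCmd { chunk: *chunk, src_offset: *used, dst, len: data.len() });
        *used += data.len();
    }

    /// Close every active chunk so the recorded copies can be submitted.
    pub fn finish(&mut self) {
        self.closed.extend(self.active.drain(..).map(|(id, _)| id));
    }

    /// After submit: make the closed chunks available for reuse.
    pub fn recall(&mut self) {
        self.free.append(&mut self.closed);
    }
}
```

The real belt has to wait for the GPU before a recalled chunk is actually writable again; the toy skips that and frees chunks immediately, which is the main thing it glosses over.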

Migrating TextureEngine meant replacing the three hot-path queue.write_buffer calls in RunShader, RunCompute, and RunComputeBuffers with belt writes. The UniformBufferPool stays in place for the target buffers themselves; the belt only handles the upload half. I left a migration TODO on the pool for when I eventually replace both halves with a fully belt-backed path.

Lifecycle is the fiddly part. belt.finish() and belt.recall() have to bracket the submit that actually consumes the copies. Most frames have one submit at the end, which is simple. But some flows — BuildEnvironment during IBL bake, ExportImage during export — do intermediate submits. A belt_uploads_pending flag tracks whether the belt has queued work, and the helper write_uniform_via_belt returns a bool so the caller knows whether the belt path was taken. Whenever the flag is true, the surrounding submit path has to do a finish/recall pair. Annoying; necessary.
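The flag logic is small enough to sketch. This is a guess at the shape, not the real code: Uploads, the log field and the belt_available parameter are hypothetical (the post does not say when the non-belt fallback is taken), and the belt and queue calls are replaced by strings so the ordering can be checked.

```rust
#[derive(Default)]
pub struct Uploads {
    pub belt_uploads_pending: bool,
    /// Records the order of calls, purely for illustration.
    pub log: Vec<&'static str>,
}

impl Uploads {
    /// Returns true if the belt path was taken, false if the caller
    /// needs to fall back to another upload path.
    pub fn write_uniform_via_belt(&mut self, belt_available: bool) -> bool {
        if !belt_available {
            return false;
        }
        self.log.push("belt.write");
        self.belt_uploads_pending = true;
        true
    }

    /// Every submit, intermediate or end-of-frame, goes through here so
    /// that finish/recall always bracket the submit that uses the copies.
    pub fn submit(&mut self) {
        // Read and clear the flag in one step.
        let pending = std::mem::take(&mut self.belt_uploads_pending);
        if pending {
            self.log.push("belt.finish");
        }
        self.log.push("queue.submit");
        if pending {
            self.log.push("belt.recall");
        }
    }
}
```

Routing every submit through one helper is what keeps the intermediate-submit flows from forgetting the pair; a submit with nothing pending skips both calls.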

The static audit

While I was in there, I added an integration test that statically greps app.rs and texture_engine.rs for PollType::Wait, pollster::block_on, and recv_timeout — the three incantations that introduce a sync stall on the hot path. The test skips comments and #[cfg(test)] blocks. If someone accidentally reintroduces a blocking poll inside the render loop, CI fails.
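A rough sketch of what such an audit can look like, operating on a source string rather than reading files. The function name audit and the skipping rules are my simplifications for illustration: only // line comments are stripped, and a #[cfg(test)] block is assumed to end when its braces balance, so string literals containing // or braces would confuse it.

```rust
const FORBIDDEN: [&str; 3] = ["PollType::Wait", "pollster::block_on", "recv_timeout"];

/// Return (1-based line number, pattern) for every forbidden call found
/// outside comments and #[cfg(test)] blocks.
pub fn audit(source: &str) -> Vec<(usize, &'static str)> {
    let mut hits = Vec::new();
    let mut in_test = false; // saw #[cfg(test)] and its block has not closed yet
    let mut opened = false;  // the test block's first '{' has been seen
    let mut depth: i32 = 0;  // brace depth inside the test block

    for (i, raw) in source.lines().enumerate() {
        // Drop everything after a line comment marker.
        let code = raw.split("//").next().unwrap_or("");

        if !in_test && code.contains("#[cfg(test)]") {
            in_test = true;
            opened = false;
            depth = 0;
        }
        if in_test {
            for c in code.chars() {
                match c {
                    '{' => {
                        depth += 1;
                        opened = true;
                    }
                    '}' => depth -= 1,
                    _ => {}
                }
            }
            if opened && depth <= 0 {
                in_test = false;
            }
            continue; // never report hits inside test-only code
        }
        for pat in FORBIDDEN.iter() {
            if code.contains(*pat) {
                hits.push((i + 1, *pat));
            }
        }
    }
    hits
}
```

An integration test would read each audited file with std::fs::read_to_string, run this over it, and assert the result is empty; the deliberate exceptions would then need an explicit allow-list on top.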

queue.write_buffer is deliberately allowed in a handful of places: mesh vertex/index uploads, and the 3D render path’s FRAME/LIGHT/DRAW uniform blocks. Those are outside the texture-ops hot loop and rewriting them on the belt path would risk a regression for a speed-up that’s in the noise. The audit test is documented with exactly these exceptions so nobody wonders why the grep doesn’t cover them.

LuxApp::render_frame now has a doc block spelling out the three hot-path invariants:

No blocking poll on the hot path.
No synchronous readback on the hot path.
No per-op write_buffer on the hot path.

Anything that needs to break one of these has to justify itself in a comment, in the commit, or in an explicit opt-out from the audit.

The numbers

The combined impact is roughly 6 ms of frame time on the 10k-cubes demo (14 ms down to 8 ms) and about 1.5 ms on a typical 2D texture-chain patch. The cubes demo sits comfortably at 120 fps on my desktop now, and at 60 fps on the low-power integrated GPU I’ve been using as the “does this still run everywhere” reference.

What’s left? The mesh-upload path still uses queue.write_buffer for vertex and index data, which is fine — meshes upload once and stay resident. The FRAME/LIGHT/DRAW uniform blocks in the 3D render path still use queue.write_buffer directly, and a future pass will migrate those onto the belt. Nothing else on the hot path blocks.

This is the last render-side post in this cluster. The next three posts move to editor polish — welcome splash, sample patch, wire-drag refinements, the F8 profiler. Everything that makes the app feel like a tool you can hand someone, not just a render pipeline you can demo.

I have no idea what I’m doing or if any of this is right, but it’s fun. Follow along.