Zero GPU Allocations Per Frame

I’ve written this post before. Zero-Allocation Evaluation was about killing CPU allocations in the graph evaluator. Stop Cloning Everything was about borrowing instead of copying on the hot path. This one is the GPU-side sequel: the texture engine was still allocating GPU resources on every frame, and that was exactly as much of a problem as you’d expect.

The lesson keeps rhyming. The way you make a 60fps pipeline reliable is not to do less work per frame; it’s to stop pretending each frame is independent when it isn’t.

The texture engine was allocating every frame

Three separate hot paths, each allocating something it didn’t need to:

1. A fresh CommandEncoder per shader pass. The engine was creating a new wgpu::CommandEncoder, recording a single render pass, and submitting it to the queue, once per shader dispatch. A patch with ten texture filter nodes was submitting ten separate command buffers every frame. Each submit() is a round-trip to the driver. Ten round-trips per frame, sixty frames per second, just because I’d never batched them.

2. A fresh wgpu::Buffer for every uniform upload. Every shader pass needs a uniform buffer for its parameters: the amount for Brightness, the radius for Blur, the degrees for HueShift. The engine was calling device.create_buffer_init() for each one, every frame. Ten filters = ten buffer allocations per frame. Sixty frames per second = six hundred allocations per second. All of them tiny, all of them dropped as soon as the pass finished, all of them unnecessary.

3. A fresh descriptor cache rebuild on every frame. The engine was clearing its texture descriptor cache every frame and re-querying the pool for every active handle. Clearing data you’re about to rebuild is a classic “I wasn’t thinking” pattern.

All three of these were giving me frames that were “mostly fine” most of the time, with occasional stalls when the allocator happened to be busy or the driver decided to flush. Exactly the kind of intermittent hitch that kills a live show and is impossible to reproduce on demand.

The single command encoder

The fix for the per-pass encoder is almost offensively simple: create one CommandEncoder at the start of execute(), record every shader and compute pass into it, and submit once at the end of the frame.

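// One encoder for the whole frame, created at the top of execute().
let mut encoder = device.create_command_encoder(&Default::default());
for op in ops {
    match op {
        // Each op records its pass into the shared encoder; nothing is submitted here.
        TextureOp::RunShader { .. } => run_shader(&mut encoder, /* ... */),
        TextureOp::RunCompute { .. } => run_compute(&mut encoder, /* ... */),
        // ...
    }
}
// Single submission: one command buffer, one driver round-trip per frame.
queue.submit(Some(encoder.finish()));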

One encoder, one submit, one driver round-trip per frame regardless of how many texture ops happened. The number of passes scales linearly with patch complexity; the number of submissions stays at one.

For this workload there’s no downside. GPU command recording was never the bottleneck; the per-submission overhead was. Coalescing the submissions is a pure win.

The uniform buffer pool

The fix for uniform allocation is a tiny pool that reuses GPU buffers bucketed by size.

use std::collections::HashMap;

pub struct UniformBufferPool {
    // Idle buffers, keyed by bucket size in bytes (a power of two).
    free: HashMap<usize, Vec<wgpu::Buffer>>,
}

Request a uniform buffer for 16 bytes. The pool rounds up to the next power of two (still 16), checks its free list for a 16-byte bucket, and either pops an existing buffer or creates a new one. When the frame ends, every buffer gets returned to the pool.

Power-of-two bucketing means the pool has at most ~10 distinct size buckets for anything reasonable (16, 32, 64, 128, …, 8192 bytes). Shader uniforms are almost always tiny: a handful of floats, a vec4, maybe a small struct. In practice most of the buckets stay empty and most allocations land in the 16- or 32-byte bucket.

After the first frame, the pool has enough buffers for every shader pass in the patch, and the per-frame allocation count is zero. New patches warm the pool in their first frame and then run allocation-free forever.
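Here is a minimal sketch of that acquire-and-return cycle, not the engine’s actual code. To keep it runnable without a GPU it swaps wgpu::Buffer for a hypothetical FakeBuffer, and the names bucket_for, acquire, end_frame, simulate, and the created counter are all assumptions made for illustration. The 16-byte minimum bucket is also assumed, taken from the example above.

```rust
use std::collections::HashMap;

/// Stand-in for `wgpu::Buffer` so the sketch runs without a GPU device.
#[derive(Debug)]
pub struct FakeBuffer {
    pub size: usize,
}

/// Smallest bucket handed out (an assumption, matching the 16-byte example).
const MIN_BUCKET: usize = 16;

/// Round a requested size up to its power-of-two bucket.
pub fn bucket_for(size: usize) -> usize {
    size.max(MIN_BUCKET).next_power_of_two()
}

#[derive(Default)]
pub struct UniformBufferPool {
    /// Idle buffers, keyed by bucket size in bytes.
    free: HashMap<usize, Vec<FakeBuffer>>,
    /// Buffers handed out during the current frame.
    in_use: Vec<FakeBuffer>,
    /// Total buffers ever created, used to check steady-state behaviour.
    pub created: usize,
}

impl UniformBufferPool {
    /// Reserve a buffer for this frame, reusing an idle one when possible.
    /// Returns the buffer's index within the current frame.
    pub fn acquire(&mut self, size: usize) -> usize {
        let bucket = bucket_for(size);
        let buffer = match self.free.get_mut(&bucket).and_then(|list| list.pop()) {
            Some(buffer) => buffer,
            None => {
                // Only this branch allocates; a real pool would create
                // the GPU buffer here.
                self.created += 1;
                FakeBuffer { size: bucket }
            }
        };
        self.in_use.push(buffer);
        self.in_use.len() - 1
    }

    /// Return every buffer used this frame to its bucket's free list.
    pub fn end_frame(&mut self) {
        for buffer in self.in_use.drain(..) {
            self.free.entry(buffer.size).or_default().push(buffer);
        }
    }
}

/// Run `frames` frames of `passes` 16-byte uniform uploads each and
/// report how many buffers had to be created in total.
pub fn simulate(frames: usize, passes: usize) -> usize {
    let mut pool = UniformBufferPool::default();
    for _ in 0..frames {
        for _ in 0..passes {
            pool.acquire(16);
        }
        pool.end_frame();
    }
    pool.created
}
```

In this sketch, simulate(1, 10) and simulate(100, 10) both create ten buffers: the first frame warms the pool and every later frame reuses what is already there.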

SmallVec for draw commands

One more alloc-hunt that wasn’t in the texture engine at all: DrawCommand::StrokePath used to carry its path segments as a Vec<PathSegment>. Wires between nodes are drawn as bezier curves, which means two path segments per wire: a move and a cubic. Every wire draw was heap-allocating a Vec to hold two elements.

A patch with 100 wires was allocating 100 tiny Vecs per frame. None of them ever grew beyond two elements. None of them needed to be on the heap.

Swapped the field type to SmallVec<[PathSegment; 4]>. Inline storage for up to four segments, spills to the heap only for longer paths. Wire rendering now does zero allocations for the common case. Longer paths (actual curves from path nodes) still work correctly; they just take the heap path when they need to.
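To make the inline-versus-spilled behaviour concrete, here is a dependency-free sketch of the idea. It is not the smallvec crate and not the engine’s code: Segments is a hand-rolled stand-in that mimics what SmallVec<[PathSegment; 4]> does, and the PathSegment variants and the wire helper are hypothetical shapes invented for the example.

```rust
/// Hypothetical path segment type; the engine's real one will differ.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum PathSegment {
    MoveTo(f32, f32),
    /// Two control points and an end point, flattened.
    CubicTo([f32; 6]),
}

const INLINE_CAP: usize = 4;

/// Hand-rolled illustration of small-vector storage: up to four
/// segments live inline, longer paths move to a heap-backed Vec.
pub enum Segments {
    Inline { len: usize, buf: [PathSegment; INLINE_CAP] },
    Heap(Vec<PathSegment>),
}

impl Segments {
    /// Empty path; no heap allocation happens here.
    pub fn new() -> Self {
        Segments::Inline {
            len: 0,
            buf: [PathSegment::MoveTo(0.0, 0.0); INLINE_CAP],
        }
    }

    pub fn push(&mut self, seg: PathSegment) {
        match self {
            Segments::Inline { len, buf } => {
                if *len < INLINE_CAP {
                    buf[*len] = seg;
                    *len += 1;
                } else {
                    // Inline storage is full: this is the first heap allocation.
                    let mut spilled = buf.to_vec();
                    spilled.push(seg);
                    *self = Segments::Heap(spilled);
                }
            }
            Segments::Heap(vec) => vec.push(seg),
        }
    }

    pub fn as_slice(&self) -> &[PathSegment] {
        match self {
            Segments::Inline { len, buf } => &buf[..*len],
            Segments::Heap(vec) => vec.as_slice(),
        }
    }

    /// True once the path has outgrown its inline storage.
    pub fn spilled(&self) -> bool {
        matches!(self, Segments::Heap(_))
    }
}

/// A wire is a move plus one cubic, so it always fits inline.
pub fn wire(from: (f32, f32), to: (f32, f32)) -> Segments {
    let mut path = Segments::new();
    path.push(PathSegment::MoveTo(from.0, from.1));
    let mid_x = (from.0 + to.0) / 2.0;
    path.push(PathSegment::CubicTo([mid_x, from.1, mid_x, to.1, to.0, to.1]));
    path
}
```

A wire built this way holds two segments and never leaves inline storage; pushing a fifth segment is what triggers the spill. The real crate also exposes a spilled() method, which is handy for asserting the same thing in tests.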

Texture cache stays warm

The descriptor cache fix is trivial once you see it: don’t clear a cache you’re going to rebuild. The engine now treats the texture descriptor cache as persistent across frames, invalidating entries only when textures are actually freed from the pool. Every frame, the cache is already warm from the previous frame’s state, and the engine does zero work to maintain it unless something changed.

Before: every frame did N texture metadata lookups against the pool, where N was the number of active handles.
After: every frame does zero lookups unless a texture was freed or allocated.
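A rough sketch of the persistent cache, under assumptions: TextureHandle, TextureDesc, TexturePool, DescriptorCache, and the lookups counter are hypothetical names, and the pool here is a plain HashMap standing in for the real one. The point is only the shape: fill on miss, never clear, invalidate when a texture is freed.

```rust
use std::collections::HashMap;

/// Hypothetical handle and descriptor types for illustration.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct TextureHandle(pub u32);

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct TextureDesc {
    pub width: u32,
    pub height: u32,
}

/// Stand-in texture pool that counts metadata lookups.
#[derive(Default)]
pub struct TexturePool {
    textures: HashMap<TextureHandle, TextureDesc>,
    pub lookups: usize,
}

impl TexturePool {
    pub fn insert(&mut self, handle: TextureHandle, desc: TextureDesc) {
        self.textures.insert(handle, desc);
    }

    /// The per-handle query the old code repeated every frame.
    pub fn describe(&mut self, handle: TextureHandle) -> Option<TextureDesc> {
        self.lookups += 1;
        self.textures.get(&handle).copied()
    }

    pub fn free(&mut self, handle: TextureHandle) {
        self.textures.remove(&handle);
    }
}

/// Cache that persists across frames instead of being cleared.
#[derive(Default)]
pub struct DescriptorCache {
    entries: HashMap<TextureHandle, TextureDesc>,
}

impl DescriptorCache {
    /// Return the cached descriptor, asking the pool only on a miss.
    pub fn get(&mut self, pool: &mut TexturePool, handle: TextureHandle) -> Option<TextureDesc> {
        if let Some(desc) = self.entries.get(&handle) {
            return Some(*desc);
        }
        let desc = pool.describe(handle)?;
        self.entries.insert(handle, desc);
        Some(desc)
    }

    /// Called when a texture is freed; the only way an entry leaves the cache.
    pub fn invalidate(&mut self, handle: TextureHandle) {
        self.entries.remove(&handle);
    }
}

/// Touch `handles` textures on each of `frames` frames and report how
/// many pool lookups that cost in total.
pub fn lookups_over_frames(frames: usize, handles: u32) -> usize {
    let mut pool = TexturePool::default();
    let mut cache = DescriptorCache::default();
    for i in 0..handles {
        pool.insert(TextureHandle(i), TextureDesc { width: 256, height: 256 });
    }
    for _ in 0..frames {
        for i in 0..handles {
            let _ = cache.get(&mut pool, TextureHandle(i));
        }
    }
    pool.lookups
}
```

With eight handles, this sketch costs eight pool lookups whether it runs for one frame or a hundred. Clearing the cache at the top of each frame would make that 8 × frames.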

16 missing tests

While I was in the texture engine, I realised a handful of recent features had shipped without tests: the kind of “I’ll add it later” that accumulates until you’ve got a whole batch of them. I added 16 tests in one pass, covering:

  • Multi-frame handle reuse for VideoPlayer and RenderTarget (regression guards for the leaks that got fixed a few sessions ago)
  • Uniform buffer pool recycling across frames
  • Command encoder coalescing behaviour
  • Draw command SmallVec inline-vs-spilled cases
  • Descriptor cache persistence and invalidation

The leak tests are the most important ones. They load a patch, run it for 100 frames, and assert that the pool’s active-handle count stays bounded. If a node starts orphaning textures, the test fails on the first run that would have mattered.
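The shape of such a leak test can be sketched roughly like this. Everything here is a stand-in: HandlePool only counts live handles, Node is a hypothetical node that produces one texture per frame, and the leaks flag exists purely to show what the assertion would catch. The real tests load an actual patch rather than a toy node.

```rust
/// Stand-in pool that only tracks which handles are live.
#[derive(Default)]
pub struct HandlePool {
    next_id: u32,
    active: Vec<u32>,
}

impl HandlePool {
    pub fn alloc(&mut self) -> u32 {
        let id = self.next_id;
        self.next_id += 1;
        self.active.push(id);
        id
    }

    pub fn release(&mut self, id: u32) {
        self.active.retain(|&handle| handle != id);
    }

    pub fn active_count(&self) -> usize {
        self.active.len()
    }
}

/// Hypothetical node that produces one output texture per frame.
pub struct Node {
    current: Option<u32>,
    /// When true, the node forgets to release last frame's texture.
    leaks: bool,
}

impl Node {
    pub fn new(leaks: bool) -> Self {
        Node { current: None, leaks }
    }

    pub fn run_frame(&mut self, pool: &mut HandlePool) {
        if let Some(old) = self.current.take() {
            if !self.leaks {
                pool.release(old);
            }
        }
        self.current = Some(pool.alloc());
    }
}

/// Run `frames` frames and return the highest active-handle count seen.
pub fn peak_active_handles(node: &mut Node, frames: usize) -> usize {
    let mut pool = HandlePool::default();
    let mut peak = 0;
    for _ in 0..frames {
        node.run_frame(&mut pool);
        peak = peak.max(pool.active_count());
    }
    peak
}
```

A well-behaved node peaks at one active handle over 100 frames; the leaky variant peaks at 100, so asserting a small bound on the peak is enough to catch an orphaning node.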

The frame budget, again

At 60fps you get 16.6 milliseconds to do everything: graph evaluation, GPU dispatch, UI rendering, and display submission. There’s no slack. Every allocation is a small amount of memcpy plus a risk of allocator contention plus a risk of page faults plus a risk of the allocator deciding to do a slightly different thing than it did last frame.

The way you hit 16.6ms reliably is to do the same thing every frame. Same buffers. Same encoders. Same cache. The first frame is expensive; every frame after is free. That’s the contract.

The texture engine now honours that contract. It joins the graph evaluator and the layer pipeline on the list of things that do zero allocations in steady state. The list of things that still allocate is shrinking: UI panels on resize, font glyph caches on first use, new nodes being added. None of them are on the frame’s critical path.

Zero isn’t the goal. Predictable is the goal. Zero just happens to be the easiest kind of predictable to verify.
