Compute Shaders Get Real Buffers

This is the quiet post in the series. There is no screenshot at the end. The demo graph looks identical to the one in the last post. I spent a whole session on something you can’t see.

But nothing in the next four posts can ship without this one, so I’m going to walk through it anyway.

What was already there

The ShaderCache from the custom WGSL shaders post has been able to run compute shaders for a while, through a TextureOp::RunCompute path. That path is simple by design: one WGSL source, one uniform block, one write-only storage texture as output. You can do a lot with that shape. It’s how NoiseTexture and SolidColor moved onto the GPU in the trust pass, and it’s how a couple of texture filters got accelerated earlier.

What it cannot do: read and write buffers. The texture-only shape was a deliberate narrowing, because “storage texture as output” has exactly one binding type to reason about and exactly one bind group layout to cache. Every future compute-heavy node wants something different.

A particle simulation wants a storage buffer it can read from and write to, with the previous frame’s positions as input and the current frame’s positions as output. A spatial hash wants a read-write buffer of cell indices. A prefix-sum pass wants an input buffer, an intermediate, and an output, all at different bindings, all with different access modes. A fluid solver wants four or five textures bound for read plus two buffers bound for write. None of that fits into “one write-only storage texture.”

So I need a second compute path. Not to replace RunCompute, just to live alongside it. This post is the second path.

The new module

compute_buffers_cache.rs is a new module in lux-render. It’s about 370 lines. It looks a lot like the existing shader_cache.rs for compute, but with a fundamentally different binding shape:

pub struct CachedComputeBuffersPipeline {
    pipeline: wgpu::ComputePipeline,
    uniform_layout: wgpu::BindGroupLayout,          // @group(0)
    storage_buffer_layout: wgpu::BindGroupLayout,   // @group(1)
    storage_texture_layout: wgpu::BindGroupLayout,  // @group(2)
}

Three bind group layouts per pipeline. Group 0 is uniforms, always present even if the shader doesn’t declare any. Group 1 is N storage buffers, each with its own access mode (read, write, or read_write). Group 2 is M read-only storage textures. Groups 1 and 2 can be empty, but all three layouts exist unconditionally, so the @group(n) indices in WGSL stay stable regardless of whether a particular shader uses uniforms or textures. If you don’t do this and pack the groups instead, a shader that skips uniforms ends up with its buffers at @group(0), a shader that declares uniforms ends up with its buffers at @group(1), and the indices become unpredictable across shader variants. Stable indices are cheap to reserve and expensive to retrofit.
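To make the "always three groups" rule concrete, here is a minimal sketch that models the plan with plain enums instead of real wgpu layout descriptors. `GroupPlan`, `plan_groups`, and `buffer_group_index` are hypothetical names for illustration, not the module's actual API.

```rust
/// Hypothetical access modes, mirroring the three named in the post.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum BufferAccess {
    Read,
    Write,
    ReadWrite,
}

/// Stand-in for one bind group layout (assumption: not the real wgpu type).
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum GroupPlan {
    Uniforms { count: u32 },
    StorageBuffers(Vec<BufferAccess>),
    StorageTextures { count: u32 },
}

/// Always returns three groups in a fixed order, even when some are empty,
/// so group indices never shift between shader variants.
pub fn plan_groups(
    uniform_count: u32,
    buffers: &[BufferAccess],
    texture_count: u32,
) -> [GroupPlan; 3] {
    [
        GroupPlan::Uniforms { count: uniform_count },
        GroupPlan::StorageBuffers(buffers.to_vec()),
        GroupPlan::StorageTextures { count: texture_count },
    ]
}

/// Index of the storage-buffer group within a plan.
pub fn buffer_group_index(plan: &[GroupPlan; 3]) -> Option<usize> {
    plan.iter()
        .position(|group| matches!(group, GroupPlan::StorageBuffers(_)))
}
```

With this shape, a shader with no uniforms and no textures still finds its buffers at index 1, the same place as a shader that uses everything.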

The cache key

The cache key is the interesting design choice. For ShaderCache::RunCompute I had been hashing just the source string, because the binding layout was fixed. For this new cache the layout is no longer fixed: two requests with byte-identical source can still ask for different buffer counts, different texture counts, or different access modes on their buffer bindings. So the cache key has to include everything that affects pipeline compilation:

struct ComputeBuffersCacheKey {
    source_hash: u64,          // FNV-1a of the WGSL source
    uniform_count: u32,        // 0 or 1
    buffer_access: Vec<BufferAccess>,  // per-binding access mode
    texture_count: u32,        // read-only storage textures
}

FNV-1a because I wanted something stable, fast, and unambiguous. The per-binding BufferAccess fingerprint is what lets “same source, different access modes” compile to distinct pipelines. If you write a shader that treats binding 0 as read-only and someone else writes a byte-identical shader that treats binding 0 as read-write, those are two different pipelines. They have to be. The bind group layout descriptor is different, and wgpu will validate against it.
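As a rough illustration of how such a key could behave, here is a self-contained sketch with a standard 64-bit FNV-1a and a derived `Hash`/`Eq` on the key. The derives, the `make_key` helper, and the `BufferAccess` variants are assumptions for the example; only the field names come from the struct above.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum BufferAccess {
    Read,
    Write,
    ReadWrite,
}

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct ComputeBuffersCacheKey {
    pub source_hash: u64,
    pub uniform_count: u32,
    pub buffer_access: Vec<BufferAccess>,
    pub texture_count: u32,
}

/// 64-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    bytes
        .iter()
        .fold(OFFSET_BASIS, |hash, &byte| (hash ^ u64::from(byte)).wrapping_mul(PRIME))
}

/// Hypothetical helper: build a key from the source plus the requested layout.
pub fn make_key(
    source: &str,
    uniform_count: u32,
    buffer_access: &[BufferAccess],
    texture_count: u32,
) -> ComputeBuffersCacheKey {
    ComputeBuffersCacheKey {
        source_hash: fnv1a_64(source.as_bytes()),
        uniform_count,
        buffer_access: buffer_access.to_vec(),
        texture_count,
    }
}
```

Two keys built from the same source but with `Read` versus `ReadWrite` on binding 0 share a `source_hash` and still compare unequal, which is the property the access-mode fingerprint exists for.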

Cache miss: compile the shader, build the three layouts, construct the pipeline, store the result. Cache hit: return the Arc-wrapped entry. Compile failures are cached as failures (the same lesson from the mesh pipeline cache cleanup last month: cache success and failure, otherwise a broken shader re-spends 50ms per frame trying to re-compile).
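A minimal sketch of the "cache success and failure" idea, assuming a plain `HashMap` keyed by a `u64` and a `String` error type. `PipelineCache`, `Pipeline`, and the `compile_calls` counter are stand-ins for illustration, not the real module.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for a compiled pipeline (assumption: the real entry holds wgpu objects).
#[derive(Debug)]
pub struct Pipeline {
    pub label: String,
}

pub struct PipelineCache {
    // Both outcomes are stored, so a broken shader is compiled only once.
    entries: HashMap<u64, Result<Arc<Pipeline>, String>>,
    pub compile_calls: u32,
}

impl PipelineCache {
    pub fn new() -> Self {
        PipelineCache { entries: HashMap::new(), compile_calls: 0 }
    }

    /// Returns the cached outcome for `key`, compiling at most once per key.
    pub fn get_or_compile(
        &mut self,
        key: u64,
        compile: impl FnOnce() -> Result<Pipeline, String>,
    ) -> Result<Arc<Pipeline>, String> {
        if let Some(cached) = self.entries.get(&key) {
            // Hit: cloning an Arc or an error string is cheap next to a recompile.
            return cached.clone();
        }
        // Miss: compile, then remember the result whether it worked or not.
        self.compile_calls += 1;
        let outcome = compile().map(Arc::new);
        self.entries.insert(key, outcome.clone());
        outcome
    }
}
```

The second request for a failed key returns the stored error without invoking the compile closure, so a broken shader costs one compile rather than one per frame.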

The #import question

Here’s the part I almost got wrong.

The existing compute path in ShaderCache::RunCompute sends raw WGSL through wgpu::Device::create_shader_module. It works. It also doesn’t handle #import lux::sdf_ops::smin or any of the other shared helper modules I’ve been building up in lux-render/src/shaders/modules/. Those imports get resolved by naga_oil, which the texture engine’s PrismComposer wraps, and up until now the composer was only on the render-shader path (for RunShader, fullscreen fragment passes, and the mesh pipeline).

If I shipped compute_buffers_cache without routing through the composer, I’d have a compute path where #import lux::random::pcg doesn’t work and a render path where it does. That’s the kind of inconsistency that looks fine until you try to move a helper function between a render shader and a compute shader and discover that two halves of the same engine speak two different dialects of WGSL.

So there’s a new helper, compile_via_composer_for_compute, that routes the source through PrismComposer before handing the result to wgpu. #import lux::* resolves identically whether you’re in a fullscreen fragment shader, a mesh vertex shader, or a compute shader dispatching 64 threads. Import lux::sdf_ops::smin in a compute shader and it works. Import lux::random::pcg in a fragment shader and it works. One composer, one import graph, no dialects.

On composer failure the new path falls back to raw WGSL, same as RunShader does, so a shader with no imports is still valid. No regressions, one new capability.
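The fallback can be pictured as a small wrapper around whatever the composer step is. The sketch below takes the composer as a closure so it stays self-contained. `ShaderText`, `compose_or_fallback`, and the warning text are illustrative assumptions, not the signature of compile_via_composer_for_compute.

```rust
/// Which text ends up being handed to the GPU API.
#[derive(Debug, PartialEq, Eq)]
pub enum ShaderText {
    /// The composer succeeded; imports are resolved.
    Composed(String),
    /// The composer failed; the original source is passed through unchanged.
    RawFallback(String),
}

/// Try the composer first; on failure, log and fall back to the raw source.
pub fn compose_or_fallback<F>(source: &str, compose: F) -> ShaderText
where
    F: FnOnce(&str) -> Result<String, String>,
{
    match compose(source) {
        Ok(expanded) => ShaderText::Composed(expanded),
        Err(reason) => {
            // Assumption: a warning is logged here; the post does not say so.
            eprintln!("composer failed ({}); falling back to raw WGSL", reason);
            ShaderText::RawFallback(source.to_owned())
        }
    }
}
```

Returning an enum rather than a bare string keeps it visible downstream which path a shader took, which helps when a missing import silently turns into a raw-WGSL compile error.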

Two-phase borrow

This is a small detail but it took me 20 minutes of fighting the borrow checker so it’s going in the post.

The dispatch function, run_compute_buffers_pass, has to do two things: warm the cache (which takes &mut ComposerAndCache because it might compile a new pipeline) and then build bind groups and dispatch (which needs immutable access to the cached pipeline plus mutable access to the buffer and texture pools to mark handles in use). If you try to do both with a single borrow you end up holding &mut self across the dispatch, which prevents the pool borrows, which prevents the dispatch.

The fix is the boring standard trick: split into two phases.

// Phase 1: warm the cache (mutable composer + mutable cache)
let pipeline = {
    let (composer, cache) = (&mut self.composer, &mut self.compute_cache);
    cache.get_or_compile(composer, source, key)?.clone()
};

// Phase 2: build bind groups and dispatch (immutable pipeline,
// mutable pool borrows)
let bg_uniforms = build_uniform_bg(&pipeline, &uniforms, &mut self.buffer_pool);
let bg_buffers = build_buffer_bg(&pipeline, &bindings, &mut self.buffer_pool);
// ...
encoder.begin_compute_pass(...)

The cache hands out an Arc<CachedComputeBuffersPipeline>, the first phase completes, the &mut borrows on composer and cache both end at the close of the block, and phase two gets clean mutable access to the pools. Simple, once you see the shape. The lesson, for anyone who’s going to write a fourth pool at some point: always make your cache entries cheap to clone, because the alternative is a lifetime puzzle that locks you into a specific dispatch order.
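For anyone who wants to poke at the pattern outside the engine, here is a toy version that compiles on its own. Every type in it (`Engine`, `Composer`, `BufferPool`, `Pipeline`) is a simplified stand-in. The one deliberate choice is that phase two calls a `&mut self` method, which only type-checks because phase one returned an owned `Arc` rather than a reference into the cache.

```rust
use std::collections::HashMap;
use std::sync::Arc;

pub struct Pipeline {
    pub id: u64,
}

pub struct Composer {
    pub compiles: u32,
}

pub struct BufferPool {
    pub in_use: Vec<u32>,
}

pub struct Engine {
    pub composer: Composer,
    pub cache: HashMap<u64, Arc<Pipeline>>,
    pub buffer_pool: BufferPool,
}

impl Engine {
    pub fn new() -> Self {
        Engine {
            composer: Composer { compiles: 0 },
            cache: HashMap::new(),
            buffer_pool: BufferPool { in_use: Vec::new() },
        }
    }

    /// Phase 2 helper. It takes `&mut self`, so nothing from phase 1
    /// may still be borrowing from `self` when it is called.
    fn bind_and_dispatch(&mut self, pipeline: &Pipeline, handles: &[u32]) -> u64 {
        for &handle in handles {
            self.buffer_pool.in_use.push(handle);
        }
        pipeline.id
    }

    pub fn run_pass(&mut self, key: u64, handles: &[u32]) -> u64 {
        // Phase 1: warm the cache. The mutable borrows live only inside this block.
        let pipeline: Arc<Pipeline> = {
            let (composer, cache) = (&mut self.composer, &mut self.cache);
            let entry = cache.entry(key).or_insert_with(|| {
                composer.compiles += 1;
                Arc::new(Pipeline { id: key })
            });
            Arc::clone(entry)
        };

        // Phase 2: `pipeline` is owned, so `self` is free to be borrowed again.
        self.bind_and_dispatch(&pipeline, handles)
    }
}
```

Changing phase one to return `&Arc<Pipeline>` instead of a clone makes the `bind_and_dispatch` call fail to compile, which is a quick way to reproduce the original fight with the borrow checker.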

Missing handles

One lifecycle detail worth pinning down before it becomes a bug in the next post.

If the shader is told to bind a storage buffer whose handle is stale, or a storage texture whose handle got reclaimed, the dispatch logs a warning and skips. It does not panic. This matches the instanced-render path’s tolerance for stale handles, and it matches the spirit of the lifecycle contract: pools must treat unknown handle IDs as “log warn, not crash.”

This matters because the next three posts are all going to be stacking compute dispatches on top of each other in chains that are harder to reason about than a single render pass. A transient stale handle during an undo or a hot-reload is not a reason to take the editor down. It’s a reason to skip the dispatch for one frame and let the producer re-allocate on its next process() call.
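The "log warn, not crash" contract can be sketched as a lookup that returns `Option` and a dispatch that bails out early. The names below (`BufferHandle`, `DispatchOutcome`, `try_dispatch`) are hypothetical, and the pool is reduced to a map so the example stands alone.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct BufferHandle(pub u32);

#[derive(Debug, PartialEq, Eq)]
pub enum DispatchOutcome {
    Dispatched,
    /// The dispatch was skipped because this handle was not in the pool.
    SkippedStaleHandle(BufferHandle),
}

pub struct BufferPool {
    live: HashMap<BufferHandle, Vec<u8>>,
}

impl BufferPool {
    pub fn new() -> Self {
        BufferPool { live: HashMap::new() }
    }

    pub fn insert(&mut self, handle: BufferHandle, bytes: Vec<u8>) {
        self.live.insert(handle, bytes);
    }

    /// Unknown handles are a normal outcome, not an error worth panicking over.
    pub fn get(&self, handle: BufferHandle) -> Option<&Vec<u8>> {
        self.live.get(&handle)
    }
}

/// Check every binding before dispatching; warn and skip on the first stale one.
pub fn try_dispatch(pool: &BufferPool, bindings: &[BufferHandle]) -> DispatchOutcome {
    for &handle in bindings {
        if pool.get(handle).is_none() {
            eprintln!("warn: stale buffer handle {:?}, skipping dispatch", handle);
            return DispatchOutcome::SkippedStaleHandle(handle);
        }
    }
    // A real implementation would build bind groups and record the pass here.
    DispatchOutcome::Dispatched
}
```

Returning the outcome instead of `()` gives callers in a dispatch chain a way to decide whether later passes that depend on this one should also skip the frame.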

The smoke test

I wrote exactly one end-to-end test for this module, and it’s the test I care about:

#[test]
fn compute_buffers_writes_storage_buffer_end_to_end() {
    // Compile a real WGSL compute shader.
    // Allocate a 64-element storage buffer.
    // Dispatch 1 workgroup of 64 threads.
    // Each thread writes `gid * 3 + 1` to its slot.
    // Copy the buffer back through a MAP_READ staging buffer.
    // Assert all 64 u32 values match.
}

64 threads, one dispatch, verify every slot. This is the shape of the smallest useful compute shader you can write, and if it passes I know: the composer is routing correctly, the cache is returning a usable pipeline, the bind group layout matches the shader’s declared bindings, the buffer pool dispatched AllocBuffer and UploadBuffer correctly, the compute pass actually ran, and the copy-back works. Six things, one test, and it runs on Mesa llvmpipe so CI gets it for free.
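For reference, the shader side of a test like this can be tiny. The sketch below holds a plausible WGSL source as a Rust constant, plus the CPU-side expectation it would be checked against. The binding placement at @group(1) follows the layout described earlier; the variable name `slots` and both helper functions are assumptions, not the test's actual code.

```rust
/// A plausible WGSL source for the smoke test (assumption: not the real shader).
pub const SMOKE_TEST_WGSL: &str = r#"
@group(1) @binding(0) var<storage, read_write> slots: array<u32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    slots[gid.x] = gid.x * 3u + 1u;
}
"#;

pub const SLOT_COUNT: u32 = 64;

/// What each slot should hold after one workgroup of 64 threads has run.
pub fn expected_slots() -> Vec<u32> {
    (0..SLOT_COUNT).map(|gid| gid * 3 + 1).collect()
}

/// Index of the first slot that differs from the expectation, if any.
/// A length mismatch reports the first index past the shorter slice.
pub fn first_mismatch(readback: &[u32]) -> Option<usize> {
    let expected = expected_slots();
    if readback.len() != expected.len() {
        return Some(readback.len().min(expected.len()));
    }
    readback
        .iter()
        .zip(&expected)
        .position(|(got, want)| got != want)
}
```

Reporting the first mismatching index rather than a bare pass/fail makes a partial write (say, only the first 32 slots) much easier to diagnose from a CI log.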

Plus six unit tests for the cache key. They cover key stability across equivalent inputs, and four of them prove that each of the four signature discriminators (access mode, buffer count, texture count, source) produces distinct pipelines. The access-mode test is the important one. It’s the thing that’ll save me the first time I have two shaders with the same source and different intents.

What this unblocks

Everything I’m building for the rest of Phase 9 lives on top of this module.

GPU particles need a position buffer, a velocity buffer, and a simulation compute shader that reads the previous frame and writes the next. That’s two read/write storage buffers per dispatch. Not possible with the old texture-only compute path. Possible now.
Spatial hashing for any future collision or neighbour-lookup node needs a large uint buffer that a compute shader builds and a second compute shader reads from. Two dispatches, two buffer bindings, different access modes per dispatch. Possible now.
Prefix sums for any sort or compact operation need the three-buffer scan shape. Possible now.
Fluid solvers want a mix of textures for the velocity field plus buffers for particle positions. That’s the three-bind-group layout exactly. Possible now.

None of that ships in this post. All of it is unblocked.

What it feels like

The demo graph is the same 10,000 cubes from last week. The screen is identical. The only thing that changed is that somewhere in lux-render, there is now a 370-line file and a smoke test that together mean “the compute path can read and write real buffers.”

Next post is the one this was all for. SDF nodes. A fragment graph that compiles down to a single WGSL shader on the fly. The Prism composer from this post, working on user-authored graphs instead of hand-written source. This is where Lux starts doing the thing that made me want to build it in the first place.