10,000 Cubes in One Draw Call

There’s a number you hit, building a creative coding tool, where the single-draw-call-per-object model stops being adequate. It’s somewhere between a hundred and a thousand. Enough that the per-draw CPU overhead (bind group switches, uniform uploads, dispatch) starts to dominate whatever shading math the GPU is actually doing. Every tool in this space has to deal with it eventually. For Lux, today, I wanted that number to be 10,000.

This is the GPU instancing post.

The idea, briefly

When you draw 10,000 cubes, you don’t want to issue 10,000 draw calls. You want to issue one draw call that says “draw this cube 10,000 times, and for each instance, look up its model matrix from this buffer.” The GPU loops over the instances internally, runs the vertex shader 8 × 10,000 = 80,000 times (once per cube vertex per instance), and shades the resulting fragments in parallel.

The draw-call side is effectively free. The bottleneck becomes “how fast can you tell the GPU where each instance goes,” which is a question about uploading a 64-byte matrix per instance to a GPU buffer and not touching it again until something changes.
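To put a number on that upload, here is a minimal sketch of the packing step. The Mat4 alias and both function names are made up for illustration and are not Lux’s actual API; the sketch assumes column-major f32 matrices written in little-endian byte order.

```rust
/// Column-major 4x4 matrix: four vec4 columns of f32 (hypothetical alias).
pub type Mat4 = [[f32; 4]; 4];

/// 16 floats * 4 bytes = 64 bytes per instance.
pub const BYTES_PER_INSTANCE: usize = std::mem::size_of::<Mat4>();

/// Size of the instance buffer needed for `count` instances.
pub fn instance_buffer_bytes(count: usize) -> usize {
    count * BYTES_PER_INSTANCE
}

/// Flatten the matrices column by column into one tightly packed byte vector,
/// starting at offset zero, ready to hand to a buffer upload.
pub fn pack_instances(mats: &[Mat4]) -> Vec<u8> {
    let mut out = Vec::with_capacity(instance_buffer_bytes(mats.len()));
    for m in mats {
        for col in m {
            for v in col {
                // Assumption: the GPU expects little-endian f32s.
                out.extend_from_slice(&v.to_le_bytes());
            }
        }
    }
    out
}
```

For 10,000 instances that works out to 640,000 bytes, which is small enough to upload once and forget about.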

Everything below is the plumbing to make that possible.

A new pin type

PinType::InstancedMesh is the wire format for an instanced draw. The underlying struct is 20 bytes:

pub struct InstancedMesh {
    pub mesh: MeshHandle,              // 8 bytes
    pub instance_buffer: BufferHandle, // 8 bytes
    pub count: u32,                    // 4 bytes
}

One mesh, one buffer of per-instance matrices, one count. That’s the entire contract. A node that produces an InstancedMesh is promising: “here is a mesh handle you can draw, here is a buffer the GPU can read model matrices from, and there are exactly count of them packed at offset zero.”

The 40-byte PinValue invariant still holds comfortably. InstancedMesh is 20 bytes inline, no box needed.
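The size arithmetic is easy to pin down with compile-time assertions. This is a sketch under one big assumption: that each handle is a pair of u32s (a slot index and a generation counter). The handle internals are my guess for illustration, not something the struct above shows.

```rust
use std::mem::size_of;

// Assumed handle layout: two u32 fields, so 8 bytes with 4-byte alignment.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct MeshHandle {
    pub index: u32,
    pub generation: u32,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct BufferHandle {
    pub index: u32,
    pub generation: u32,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct InstancedMesh {
    pub mesh: MeshHandle,
    pub instance_buffer: BufferHandle,
    pub count: u32,
}

// Fail the build if the struct ever outgrows its budget.
const _: () = assert!(size_of::<InstancedMesh>() == 20);
const _: () = assert!(size_of::<InstancedMesh>() <= 40);
```

The alignment matters here: a handle wrapping a single u64 would be 8-byte aligned and would pad the struct out to 24 bytes, which still fits the 40-byte budget but no longer matches the 20-byte figure.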

A real BufferPool

Last post alluded to BufferPool existing “for future use.” This is the post where it stops being a stub.

BufferPool is the third pool in the render layer, after TexturePool and MeshPool. Same shape: handle-based allocation, free-list reuse, frame-based eviction, a mark_in_use op to keep entries alive. New this session: a BufferPool::upload method that wraps queue.write_buffer with a bounds check and a length log on the debug path.

The texture engine now dispatches AllocBuffer, UploadBuffer, FreeBuffer, and MarkBufferInUse to the real pool instead of the warning stubs that were there as placeholders. The frame-periodic sweep that used to cover only textures and meshes now covers buffers too. A buffer that hasn’t been touched in ~180 frames gets reclaimed, same as a stale mesh.
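For readers who haven’t seen this pool shape before, here is a CPU-only model of it. A Vec<u8> stands in for each GPU buffer, and every name and signature below is an illustrative assumption rather than Lux’s real code; the real upload wraps queue.write_buffer, which this sketch replaces with a slice copy.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct BufferHandle {
    index: u32,
    generation: u32,
}

struct Entry {
    bytes: Vec<u8>, // stand-in for the GPU buffer
    last_used_frame: u64,
}

struct Slot {
    generation: u32,
    entry: Option<Entry>,
}

pub struct BufferPool {
    slots: Vec<Slot>,
    free: Vec<u32>, // indices of empty slots, reused before growing
    frame: u64,
    max_idle_frames: u64,
}

impl BufferPool {
    pub fn new(max_idle_frames: u64) -> Self {
        Self { slots: Vec::new(), free: Vec::new(), frame: 0, max_idle_frames }
    }

    pub fn alloc(&mut self, size: usize) -> BufferHandle {
        let index = match self.free.pop() {
            Some(i) => i,
            None => {
                self.slots.push(Slot { generation: 0, entry: None });
                (self.slots.len() - 1) as u32
            }
        };
        let frame = self.frame;
        let slot = &mut self.slots[index as usize];
        slot.entry = Some(Entry { bytes: vec![0; size], last_used_frame: frame });
        BufferHandle { index, generation: slot.generation }
    }

    fn entry_mut(&mut self, h: BufferHandle) -> Option<&mut Entry> {
        let slot = self.slots.get_mut(h.index as usize)?;
        if slot.generation != h.generation {
            return None; // stale handle: the slot was reclaimed since
        }
        slot.entry.as_mut()
    }

    /// Bounds-checked write; also counts as a use. Returns false on failure.
    pub fn upload(&mut self, h: BufferHandle, offset: usize, data: &[u8]) -> bool {
        let frame = self.frame;
        let entry = match self.entry_mut(h) {
            Some(e) => e,
            None => return false,
        };
        let end = match offset.checked_add(data.len()) {
            Some(end) if end <= entry.bytes.len() => end,
            _ => return false,
        };
        entry.bytes[offset..end].copy_from_slice(data);
        entry.last_used_frame = frame;
        true
    }

    pub fn mark_in_use(&mut self, h: BufferHandle) -> bool {
        let frame = self.frame;
        match self.entry_mut(h) {
            Some(e) => {
                e.last_used_frame = frame;
                true
            }
            None => false,
        }
    }

    pub fn is_alive(&self, h: BufferHandle) -> bool {
        match self.slots.get(h.index as usize) {
            Some(s) => s.generation == h.generation && s.entry.is_some(),
            None => false,
        }
    }

    /// Advance one frame, then reclaim entries idle for more than max_idle_frames.
    pub fn end_frame(&mut self) -> usize {
        self.frame += 1;
        let (frame, max_idle) = (self.frame, self.max_idle_frames);
        let mut reclaimed = 0;
        for (i, slot) in self.slots.iter_mut().enumerate() {
            let stale = match &slot.entry {
                Some(e) => frame - e.last_used_frame > max_idle,
                None => false,
            };
            if stale {
                slot.entry = None;
                slot.generation = slot.generation.wrapping_add(1);
                self.free.push(i as u32);
                reclaimed += 1;
            }
        }
        reclaimed
    }
}
```

The generation counter is what makes a reclaimed handle fail loudly instead of silently aliasing whatever gets allocated into the same slot next.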

This is entirely unglamorous, the kind of work that doesn’t change any visible behaviour on its own, and it’s the thing that unlocks everything in this post.

The instanced pipeline variant

mesh_pipeline_cache learned a new cache key component: instanced: bool. When set, the pipeline compiles with a MESH_INSTANCED shader-def, and the vertex state gets a second VertexBufferLayout alongside the main one. The second layout has step_mode: VertexStepMode::Instance and describes a 64-byte packed mat4 at vertex shader locations 4, 5, 6, and 7. One vec4 column of the matrix per location, because a single vertex attribute can carry at most four components, and WGSL constructs a mat4x4<f32> from four column vectors.
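The layout arithmetic looks like this. To keep the sketch free of dependencies, it uses small stand-in types (StepMode, Attribute, BufferLayout) that I invented to mirror the shape of the real thing; in wgpu terms it corresponds to a VertexBufferLayout with four Float32x4 attributes.

```rust
// Stand-in types for illustration only; these are not wgpu's types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum StepMode {
    Vertex,   // advance once per vertex
    Instance, // advance once per instance
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Attribute {
    pub shader_location: u32,
    pub offset: u64, // byte offset inside one instance record
    pub size: u64,   // bytes covered by this attribute
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct BufferLayout {
    pub array_stride: u64,
    pub step_mode: StepMode,
    pub attributes: Vec<Attribute>,
}

/// Describe one column-major mat4 per instance as four consecutive vec4
/// attributes, starting at `first_location`.
pub fn instance_matrix_layout(first_location: u32) -> BufferLayout {
    const VEC4_BYTES: u64 = 16; // 4 floats * 4 bytes
    let attributes = (0..4u32)
        .map(|col| Attribute {
            shader_location: first_location + col,
            offset: u64::from(col) * VEC4_BYTES,
            size: VEC4_BYTES,
        })
        .collect();
    BufferLayout {
        array_stride: 4 * VEC4_BYTES, // 64 bytes between instances
        step_mode: StepMode::Instance,
        attributes,
    }
}
```

Calling instance_matrix_layout(4) gives locations 4 through 7 at byte offsets 0, 16, 32, and 48, with a 64-byte stride.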

Inside mesh.wgsl, the vertex shader main function now has a branch:

#ifdef MESH_INSTANCED
    let model = mat4x4<f32>(
        in.model_c0,
        in.model_c1,
        in.model_c2,
        in.model_c3,
    );
#else
    let model = draw.model;
#endif

Non-instanced draws read the model matrix from the per-draw uniform, the way they always did. Instanced draws read it from the instance vertex stream, one unique matrix per instance, no uniform buffer touched. Same vertex shader, same fragment shader, one #ifdef.

render_3d::execute now takes a &BufferPool alongside the mesh pool, warms the distinct (MaterialClass, instanced) pipeline set, binds the instance buffer at vertex slot 1 for instanced draws, and issues rpass.draw_indexed(0..n_indices, 0, 0..count) where count is the instance count from the InstancedMesh bundle. The GPU does the rest.

The authoring nodes

Two new nodes in lux-scene-primitives:

GridTransforms3D. Rows, columns, size, height. Emits a Spread<Matrix4>: one translation matrix per grid cell, arranged in a rectangular XZ grid, with an optional per-cell Y offset driven by a sine wave of distance from the origin. This is an authoring helper, nothing more. It gives me a deterministic, visually interesting set of 10,000 transforms so I have something to stress-test the instancing path with. Cols × rows is clamped at 1024 × 1024 for paranoia.
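A rough sketch of what a generator like this might compute. I am assuming here that size means the spacing between cells, that height is the sine amplitude, that the grid is centred on the origin, and that each dimension is clamped separately; the function names and the Mat4 alias are hypothetical.

```rust
/// Column-major 4x4 matrix (hypothetical alias).
pub type Mat4 = [[f32; 4]; 4];

/// Column-major translation matrix: the offset lives in the fourth column.
pub fn translation(x: f32, y: f32, z: f32) -> Mat4 {
    [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [x, y, z, 1.0],
    ]
}

/// One translation per cell of a rows x cols grid in the XZ plane.
/// Y is displaced by a sine of the cell's distance from the origin.
pub fn grid_transforms(rows: u32, cols: u32, spacing: f32, height: f32) -> Vec<Mat4> {
    // Assumption: each dimension is clamped to 1024 on its own.
    let rows = rows.min(1024);
    let cols = cols.min(1024);
    let mut out = Vec::with_capacity(rows as usize * cols as usize);

    // Offsets that centre the grid on the origin.
    let x0 = -(cols as f32 - 1.0) * 0.5 * spacing;
    let z0 = -(rows as f32 - 1.0) * 0.5 * spacing;

    for r in 0..rows {
        for c in 0..cols {
            let x = x0 + c as f32 * spacing;
            let z = z0 + r as f32 * spacing;
            let dist = (x * x + z * z).sqrt();
            out.push(translation(x, height * dist.sin(), z));
        }
    }
    out
}
```

Because the output depends only on the four parameters, the same inputs always produce the same spread, which is exactly what the content hash downstream wants to see.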

InstancedMesh. Takes a Mesh and a Spread<Matrix4>, produces an InstancedMesh. This is the node that actually does the upload. It’s stateful, holds a BufferHandle across frames, and here’s the interesting bit: it blake3-hashes the packed column-major bytes of the input transform spread and only uploads when the hash changes.

The upshot is that a 10,000-instance static grid uploads once at startup. Every frame after that, the input hash matches the cached hash, the node skips the upload entirely, and the per-frame cost of 10,000 instances is: one hash computation (fast), one mark_buffer_in_use op, one render_3d op with count = 10000. The vertex shader fires 80,000 times on the GPU and nobody on the CPU does any work worth mentioning.

If you do want the transforms to animate every frame (say, feeding in a noise field), you hand in a new spread every frame, the hash misses, the buffer uploads, and you pay 640 KB of PCIe traffic per frame. Still one draw call. Still 10,000 cubes.
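The hash-gated upload can be sketched in a few lines. Two loud assumptions: the standard library’s DefaultHasher stands in for blake3 so the example has no dependencies (it is a 64-bit non-cryptographic hash, so the collision guarantees are weaker), and InstanceUploader with its process method is a name I invented, not the real node.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Column-major 4x4 matrix (hypothetical alias).
pub type Mat4 = [[f32; 4]; 4];

/// Pack matrices column by column into tightly packed little-endian bytes.
fn pack(mats: &[Mat4]) -> Vec<u8> {
    let mut out = Vec::with_capacity(mats.len() * 64);
    for m in mats {
        for col in m {
            for v in col {
                out.extend_from_slice(&v.to_le_bytes());
            }
        }
    }
    out
}

/// Stand-in for blake3: hash the packed bytes down to a u64.
fn content_hash(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(bytes);
    hasher.finish()
}

pub struct InstanceUploader {
    last_hash: Option<u64>,
    pub uploads: usize,
}

impl InstanceUploader {
    pub fn new() -> Self {
        Self { last_hash: None, uploads: 0 }
    }

    /// Calls `upload` with the packed bytes only when the content changed.
    /// Returns true if an upload happened.
    pub fn process(&mut self, transforms: &[Mat4], mut upload: impl FnMut(&[u8])) -> bool {
        let bytes = pack(transforms);
        let hash = content_hash(&bytes);
        if self.last_hash == Some(hash) {
            return false; // steady state: nothing to send
        }
        upload(&bytes);
        self.last_hash = Some(hash);
        self.uploads += 1;
        true
    }
}
```

Note that the sketch still packs and hashes all 640 KB on every call; the saving is the upload, not the traversal of the input.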

The stale handle bug

Here’s the part where I got humbled.

I wired up the demo patch: Box → InstancedMesh(GridTransforms3D(100, 100)) → RenderScene. First run: perfect. 10,000 Lambert-shaded cubes on a sine wave, one draw call, locked solid at frame budget. I took a screenshot. I wrote half of this blog post in my head. I went to get coffee.

I came back, moved the mouse over the orbit camera, and watched the cubes vanish. Not gradually. Not in groups. All at once, about three seconds after I stopped interacting with the patch. The logs were full of warnings from the render pass: instance buffer missing for draw 0.

The bug took an hour to find and is going to take two paragraphs to explain.

The lifecycle contract from the first post says: every stateful GPU node has to call mark_*_in_use on its handles every frame, otherwise the pool’s frame-based sweep will reclaim them. I knew this. I wrote the contract. The InstancedMesh node’s process() method dutifully called mark_mesh_in_use and mark_buffer_in_use every frame.

Except the InstancedMesh node is stateful, not always_dirty. Which means the evaluator only re-runs its process() when one of its inputs has changed. And the whole point of the content-hash optimization is that the inputs don’t change on steady-state frames. So the evaluator was correctly deciding “nothing’s changed, skip this node,” the mark_in_use calls were never firing, and ~180 frames later the pool’s sweep reclaimed the buffer out from under a node that was very much still alive and still planning to use it.

You can see why this happened. You can also see why it’s a much deeper bug than it looks: the keep-alive work has to happen every frame, but the node that knows about the buffer isn’t running every frame. The two requirements are in direct conflict, and no amount of “just remember to mark in use” discipline will fix it, because the node isn’t even in the graph’s evaluation list on the frames it matters.

The fix: move the keep-alive to the consumer, not the producer. RenderScene is .always_dirty() (it has to be, it runs the render pass every frame). It sees every DrawItem in its cached scene. Every frame, before dispatching the render op, it now walks the draw list and calls mark_mesh_in_use and mark_buffer_in_use on every handle it finds:

for draw in &self.cached_scene.draws {
    if !draw.mesh.is_invalid() {
        ctx.mark_mesh_in_use(draw.mesh);
    }
    if let Some((buf, _)) = draw.instances {
        if !buf.is_invalid() {
            ctx.mark_buffer_in_use(buf);
        }
    }
}

The consumer knows the draw list. The consumer always runs. The consumer owns the keep-alive, and the producer can sleep peacefully on frames where its inputs haven’t changed, confident that its downstream friend will vouch for the buffer’s continued relevance.
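The whole failure, and the fix, fit in a toy simulation. This is a sketch with invented names: it tracks one buffer, treats the producer as running only on frames where its input changed, and uses the 180-frame idle threshold mentioned earlier.

```rust
/// Idle threshold before the sweep reclaims a buffer.
pub const MAX_IDLE_FRAMES: u64 = 180;

/// Simulate `frames` frames for a single buffer and return the frame on which
/// the sweep reclaims it, or None if it survives.
///
/// - The producer marks the buffer only on frames listed in `input_changed_on`,
///   because the evaluator skips it whenever its inputs are unchanged.
/// - The consumer is always dirty, so if `consumer_marks` is true it marks the
///   buffer on every frame.
pub fn first_reclaimed_frame(
    frames: u64,
    input_changed_on: &[u64],
    consumer_marks: bool,
) -> Option<u64> {
    // Frame 0: the producer ran once, uploaded, and marked the buffer.
    let mut last_marked: u64 = 0;

    for frame in 1..=frames {
        if input_changed_on.contains(&frame) {
            last_marked = frame; // producer re-ran this frame
        }
        if consumer_marks {
            last_marked = frame; // terminal node vouches for the handle
        }
        // End-of-frame sweep.
        if frame - last_marked > MAX_IDLE_FRAMES {
            return Some(frame);
        }
    }
    None
}
```

With producer-only marking and static input, the buffer goes on frame 181, which at 60 frames per second is the three-second delay I saw. With the consumer marking, it never goes.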

This generalizes. Every resource handle that flows across a wire and feeds into an always_dirty terminal should have its keep-alive owned by the terminal, not the source. The source might be a cached shortcut; the terminal is the one the evaluator is guaranteed to visit. I’ve added this as a rule to the same lux-core::mesh module doc that carries the lifecycle contract. Future 3D producers are welcome to stop caring about mark_in_use entirely.

What it feels like

The demo graph is now Box → InstancedMesh(GridTransforms3D(100, 100)) → RenderScene. 10,000 cubes. One draw call. Per frame, the CPU does little more than hash 640 KB of unchanged transforms. A sine-wave radial displacement because I couldn’t resist. I orbit the camera around it and the whole field breathes.

[Figure: A 100 by 100 grid of Lambert-shaded cubes under a sine wave]

I left it running for twenty minutes to make sure the stale-handle bug was actually gone. The cubes stayed.

Next post is the quiet one. The compute shader infrastructure that’s going to let the particle system do exactly this trick, but with the per-instance positions coming out of a physics simulation instead of a hashed spread. No visible output. Just the reason the post after it can ship.