The Bindless Arm Goes Live

How much can you put on screen before it stutters? Not how pretty one object looks, but how many of them you can have at once, moving, lit, casting shadows, before the framerate gives. That ceiling is set less by the GPU than by how the renderer talks to it, and for most of Lux’s life the renderer talked the slow way: here is an object, draw it; here is the next, draw it; one CPU-issued call per mesh, all the way down. That holds until a scene has a few thousand objects, and then the CPU spends the whole frame reading out a shopping list and there is no time left to actually render. This post changes how Lux talks to the GPU.

“The Meshlet Path, So Far” was a status post with one line in it worth quoting back:

“The bindless mesh shader is 778 carefully-authored lines of WGSL that the pipeline doesn’t load yet, because the pipeline still loads a 3-line placeholder that draws degenerate triangles.”

The placeholder existed because the real shader, mesh_bindless.wgsl, declares seven bind groups (bindless textures and samplers, the materials SSBO, the draws SSBO, frame uniforms, lighting and cluster bins, the shadow atlas, IBL) and the pipeline only built three. Load the real shader against a three-group layout and wgpu rejects it before it draws a pixel. So the pipeline loaded three lines of WGSL that drew nothing, and 778 carefully-authored lines sat next to it, unread, like a novel nobody opened.

This post is the placeholder dying. It is Passes 2 and 3 of the meshlet contract: bindless WGSL adoption, and the live indirect dispatch. At the end of it, the scene draws in one call.

778 lines finally get loaded

The first move is the boring one. Extend BindlessMeshPipeline’s bind-group layout from three groups to seven, matching the shader’s declared bindings one to one. Four new layout builders, four new getters, a hardware vertex-buffer layout describing the 48-byte interleaved vertex.
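For a sense of what a 48-byte interleaved vertex can look like, here is a minimal sketch. The field split (position, normal, tangent, UV) is an assumption made for illustration; the only number taken from the work itself is the 48-byte stride. The Vertex and Attribute types and the vertex_attributes helper are hypothetical names, not Lux's actual definitions.

```rust
use std::mem::size_of;

/// Hypothetical 48-byte interleaved vertex. The field split is an assumption;
/// only the total stride comes from the post.
#[repr(C)]
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct Vertex {
    pub position: [f32; 3], // 12 bytes, offset 0
    pub normal: [f32; 3],   // 12 bytes, offset 12
    pub tangent: [f32; 4],  // 16 bytes, offset 24
    pub uv: [f32; 2],       //  8 bytes, offset 40
}

/// One attribute of the layout: shader location, byte offset, byte size.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Attribute {
    pub location: u32,
    pub offset: usize,
    pub size: usize,
}

/// Byte distance between consecutive vertices in the buffer.
pub const VERTEX_STRIDE: usize = size_of::<Vertex>();

/// Builds the attribute table by accumulating field sizes in declaration order.
pub fn vertex_attributes() -> [Attribute; 4] {
    let sizes = [
        size_of::<[f32; 3]>(),
        size_of::<[f32; 3]>(),
        size_of::<[f32; 4]>(),
        size_of::<[f32; 2]>(),
    ];
    let mut attrs = [Attribute { location: 0, offset: 0, size: 0 }; 4];
    let mut offset = 0;
    for (i, size) in sizes.iter().enumerate() {
        attrs[i] = Attribute { location: i as u32, offset, size: *size };
        offset += *size;
    }
    attrs
}
```

Because every field is an f32 array and the struct is repr(C), there is no padding, so the last attribute ends exactly at the stride. A real layout would hand these offsets to the graphics API's vertex-buffer description.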

Then PLACEHOLDER_WGSL is deleted and include_str! points at the real mesh_bindless.wgsl. The shader is composed through PrismComposer (the same naga_oil-backed composer the rest of the engine uses) with the TEXTURE_AND_SAMPLER_BINDING_ARRAY capability enabled, so the 4096-slot binding_array is legal. That binding array is the “bindless” part, and it is the part that reaches you: it lets one shader address four thousand textures without the renderer stopping to rebind between materials, which is how a scene can have hundreds of distinct materials without paying for it in draw overhead. The capabilities are additive. No other shader notices.

The meshlet post promised something specific about this moment: “the PLACEHOLDER_WGSL constant gets deleted in the same commit as the include_str! swap. There won’t be a transition period where both are valid.” It was deleted in the same commit. No transition period. The placeholder did not get to retire gracefully.

The orchestrator

A loaded shader is not a live renderer. The shader needs a MeshletPool, a DrawStore, and a MaterialTable populated from the scene every frame, and TextureEngine didn’t own any of them.

BindlessOrchestrator is the new owner. It is a field on TextureEngine, built lazily and only on bindless-capable adapters, and it holds the MeshletPool, the DrawStore, the MaterialTable, and the BindlessMeshPipeline together. Its one job is populate_from_scene: walk the scene’s draw list, give every mesh and material a slot, and produce the per-draw data the indirect path consumes. This is the thing that hands the GPU the whole scene at once instead of feeding it one object at a time.

The slot allocation is the part that has to be right. Mesh slots are keyed by MeshHandle, material slots by Arc<Material> pointer identity. Both are stable across frames: a scene that doesn’t change its meshes and materials gets the same slot assignments on frame 2 as on frame 1, with no churn. That stability keeps the renderer from doing avoidable work every frame, and it is what lets the next piece cache.
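The keying scheme can be sketched in a few lines. This is an illustration under stated assumptions, not the orchestrator's code: SlotTable, its fields, and the stand-in MeshHandle and Material types are all hypothetical. Retaining a clone of each Arc is one possible way to make sure a pointer cannot be reused by a different material while its slot is still assigned.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for the engine's mesh handle (hypothetical definition).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct MeshHandle(pub u32);

/// Stand-in for the engine's material type (hypothetical definition).
#[derive(Debug)]
pub struct Material {
    pub name: String,
}

/// Hands out slots that stay stable for as long as the table lives.
#[derive(Default)]
pub struct SlotTable {
    mesh_slots: HashMap<MeshHandle, u32>,
    // Keyed by the Arc's pointer address, i.e. identity rather than contents.
    material_slots: HashMap<usize, u32>,
    // Retained so an address is never recycled while its slot is live.
    materials: Vec<Arc<Material>>,
}

impl SlotTable {
    /// Returns the existing slot for a handle, or assigns the next free one.
    pub fn mesh_slot(&mut self, handle: MeshHandle) -> u32 {
        let next = self.mesh_slots.len() as u32;
        *self.mesh_slots.entry(handle).or_insert(next)
    }

    /// Two Arcs share a slot only if they point at the same allocation.
    pub fn material_slot(&mut self, material: &Arc<Material>) -> u32 {
        let key = Arc::as_ptr(material) as usize;
        if let Some(&slot) = self.material_slots.get(&key) {
            return slot;
        }
        let slot = self.materials.len() as u32;
        self.materials.push(Arc::clone(material));
        self.material_slots.insert(key, slot);
        slot
    }
}
```

Asking for the same handle or the same Arc on a later frame returns the same slot, while two materials with identical contents but separate allocations get separate slots. This sketch never frees a slot, which is the simple end of the design space.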

One prerequisite fell out along the way. The MeshPool had to start retaining Arc<MeshData> for persistently-uploaded meshes, because the orchestrator needs to read a mesh’s actual vertex arrays to pack them into the meshlet pool, and until now the pool threw the CPU-side data away the instant the GPU upload finished. Very tidy of it. Slightly too tidy.

Seven bind groups, cached

record_via_draw_store needs all seven bind groups every frame. Building seven wgpu::BindGroup objects per frame on the hot path is exactly the kind of allocation the framegraph’s cache invariant exists to forbid, because an allocation every frame is a hitch waiting for the busiest moment to spring it. A per-frame create_bind_group call is a review blocker.

So the orchestrator routes all seven through scene_bind_group_cache. Six of the seven draw on data that changes rarely (lighting, shadows, IBL, the scene’s standing buffers), so from frame two on, six of the seven are cache hits. Only the per-frame uniform group rebuilds, and only when its uniforms actually move. Seven allocations a frame became roughly zero, which is one of the quiet differences between a framerate that holds and one that sags as the scene fills.
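The caching idea can be shown without any GPU types. What follows is a rough sketch, assuming each group carries a version number that changes only when its underlying resources change; BindGroupCache, get_or_build, and the version scheme are hypothetical and do not describe how scene_bind_group_cache is actually keyed. The generic G stands in for wgpu::BindGroup.

```rust
use std::collections::HashMap;

/// Caches one built object per group index, rebuilt only when its version moves.
pub struct BindGroupCache<G> {
    // group index -> (version the cached object was built from, the object)
    entries: HashMap<u32, (u64, G)>,
    /// Number of times the build closure actually ran.
    pub builds: u64,
}

impl<G> BindGroupCache<G> {
    pub fn new() -> Self {
        Self { entries: HashMap::new(), builds: 0 }
    }

    /// Returns the cached object if `version` matches; otherwise rebuilds it.
    pub fn get_or_build(
        &mut self,
        group_index: u32,
        version: u64,
        build: impl FnOnce() -> G,
    ) -> &G {
        let stale = match self.entries.get(&group_index) {
            Some((cached_version, _)) => *cached_version != version,
            None => true,
        };
        if stale {
            self.builds += 1;
            self.entries.insert(group_index, (version, build()));
        }
        &self.entries[&group_index].1
    }
}
```

Run seven groups through this for three frames, with only one group's version changing per frame, and the build closure runs nine times instead of twenty-one: seven on the first frame, then one per frame after that.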

The live arm

With the orchestrator populated and the bind groups built, dispatch_render3d gets its branch. On a tier-3 adapter with the orchestrator present, the live Render3D op emits the meshlet cull pass and then calls BindlessMeshPipeline::record_via_draw_store, which issues a single multi_draw_indexed_indirect_count call: one indirect draw for the entire scene, with the count driven by what survived the cull on the GPU.

Read that last part slowly, because it is the whole point. The GPU culls the scene and the GPU decides how many objects to draw, in the same frame, without a round trip back to the CPU to ask. The CPU stops reading out the shopping list. It hands over the whole scene once, and the GPU draws what is actually visible. That is the move that lets a scene get genuinely dense, thousands of distinct objects, and still come in under the frame budget.
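To make the data flow concrete, here is a CPU-side model of what a cull pass feeding an indirect-count draw does. It is a sketch only: the real work happens in a compute shader, and the Draw struct, the bounding-sphere test, and the use of first_instance to carry the draw index are assumptions for illustration. The five-word argument record follows the standard indexed indirect draw layout.

```rust
/// One indexed indirect draw record: five 32-bit words, 20 bytes.
#[repr(C)]
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct DrawIndexedIndirectArgs {
    pub index_count: u32,
    pub instance_count: u32,
    pub first_index: u32,
    pub base_vertex: i32,
    pub first_instance: u32,
}

/// Hypothetical per-draw record with a bounding sphere for culling.
pub struct Draw {
    pub center: [f32; 3],
    pub radius: f32,
    pub index_count: u32,
    pub first_index: u32,
    pub base_vertex: i32,
}

/// Plane as (normal, d); a point p is inside when dot(normal, p) + d >= 0.
pub type Plane = ([f32; 3], f32);

/// A sphere is kept unless it lies entirely behind some plane.
fn sphere_visible(center: [f32; 3], radius: f32, planes: &[Plane]) -> bool {
    planes.iter().all(|(n, d)| {
        n[0] * center[0] + n[1] * center[1] + n[2] * center[2] + d >= -radius
    })
}

/// Writes one record per surviving draw, packed at the front of `out`,
/// and returns the survivor count that the indirect draw would consume.
pub fn cull_and_compact(
    draws: &[Draw],
    planes: &[Plane],
    out: &mut [DrawIndexedIndirectArgs],
) -> u32 {
    assert!(out.len() >= draws.len(), "output must fit every draw");
    let mut count = 0u32;
    for (draw_id, draw) in draws.iter().enumerate() {
        if !sphere_visible(draw.center, draw.radius, planes) {
            continue;
        }
        out[count as usize] = DrawIndexedIndirectArgs {
            index_count: draw.index_count,
            instance_count: 1,
            first_index: draw.first_index,
            base_vertex: draw.base_vertex,
            // Assumption: the draw index rides along so the shader can look up per-draw data.
            first_instance: draw_id as u32,
        };
        count += 1;
    }
    count
}
```

On the GPU the output slice and the count both live in buffers, and the draw call reads them directly. The CPU never sees how many draws survived.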

On every other tier, and on llvmpipe, the path is unchanged. Bit-identical to before the flip. The bindless arm renders. Live. On real hardware. For the first time.

Two renderers, on purpose

Here is the deliberate part, worth being explicit about because it looks like exactly the thing the debt sweep spent a thousand words arguing against: there are now two mesh renderers in the tree. The bindless arm, and the legacy render_3d::execute.

That is not a tenet violation. It is a safety net, and the thing it protects is your picture.

The only way to know the new fast renderer is correct is to render the same scene both ways and compare, and you cannot do that once you have deleted one of the two ways. Delete the legacy renderer in the same commit that flips the live arm, and if the bindless output is subtly wrong, your scene goes wrong with nothing to diff against and no commit to bisect to. Just a worse picture and a shrug. So the flip and the deletion are deliberately two different posts. This one flips the speed on; the legacy renderer stays, briefly, as the thing the new one gets measured against, so that the day the fast path becomes the only path, it has already been proven to match.
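The comparison itself does not need to be elaborate. A minimal sketch, assuming both renderers can be read back into tightly packed RGBA8 buffers of the same size: diff_rgba8 and DiffReport are hypothetical names, and the tolerance parameter is an assumption for cases where an exact match is too strict.

```rust
/// Summary of how far two frames diverge.
#[derive(Debug, PartialEq, Eq)]
pub struct DiffReport {
    /// Pixels where any channel differs by more than the tolerance.
    pub mismatched_pixels: usize,
    /// Largest single-channel difference seen anywhere in the frame.
    pub max_channel_delta: u8,
}

/// Compares two RGBA8 frames pixel by pixel.
/// Returns None if the buffers differ in length or are not whole pixels.
pub fn diff_rgba8(legacy: &[u8], bindless: &[u8], tolerance: u8) -> Option<DiffReport> {
    if legacy.len() != bindless.len() || legacy.len() % 4 != 0 {
        return None;
    }
    let mut report = DiffReport { mismatched_pixels: 0, max_channel_delta: 0 };
    for (a, b) in legacy.chunks_exact(4).zip(bindless.chunks_exact(4)) {
        // Worst channel difference within this pixel.
        let delta = a
            .iter()
            .zip(b)
            .map(|(x, y)| x.abs_diff(*y))
            .max()
            .unwrap_or(0);
        report.max_channel_delta = report.max_channel_delta.max(delta);
        if delta > tolerance {
            report.mismatched_pixels += 1;
        }
    }
    Some(report)
}
```

A tolerance of zero demands a bit-identical match. A small nonzero tolerance lets through rounding differences while still flagging a missing object or a wrong texture.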

It is not yet perfect, either. The first version of the live arm filled the bindless material-texture array with placeholder data: the geometry was real, the indirect draw was real, the textures were a polite fiction. Making that array carry real per-material textures, and closing a recycle hole in how its slots got freed, were their own follow-up commits. Where this post leaves it: the bindless arm is the live renderer on capable hardware, it draws the scene through one indirect call, and it has a known list of things that are not right yet.

The next post works that list to zero, brings the bindless arm to full feature parity (shadows, skinning, instancing), and then, finally, deletes the legacy renderer. What you get when that lands is a renderer that no longer asks the CPU to introduce your scene one object at a time, which is the difference between a few thousand things on screen and a few hundred thousand.

Still building this in the open. Follow along.