Shadow Orchestrator + 1024-Slot Lights

Until last week, the live shadow path on Lux was a hardcoded orthographic projection spanning -10 to +10 on the X and Y axes, rendered into a 1024² depth atlas. The bias was a literal 0.005, written into mesh.wgsl as let bias = 0.005;. The shadow pass dispatched through an orphan helper called dispatch_shadow_framegraph that nobody had a reason to maintain.

The light loop ran against an 8-slot LightBlock uniform. If you put a ninth light in your scene, the ninth was silently dropped. The shader’s per-fragment loop iterated 0..8 with no extension point. There’s a 1024-slot LightStore SSBO in the codebase, with a cluster-bin compute pipeline that bins lights into 8192 spatial cells. It had been there for months. The shader was not reading from it.

The orphan dispatch and the 8-light cap and the 1024² atlas and the 0.005 bias and the ±10 ortho were all from the same era (early 3D, when “make a triangle render” was the bar), and they had been quietly load-bearing ever since. This post is them finally going away.

ShadowOrchestrator

There is now exactly one thing that owns shadow dispatch: ShadowOrchestrator::emit_passes. It owns the cascade-snap history (one round per CSM cascade), the cube round-robin index (for future cube/spot shadows), and the budgets the dispatch boundary needs.

The orchestrator gets called once per Render3D op from dispatch_render3d, which is the live path’s only entry point for shadow work. Hand-rolled add_pass(...) for shadow work outside the orchestrator is a review-block. Duplicating the cascade-snap history in a sibling orchestrator would silently break the round-robin and produce shimmering at the cascade boundaries. Making this a single-owner contract prevents that future from arriving by accident.

The dispatch_shadow_framegraph orphan helper is gone. There’s one helper, it has one caller, the caller is the live render path.

CSM cascades

Cascaded Shadow Maps for directional lights. Four cascades with PSSM splits (the practical split scheme from Parallel-Split Shadow Maps, blending logarithmic and uniform splits with lambda = 0.85), each cascade rendered into its own layer of a 4096² Depth32Float 2D-array atlas. The per-cascade view-projections are texel-snapped (Valient 2012) so sub-texel camera translations don’t make the shadow boundaries swim.
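For reference, the practical split scheme computes each cascade's far distance as a lambda-weighted blend of a logarithmic split and a uniform split. A minimal sketch in Rust, assuming the splits are taken between the camera's near and far planes; the function name is illustrative, not Lux's API:

```rust
/// Far distance of each cascade under the practical split scheme:
/// lambda * logarithmic split + (1 - lambda) * uniform split.
fn pssm_splits(near: f32, far: f32, cascades: usize, lambda: f32) -> Vec<f32> {
    assert!(near > 0.0 && far > near && cascades > 0);
    (1..=cascades)
        .map(|i| {
            let t = i as f32 / cascades as f32;
            // Logarithmic split: constant ratio between successive split planes.
            let log = near * (far / near).powf(t);
            // Uniform split: constant spacing between successive split planes.
            let uni = near + (far - near) * t;
            lambda * log + (1.0 - lambda) * uni
        })
        .collect()
}
```

A higher lambda pulls the near cascades closer to the camera, spending more shadow texels where perspective aliasing is worst.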

The texel snap is the part that took me a while to get right the first time around. Without it, a static scene with a moving camera produces visible shimmer at the horizon, because the cascade’s texel grid shifts under the geometry by sub-texel amounts every frame. The snap rounds the translation of the cascade’s view-projection to a whole number of texels, which keeps the grid stationary relative to the world.
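The snap itself is only a few lines. The sketch below assumes the cascade center has already been transformed into light space, and that each cascade's orthographic extent comes from a fixed bounding-sphere radius so the texel size doesn't change as the camera rotates (a common companion to the snap; whether Lux does exactly this is an assumption). The function names are hypothetical.

```rust
/// World-space size of one shadow-map texel for an orthographic cascade that
/// covers 2 * radius world units across `resolution` texels.
fn texel_size(radius: f32, resolution: u32) -> f32 {
    2.0 * radius / resolution as f32
}

/// Snap a light-space cascade center (x, y on the shadow-map plane) to the
/// nearest whole texel, so the texel grid stays fixed relative to the world.
fn snap_to_texel(center_ls: [f32; 2], texel: f32) -> [f32; 2] {
    [
        (center_ls[0] / texel).round() * texel,
        (center_ls[1] / texel).round() * texel,
    ]
}
```

Nudging the center by a tenth of a texel leaves the snapped result unchanged, which is the property the swim test guards; the real test compares full view-projection matrices rather than just the center.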

There’s a test that asserts this: sub_texel_camera_translation_does_not_swim_cascades translates the camera by 0.1 pixels and re-runs the cascade math, asserting the resulting view-proj matrices match within 1e-2 ULP. If the snap regresses, the test catches it.

The cluster-bin

For everything that isn’t a directional cascade (point lights, spot lights, area lights, IES lights), the live mesh shader now looks up cluster_bins[tile_id] and iterates the LightStore entries it lists. The cluster grid is 16×16×32 (X, Y, log-Z), giving 8192 spatial cells. Each cell holds the indices of the lights whose volumes touch that cell’s slice of the frustum.
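Here is how a pixel might find its cell: screen position picks one of 16×16 tiles, and view-space depth picks one of 32 logarithmic slices. The exact slice formula, the x-major index order, and the names below are assumptions for illustration, not Lux's actual layout.

```rust
const CLUSTERS_X: u32 = 16;
const CLUSTERS_Y: u32 = 16;
const CLUSTERS_Z: u32 = 32;

/// Logarithmic depth slice for a positive view-space depth between near and far.
fn depth_slice(z: f32, near: f32, far: f32) -> u32 {
    let z = z.clamp(near, far);
    // 0.0 at the near plane, 1.0 at the far plane, evenly spaced in log(z).
    let t = (z / near).ln() / (far / near).ln();
    ((t * CLUSTERS_Z as f32) as u32).min(CLUSTERS_Z - 1)
}

/// Flat cluster index for pixel (px, py) on a width x height render target.
fn cluster_index(px: u32, py: u32, width: u32, height: u32, z: f32, near: f32, far: f32) -> u32 {
    let cx = (px * CLUSTERS_X / width).min(CLUSTERS_X - 1);
    let cy = (py * CLUSTERS_Y / height).min(CLUSTERS_Y - 1);
    let cz = depth_slice(z, near, far);
    cx + CLUSTERS_X * (cy + CLUSTERS_Y * cz)
}
```

Log-Z slicing keeps cells roughly cube-shaped in view space, so distant slices aren't absurdly deep compared to near ones.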

The cluster-bin compute pass runs once per frame, after LightStore::flush(). It reads the camera’s view frustum, walks every active light in the store, and writes the light’s index into every cell that the light’s bounding sphere overlaps. The pass is on the GPU compute queue (per the QueueAffinity work from a couple of posts back). It’s hot-path safe: zero device.create_* calls per frame, all buffers pre-allocated and reused.
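At its core, binning is a sphere-versus-box overlap test per light and cell. Below is a brute-force CPU reference sketch, assuming each cell is described by a view-space axis-aligned box; the types and function names are hypothetical, and a GPU pass would spread these loops across threads instead of running them serially.

```rust
#[derive(Clone, Copy)]
struct Aabb {
    min: [f32; 3],
    max: [f32; 3],
}

#[derive(Clone, Copy)]
struct LightBounds {
    center: [f32; 3],
    radius: f32,
}

/// Sphere touches box if the squared distance from the sphere center to the
/// closest point of the box is within radius squared.
fn sphere_overlaps_aabb(s: &LightBounds, b: &Aabb) -> bool {
    let mut d2 = 0.0;
    for i in 0..3 {
        let closest = s.center[i].clamp(b.min[i], b.max[i]);
        let d = s.center[i] - closest;
        d2 += d * d;
    }
    d2 <= s.radius * s.radius
}

/// Append each light's index to the list of every cell its sphere overlaps.
fn bin_lights(cells: &[Aabb], lights: &[LightBounds]) -> Vec<Vec<u32>> {
    let mut bins: Vec<Vec<u32>> = vec![Vec::new(); cells.len()];
    for (li, light) in lights.iter().enumerate() {
        for (ci, cell) in cells.iter().enumerate() {
            if sphere_overlaps_aabb(light, cell) {
                bins[ci].push(li as u32);
            }
        }
    }
    bins
}
```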

The shader’s per-fragment light loop reads cluster_bins[tile_id_for_this_pixel], walks the index list, and accumulates shading from each light it points at. There’s no fixed cap. A scene with 1000 lights binned into the cluster grid will iterate exactly the lights that touch the current tile, which is typically 4 to 8, occasionally 12 to 16, never 1000.
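Variable-length per-cell lists are usually flattened before the shader sees them: one (offset, count) range per cell plus a single index array. The sketch below assumes that layout (the post doesn't describe the exact cluster_bins format), and the names are hypothetical. It shows why per-fragment cost tracks the local light count rather than the size of the store.

```rust
/// Hypothetical GPU-friendly layout: one (offset, count) pair per cell plus a
/// single flat array of light indices.
struct ClusterLists {
    ranges: Vec<(u32, u32)>, // (offset into `indices`, number of lights)
    indices: Vec<u32>,
}

/// Flatten per-cell light lists into ranges over one shared index array.
fn compact(bins: &[Vec<u32>]) -> ClusterLists {
    let mut ranges = Vec::with_capacity(bins.len());
    let mut indices = Vec::new();
    for bin in bins {
        ranges.push((indices.len() as u32, bin.len() as u32));
        indices.extend_from_slice(bin);
    }
    ClusterLists { ranges, indices }
}

/// Stand-in for the per-fragment loop: visit only the lights binned into this
/// cell and accumulate a per-light contribution.
fn shade_cluster(lists: &ClusterLists, cell: usize, light_intensity: &[f32]) -> f32 {
    let (offset, count) = lists.ranges[cell];
    let mut total = 0.0;
    for k in offset..offset + count {
        total += light_intensity[lists.indices[k as usize] as usize];
    }
    total
}
```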

The 8-slot LightBlock uniform that used to hold the lights is deleted. So is pack_light_block. So is LIGHT_BLOCK_SIZE. So is LIGHT_TYPE_NONE in non-test code (it survives in test assertions to gate against the regression). Grep tests in CI verify the deletions stick.

The light type bug

Here’s the ugly one. The LightBlock uniform packed each light into a 16-byte block whose first u32 was the light type. The shader’s loop checked if type == LIGHT_TYPE_NONE { continue; } to skip empty slots. Light types 0 through 4 (None, Directional, Point, Spot, Ambient) had been wired correctly. Their values matched the enum discriminant.

Light types 5 through 9 (AreaSphere, AreaRect, AreaTube, AreaDisk, IES) had not. The packing code packed them as LIGHT_TYPE_NONE = 0, because the developer who wrote it (me) had stopped at “wire up the basics” and never came back to add the rest. So every area light and every IES light in every scene was silently dropped by the shader’s NONE guard before it ever reached the shading code.

You could put 200 IES streetlamps in a scene, and the shader would render zero of them, and there was no error. The lights were “in the LightStore” in the sense that the CPU code had added them. They just packed as NONE and got skipped.

light_store::from_lux_core_light is the new adapter. Every variant of lux_core::scene::Light produces a real, non-NONE discriminant. Test: from_lux_core_every_variant_has_real_discriminant enumerates all 9 variants and asserts each one packs to its actual type value. Existing scenes that had area lights “in” them now actually render them. A couple of test patches had visual goldens that had been wrong this whole time; those were rebaked.
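A minimal sketch of the adapter's shape, using a payload-free stand-in enum for lux_core::scene::Light (the real variants carry data, and the names SceneLight, light_type, and ALL_LIGHTS are hypothetical). The discriminants follow the numbering above. The useful property is the missing _ => fallback arm: Rust's exhaustiveness check turns a forgotten variant into a compile error instead of a light that silently packs as NONE.

```rust
/// Hypothetical, payload-free stand-in for lux_core::scene::Light.
#[derive(Clone, Copy, Debug, PartialEq)]
enum SceneLight {
    Directional,
    Point,
    Spot,
    Ambient,
    AreaSphere,
    AreaRect,
    AreaTube,
    AreaDisk,
    Ies,
}

const LIGHT_TYPE_NONE: u32 = 0;

/// No `_ =>` arm: adding a variant without a discriminant fails to compile.
fn light_type(light: SceneLight) -> u32 {
    match light {
        SceneLight::Directional => 1,
        SceneLight::Point => 2,
        SceneLight::Spot => 3,
        SceneLight::Ambient => 4,
        SceneLight::AreaSphere => 5,
        SceneLight::AreaRect => 6,
        SceneLight::AreaTube => 7,
        SceneLight::AreaDisk => 8,
        SceneLight::Ies => 9,
    }
}

const ALL_LIGHTS: [SceneLight; 9] = [
    SceneLight::Directional,
    SceneLight::Point,
    SceneLight::Spot,
    SceneLight::Ambient,
    SceneLight::AreaSphere,
    SceneLight::AreaRect,
    SceneLight::AreaTube,
    SceneLight::AreaDisk,
    SceneLight::Ies,
];
```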

Bias, finally

Two separate bias systems gate against shadow acne and Peter-Panning, both replacing the literal let bias = 0.005; that used to be in mesh.wgsl.

The cascade-raster pipeline applies a slope-scaled bias through its depth-bias state: CSM_SLOPE_SCALE = 2.0, CSM_BIAS_CLAMP = 0.01, CSM_CONST_BIAS = -0.0003 (reverse-Z NDC). These values come from a sweep against the test-patch suite. Go smaller and you get acne on flat surfaces. Go larger and you get visible Peter-Panning at object boundaries.
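The fixed-function depth bias that Vulkan and D3D describe is a constant term plus a slope-scaled term, then clamped. This sketch evaluates that formula on the CPU with the three constants above, assuming the constant is already expressed in depth units as quoted (graphics APIs typically take it in units of the minimum resolvable depth difference instead). The max_slope input stands for the triangle's maximum screen-space depth slope; the function name is hypothetical.

```rust
const CSM_SLOPE_SCALE: f32 = 2.0;
const CSM_BIAS_CLAMP: f32 = 0.01;
const CSM_CONST_BIAS: f32 = -0.0003;

/// Constant + slope-scaled depth bias, clamped with the Vulkan-style rule:
/// a positive clamp caps the bias from above, a negative clamp from below,
/// and a zero clamp disables clamping.
fn depth_bias(max_slope: f32, constant: f32, slope_scale: f32, clamp: f32) -> f32 {
    let bias = constant + slope_scale * max_slope;
    if clamp > 0.0 {
        bias.min(clamp)
    } else if clamp < 0.0 {
        bias.max(clamp)
    } else {
        bias
    }
}
```

The clamp is what keeps steep, nearly edge-on casters from receiving a huge offset, which is where Peter-Panning would otherwise come from.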

The receiver-side sampling bias is computed in the fragment shader: 0.002 + 0.004 * max(1 - n_dot_l, 0). That is a constant depth epsilon plus a component that grows as the surface turns away from the light. It’s algebraically equivalent to the canonical shadow_bias() helper in shadows.wgsl, which I should be calling instead of inlining. (Future cleanup. The math is correct.)
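Written out as a function, the expression looks like this. The name receiver_bias is hypothetical; the arithmetic is exactly the expression quoted above.

```rust
/// Receiver-side shadow bias: depth epsilon plus a term that grows as the
/// surface turns away from the light (n_dot_l falling toward 0).
fn receiver_bias(n_dot_l: f32) -> f32 {
    0.002 + 0.004 * (1.0 - n_dot_l).max(0.0)
}
```

A surface facing the light gets 0.002, and one edge-on to it (n·l = 0) gets 0.006. The max(..., 0) guard only takes effect when n_dot_l exceeds 1, so it mostly protects against numerical overshoot; for back-facing normals the bias keeps growing unless n_dot_l is clamped upstream.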

A new hardcoded scalar bias would be a review-block. Slope-scaled or nothing.

The pixel suite

Four tests verify that the shadows actually look right, beyond just “the shader compiled”:

Cascade-seam ΔE76: a 16-sample line crossing the cascade boundary asserts pairwise ΔE76 < 2 (no visible seam).
PCSS penumbra widens with occluder height: five occluder heights produce penumbra widths that monotonically increase with at least a 1.5× ratio between min and max.
No shadow acne on a flat plane under a directional caster: 4096-sample average bias < 4/255, stddev < 2/255, zero outlier pixels.
No Peter-Panning on a grazing-angle box: the gap between the box’s base and its attached shadow is ≤ 4 pixels (slack from the 3×3 PCF kernel).
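Two of these gates reduce to simple arithmetic on measured samples. ΔE76 is the Euclidean distance between two CIELAB colors. This sketch assumes “pairwise” means adjacent samples along the line and “monotonically increase” means strictly increasing; extracting the samples from the rendered image is out of scope, and the function names are hypothetical.

```rust
/// CIE76 color difference: Euclidean distance between two CIELAB colors.
fn delta_e76(a: [f32; 3], b: [f32; 3]) -> f32 {
    ((a[0] - b[0]).powi(2) + (a[1] - b[1]).powi(2) + (a[2] - b[2]).powi(2)).sqrt()
}

/// Seam gate: every adjacent pair of samples along the line differs by < 2 ΔE76.
fn no_visible_seam(samples_lab: &[[f32; 3]]) -> bool {
    samples_lab.windows(2).all(|w| delta_e76(w[0], w[1]) < 2.0)
}

/// Penumbra gate: widths strictly increase with occluder height, and the widest
/// is at least 1.5x the narrowest.
fn penumbra_widens(widths: &[f32]) -> bool {
    let increasing = widths.windows(2).all(|w| w[1] > w[0]);
    match (widths.first(), widths.last()) {
        (Some(&lo), Some(&hi)) if lo > 0.0 => increasing && hi / lo >= 1.5,
        _ => false,
    }
}
```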

All four green. Revert the pipeline to the old bias and one of them fails.

Three new patches

phase2_1000_lights is a stadium with 1000 point lights binned into the cluster grid. Renders correctly. The cluster-bin compute pass touches every light, and the per-fragment loop iterates only the local subset.

night_club_200_lights_shadows is 200 dynamic point lights plus a directional caster with CSM. Both subsystems active. SSIM 1.0000 against the reference.

damaged_helmet_with_csm is the PBR test mesh under directional light with cascades. Stand-in sphere again, because the .glb loader is still pending. Gate is self-consistency at SSIM ≥ 0.95 and ΔE76 mean ≤ 5. When the loader lands, the patch swaps the sphere for the real helmet in a one-line node-input change.

What’s still deferred

The cube round-robin scaffolding is in place: ShadowOrchestrator.cube_budget and cube_rotation_slot advance per frame. There’s no live cube allocator yet, because no cube/spot raster pass exists yet on the live path. sample_shadow_for_light returns 1.0 for Point and Spot light types (no shadow). The orchestrator’s machinery is ready for the consumer; the consumer arrives in a later post.
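The round-robin idea in isolation: refresh at most a fixed budget of cube maps per frame and resume where the previous frame stopped, so every cube light is revisited within ceil(lights / budget) frames. This is a hypothetical sketch; its fields only mirror the roles of cube_budget and cube_rotation_slot, and the orchestrator's real bookkeeping may differ.

```rust
/// Hypothetical round-robin scheduler for cube shadow refreshes.
struct CubeRoundRobin {
    budget: usize,        // max cube maps re-rendered per frame
    rotation_slot: usize, // first light to refresh next frame
}

impl CubeRoundRobin {
    /// Light indices to re-render this frame; advances the slot for the next one.
    fn next_frame(&mut self, cube_lights: usize) -> Vec<usize> {
        if cube_lights == 0 {
            return Vec::new();
        }
        let n = self.budget.min(cube_lights);
        let start = self.rotation_slot % cube_lights;
        let picked: Vec<usize> = (0..n).map(|k| (start + k) % cube_lights).collect();
        self.rotation_slot = (start + n) % cube_lights;
        picked
    }
}
```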

VSM page-marking is similar. The orchestrator declares the pass on the framegraph with correct queue affinities and dependency edges. The pass body is a stub. It lands when virtual shadow mapping goes live, alongside the bindless mesh path.

What this unblocks

The live shadow path no longer carries a hardcoded ortho extent or a magic-number bias. The 8-light cap is gone. The 1024-slot LightStore is iterated per-fragment via cluster bins. Every light type packs as its real discriminant. The cluster-bin pass is running. Three new merge-blocking goldens passed. Four pixel-level shadow correctness tests passed.

The next post is the meshlet path’s status update. The cull-ratio gates closed in this round. The bindless mesh path is queued for the rest. The bindless shader’s been authored for a while. It’s also been unread for a while. The post explains why.

I have no idea what I’m doing or if any of this is right, but it’s fun. Follow along.