Post-FX: Bloom, TAA, CAS, and Film Grain

Two posts ago the pipeline learned how to hold values above 1.0. One post ago it learned how to compose passes. This post is what those two things were for. Four post-process passes, each one opt-in, each one depending on the HDR buffer being real and the framegraph being there to route it.

The headline: the 28 existing 3D golden reference images still match byte-for-byte. Every new feature defaults off. You turn them on per-scene, on a RenderScene pin, and the framegraph adds or removes passes from the chain.

Jimenez bloom

The first bloom I shipped was the one in the extended texture processing post: extract bright pixels, box-blur them, combine. Three passes, produces a glow, works well enough at low intensities, looks like a programmer’s bloom at high ones.

The specific failure modes of a naïve bloom are hard to un-see once you know them. Fireflies: a single bright pixel blooms into a bright square blur, not a smooth fall-off. Halos: wide blur radii smear visible rings around bright objects. Hue shifts: a box blur done in the wrong colour space tints the blur yellow or cyan depending on which channels clip first. All three make bloom look fake. All three come from the same root: box blur doesn’t know which neighbours are worth averaging and which ones are dominated by a single outlier.

Jorge Jimenez’s SIGGRAPH 2014 progressive bloom is the modern standard. Lux now runs it. It has three parts:

Soft-knee prefilter. Instead of a hard threshold (“if luminance > T, pass; else zero”), a soft-knee curve smoothly transitions from “fully attenuated” to “fully passed” over a narrow band around the threshold. Hard thresholds pop in and out as a pixel’s luminance wiggles around T. Soft knees don’t.
Karis-weighted downsample. A 13-tap filter organised into five 2×2 boxes. Each box gets a Karis weight (1 / (1 + luminance)) before averaging, which clamps the contribution of any one very bright pixel. This is the firefly killer. A single HDR 100 pixel no longer dominates its neighbourhood; its Karis weight drops it back to the same order as its neighbours.
Tent upsample with additive blend. 9-tap tent filter, additively blended into a fresh transient each iteration. WGSL doesn’t allow read+write of the same storage texture, so each upsample pass writes to a new resource — the framegraph aliases them down to ~2-3 physical slots across the 13-pass chain.
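To make the soft knee concrete, here is a minimal Python sketch of the quadratic knee curve that a lot of bloom implementations use. The function names and the threshold and knee values are made up for illustration, and the real Lux prefilter is a WGSL shader whose exact curve may differ. The point is only to show why a knee doesn't pop where a hard threshold does.

```python
def hard_threshold_weight(luma, threshold=1.0):
    """Naive prefilter: a pixel either passes completely or not at all."""
    return 1.0 if luma > threshold else 0.0


def soft_knee_weight(luma, threshold=1.0, knee=0.5):
    """Fraction of a pixel's colour that survives a quadratic soft knee.

    `threshold` and `knee` are illustrative values, not Lux defaults.
    """
    # Quadratic ramp across the band [threshold - knee, threshold + knee].
    soft = min(max(luma - threshold + knee, 0.0), 2.0 * knee)
    soft = soft * soft / (4.0 * knee + 1e-6)
    # Above the band the curve continues as a plain (luma - threshold) cut.
    return max(soft, luma - threshold) / max(luma, 1e-6)


if __name__ == "__main__":
    for luma in (0.4, 0.99, 1.01, 1.5, 100.0):
        print(luma, hard_threshold_weight(luma), round(soft_knee_weight(luma), 4))
```

Across the threshold the hard version jumps from 0 to 1, while the knee version moves by less than a percent, which is the difference between a highlight that flickers and one that fades in.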

The whole thing runs on the HDR buffer before tonemap, which means the bloom’s contribution enters the ACES curve as additional radiance and gets tone-mapped with the rest of the image. Not composited on top of a tonemapped image, which is what a lot of engines do and why their bloom doesn’t feel like it’s physically part of the scene.

The quality test is the one I care about most: a 128×128 image with one pixel at HDR 100 and the rest at 0.1. The old box-blur bloom would flare that one pixel into a bright square halo with a peak luminance above 10.0 on the surrounding ring. The Jimenez chain keeps the peak on that ring below 10.0. No flare. Just a soft glow that respects the shape of the bright region even when that “region” is a single pixel.
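The firefly behaviour is easy to reproduce on paper. Below is a small Python sketch of a Karis-weighted average over five box colours, one of which has been contaminated by an HDR 100 pixel. It assumes equal base weights for the five boxes and Rec. 709 luminance, both chosen for illustration; the real downsample is a 13-tap WGSL filter and may weight its boxes differently.

```python
def luminance(rgb):
    """Rec. 709 luminance of a linear RGB triple (an assumed choice)."""
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def plain_average(colours):
    """Unweighted mean, which is what a box blur effectively does."""
    n = len(colours)
    return tuple(sum(c[i] for c in colours) / n for i in range(3))


def karis_average(colours):
    """Mean where each input is weighted by 1 / (1 + luminance)."""
    weights = [1.0 / (1.0 + luminance(c)) for c in colours]
    total = sum(weights)
    return tuple(
        sum(w * c[i] for w, c in zip(weights, colours)) / total
        for i in range(3)
    )


# One 2x2 box holding a single HDR 100 pixel and three 0.1 pixels averages
# to 25.075; the other four boxes only see the 0.1 background.
boxes = [(25.075, 25.075, 25.075)] + [(0.1, 0.1, 0.1)] * 4

if __name__ == "__main__":
    print("plain:", plain_average(boxes))
    print("karis:", karis_average(boxes))
```

With these inputs the plain average lands at about 5.1 and the Karis average at about 0.36, so the outlier ends up on the same order as its neighbours instead of dominating them.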

The plugin-path Bloom node (for applying bloom to 2D textures, video, particles) got rewritten to use the same pipeline via a run_shader_with_format helper that accepts an HDR intermediate format. Same shader, two entry points: one for scenes, one for arbitrary textures.

TAA

Temporal Anti-Aliasing is the technique that finally got me to understand why every game engine has one. Static-camera shots with no TAA have crawling aliasing on every geometric edge. Move a sphere a sub-pixel amount between frames and the pixel that was 60% covered becomes 63%, which becomes 67%, and the edge wobbles. TAA averages across time to get sub-pixel coverage without needing 16× MSAA.

I took the simplest version that works: depth-only reprojection. Each frame jitters the camera projection by a sub-pixel amount following the Halton(2, 3) sequence, with a 16-frame wrap. The TAA pass:

Reads the current frame’s depth.
Reconstructs world position from depth + current inverse view-projection.
Reprojects through the previous frame’s view-projection to get the UVs the pixel had last frame.
Samples the history buffer at those UVs.
Clamps the history sample to the 3×3 YCoCg neighbourhood bounding box of the current frame (this is what rejects ghosts when geometry moves).
Lerps mix(history_clamped, current, alpha) for the output, writes that to both the display and the new history buffer.
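The jitter that feeds those steps is the easiest part to show. Here is a minimal Python sketch of a Halton(2, 3) sequence with a 16-frame wrap. The function names are hypothetical, and centring the offsets on zero in pixel units is my assumption about how the jitter is applied.

```python
def halton(index, base):
    """Radical inverse of `index` in `base`; `index` starts at 1."""
    result = 0.0
    fraction = 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result


def taa_jitter(frame, wrap=16):
    """Sub-pixel (x, y) offset in [-0.5, 0.5) for a given frame number."""
    i = (frame % wrap) + 1  # skip index 0, which would always give 0.0
    return halton(i, 2) - 0.5, halton(i, 3) - 0.5


if __name__ == "__main__":
    for frame in range(4):
        print(frame, taa_jitter(frame))
```

The offsets are in pixels, so they still have to be scaled by the render resolution before they are folded into the projection matrix.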

The neighbourhood clamp is the load-bearing piece. Without it, reprojection from moving geometry would pull in stale pixels and leave trails. With it, stale samples get rejected whenever they fall outside the plausible colour range of the current neighbourhood, at the cost of one or two ghost frames while the history re-converges.
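Here is a minimal Python sketch of the clamp and blend for one pixel, assuming the usual YCoCg transform and a hypothetical alpha of 0.1 (the real blend factor isn't stated here). It takes the reprojected history sample, the nine current-frame neighbours, and the current pixel, and returns the resolved colour.

```python
def rgb_to_ycocg(rgb):
    r, g, b = rgb
    return (0.25 * r + 0.5 * g + 0.25 * b,
            0.5 * r - 0.5 * b,
            -0.25 * r + 0.5 * g - 0.25 * b)


def ycocg_to_rgb(ycc):
    y, co, cg = ycc
    return (y + co - cg, y + cg, y - co - cg)


def resolve_taa(history_rgb, neighbourhood_rgb, current_rgb, alpha=0.1):
    """Clamp history to the neighbourhood's YCoCg box, then blend.

    `alpha` is an illustrative value, not the one Lux uses.
    """
    box = [rgb_to_ycocg(p) for p in neighbourhood_rgb]
    lo = [min(p[c] for p in box) for c in range(3)]
    hi = [max(p[c] for p in box) for c in range(3)]
    h = rgb_to_ycocg(history_rgb)
    clamped = ycocg_to_rgb(
        tuple(min(max(h[c], lo[c]), hi[c]) for c in range(3))
    )
    # Same as mix(history_clamped, current, alpha) in shader terms.
    return tuple(hc + (cc - hc) * alpha for hc, cc in zip(clamped, current_rgb))


if __name__ == "__main__":
    grey = [(0.2, 0.2, 0.2)] * 9
    # A stale bright-red history sample over a grey neighbourhood.
    print(resolve_taa((1.0, 0.0, 0.0), grey, (0.2, 0.2, 0.2)))
```

A stale red history sample over a grey neighbourhood gets pulled all the way back to grey, which is the ghost rejection in miniature. A history sample that already sits inside the box passes through untouched and only picks up the alpha blend.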

Full TAA (the kind every AAA engine ships) uses a multi-render-target mesh pass that writes motion vectors per pixel, so you can reproject even on moving geometry with pixel-accurate velocity. I didn’t ship that because adding an MRT to the mesh pass would cache-thrash every other render and the gain was mostly for camera-static scenes with moving objects — a case I’ll optimize when the first real installation patch needs it. Depth-only reprojection is the 80% solution. Possibly the 70% solution. Who’s measuring.

The convergence test loads a static scene, runs 30 frames of TAA, and measures the RMS pixel difference between frame 5 and frame 30. A static scene converges to a stable image within ~10 frames.

CAS

AMD FidelityFX Contrast Adaptive Sharpening. One fullscreen fragment pass, runs after tonemap in sRGB display space (where the CAS formula is defined), preserves the Rgba8UnormSrgb contract.

Why CAS after TAA? TAA softens the image as a side effect of temporal averaging. Without a sharpening pass afterward, a TAA-enabled render looks slightly muddier than a TAA-disabled one. CAS exists specifically to restore edge crispness without overshooting.

AMD’s CAS is clever for three reasons. First, it’s contrast-adaptive: the sharpening weight is computed per pixel from the local min and max, so a flat region comes out essentially unchanged (so noise doesn’t amplify) and an edge that is already high-contrast gets less of a push than a soft one. Second, it clamps amplification against the local pixel box (so you can’t shoot past the max/min of the 3×3 neighbourhood). Third, it’s cheap: about 8 samples per pixel, no multi-pass setup.

The sharpen strength is a pin (default 0.5 when enabled, 0.0 when disabled). The test suite verifies step-edge sharpening produces a measurably larger gradient on the sRGB-encoded output (~1.02× on the clamp-limited case; tighter edges show more). Flat areas stay within 3 LSB of their input, which is the noise-floor tolerance for an 8-bit path.
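For a feel of how those two properties fall out of the maths, here is a single-channel Python sketch. The weight follows the cross-pattern form of the published CAS formula as best I can reproduce it, and the final clamp to the 3×3 box is my reading of the description above, so treat both as assumptions and not as the Lux shader.

```python
import math


def cas_pixel(img, x, y, sharpness=0.5):
    """Sharpen one interior pixel of a 2D list of floats in [0, 1]."""
    e = img[y][x]
    b, h = img[y - 1][x], img[y + 1][x]
    d, f = img[y][x - 1], img[y][x + 1]
    lo = min(b, d, e, f, h)
    hi = max(b, d, e, f, h)
    # Adaptive amount: how much headroom is left before the result clips.
    amp = min(max(min(lo, 1.0 - hi) / max(hi, 1e-6), 0.0), 1.0)
    amp = math.sqrt(amp)
    # Negative lobe weight, from -1/8 (sharpness 0) to -1/5 (sharpness 1).
    w = amp * (-1.0 / (8.0 + (5.0 - 8.0) * sharpness))
    sharpened = (w * (b + d + f + h) + e) / (1.0 + 4.0 * w)
    # Assumed extra step: never leave the range of the 3x3 neighbourhood.
    box = [img[y + j][x + i] for j in (-1, 0, 1) for i in (-1, 0, 1)]
    return min(max(sharpened, min(box)), max(box))


if __name__ == "__main__":
    soft_edge = [[0.2, 0.2, 0.25, 0.5, 0.75, 0.8, 0.8] for _ in range(3)]
    print([round(cas_pixel(soft_edge, x, 1), 4) for x in range(1, 6)])
```

A flat patch comes back unchanged because the four neighbours cancel against the centre. A soft edge gets steeper. A hard step is already sitting on the box limits, so in this sketch the clamp leaves it alone, which lines up with the small gain on the clamp-limited test case.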

Film grain

The cheapest of the four by far. A fullscreen fragment pass that adds per-pixel hash noise to the output. Wang hash over (x, y, frame). Values symmetric around zero, so the average brightness of the image is preserved regardless of grain intensity.

The reason to ship grain at all is that every image the Lux pipeline produces is too clean. Digitally perfect. Zero noise, zero flicker, zero texture in the flat regions. Real footage has grain. Real photography has grain. A scene that’s supposed to feel cinematic needs grain, or it reads as CGI.

Four pins: enabled, intensity (default 0.05 when enabled), size (default 1.0, larger values make chunkier grain), chromatic (default 0.3, how much per-channel noise diverges vs. being monochrome). The default intensity is deliberately subtle — grain is the kind of effect that looks great at 0.05 and terrible at 0.5.
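The noise source itself fits in a few lines. This Python sketch uses the standard 32-bit Wang hash and covers only the monochrome intensity pin. How x, y, and frame are folded into one seed is my own choice for illustration, and the size and chromatic pins are left out.

```python
MASK32 = 0xFFFFFFFF


def wang_hash(seed):
    """Thomas Wang's 32-bit integer hash, with explicit wrap-around."""
    seed &= MASK32
    seed = (seed ^ 61) ^ (seed >> 16)
    seed = (seed * 9) & MASK32
    seed ^= seed >> 4
    seed = (seed * 0x27D4EB2D) & MASK32
    seed ^= seed >> 15
    return seed


def grain(x, y, frame, intensity=0.05):
    """Signed noise in [-intensity, intensity) for one pixel and frame."""
    # Nesting the hash to combine coordinates is an assumption.
    h = wang_hash(x ^ wang_hash(y ^ wang_hash(frame)))
    return (h / 2**32 - 0.5) * 2.0 * intensity


if __name__ == "__main__":
    values = [grain(x, y, 0) for y in range(64) for x in range(64)]
    print("mean offset:", sum(values) / len(values))
```

Because the values are spread evenly on either side of zero, the mean offset over a 64×64 tile comes out close to zero, which is the "average brightness is preserved" property.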

Fragment shader, not compute, because WebGPU doesn’t allow sRGB formats to be bound as storage textures, so a compute shader can’t write to one, and the downstream blit expects Rgba8UnormSrgb. The fragment path preserves that contract via a standard colour attachment, which is why it runs on its own fragment pipeline instead of the compute-pass infrastructure from the compute buffers post.

The full chain

With everything turned on, RenderScene now runs:

Render3D (HDR)
  → TAA (HDR, requires history + previous VP)
  → SceneBloom (HDR, mip chain)
  → SceneTonemap (HDR → sRGB, folds bloom into ACES curve)
  → CAS (sRGB)
  → FilmGrain (sRGB)
  → color output

The framegraph from last post aliases all the transients down to about 4 physical textures regardless of which post-FX are enabled. Turning bloom off removes three mip-chain textures from the plan at compile time. Turning TAA off removes the history buffer. Turning everything off gets you back to the 2-pass Render3D + Tonemap chain from the HDR post, byte-for-byte identical to the old golden images.

Every pass is a pin on RenderScene: taa_enabled, bloom_enabled, sharpen_enabled, film_grain_enabled. Default all-off. You opt in per scene. The inspector groups them under “Post-FX” because the inspector is the right place to discover these, not a modal dialog somewhere.
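As a toy model of how the pins shape the chain, here is a Python sketch that turns the four flags into an ordered pass list. The `build_chain` function is hypothetical; the real framegraph also plans resources and aliasing, which this ignores.

```python
def build_chain(taa_enabled=False, bloom_enabled=False,
                sharpen_enabled=False, film_grain_enabled=False):
    """Return the ordered pass names for a given set of Post-FX pins."""
    chain = ["Render3D"]
    if taa_enabled:
        chain.append("TAA")
    if bloom_enabled:
        chain.append("SceneBloom")
    chain.append("SceneTonemap")  # always present: HDR -> sRGB
    if sharpen_enabled:
        chain.append("CAS")
    if film_grain_enabled:
        chain.append("FilmGrain")
    return chain


if __name__ == "__main__":
    print(build_chain())
    print(build_chain(taa_enabled=True, sharpen_enabled=True))
```

With every flag left at its default this gives back the two-pass Render3D and SceneTonemap chain, and each pin only ever adds its own pass in a fixed position.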

What it feels like

A scene that was flat and digital before — PBR spheres on a grey plane, lit by a directional light — now looks like a rendered frame. Bloom picks up the specular highlights and softens them into actual glow. TAA kills the edge aliasing on the ring around each sphere. CAS restores the micro-detail TAA stole. Grain gives the flat background a quiet texture that stops it reading as a CGI plane.

This is the post where the 3D render starts looking like the 3D renders people have been trained to expect from the last decade of game engines. Not because any one of these techniques is novel — all four have shipped in a dozen engines — but because for the first time all four are running in Lux, composed by the framegraph, writing against the HDR buffer, and all opt-in per-scene.

The next post is what those shiny spheres are actually made of. Cook-Torrance PBR, metallic/roughness workflow, and the moment the 3D pipeline stopped looking like 2001 and started looking like 2015.

I have no idea what I’m doing or if any of this is right, but it’s fun. Follow along.