The sensationalist statement: On the Xbox (using XNA), if you use the stencil buffer to avoid rendering pixels, the pixel shader still ends up being executed for those hidden pixels!
If you’re using a costly pixel shader (such as one applying a PCF shadow map), it’s beneficial to render only the pixels you need to. As they say, the quickest pixel to render is the one you don’t.
There are a few ways to avoid the pixel shading cost for hidden pixels. The first obvious way is the depth buffer. If the geometry you’re rendering lies behind other geometry, it won’t show up; and, if you’re lucky, the pixel shader won’t be run.
Another way is to use the stencil buffer. Basically you perform an inexpensive rendering pass that marks individual pixels in the current render target in a certain way. Then, when performing your expensive render pass, you set up the stencil state such that pixels marked in a certain way are skipped – thus ideally avoiding the costly pixel shader.
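As a rough sketch of the two-pass idea, here’s what it might look like using D3D9-style effect states in an .fx file (the shader names and stencil reference values are illustrative, not from my actual engine):

```hlsl
// Pass 1: cheaply mark pixels by writing 1 into the stencil buffer.
technique MarkPixels
{
    pass P0
    {
        StencilEnable    = TRUE;
        StencilFunc      = ALWAYS;
        StencilPass      = REPLACE;
        StencilRef       = 1;
        ColorWriteEnable = 0;    // don't touch the color buffer
        VertexShader     = compile vs_3_0 CheapVS();
        PixelShader      = compile ps_3_0 CheapPS();
    }
}

// Pass 2: only shade pixels whose stencil value is still 0.
technique ExpensivePass
{
    pass P0
    {
        StencilEnable = TRUE;
        StencilFunc   = EQUAL;
        StencilRef    = 0;
        VertexShader  = compile vs_3_0 ExpensiveVS();
        PixelShader   = compile ps_3_0 ExpensivePS();
    }
}
```

In XNA you’d typically set the equivalent stencil states from C# rather than in the effect file, but the logic is the same: mark cheaply, then test.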
When the GPU does these tests prior to the pixel shader being run, they are referred to as “early-Z” or “early stencil”. Unfortunately, sometimes the GPU can’t do these tests until after the pixel shader has been run. When this happens, you won’t get any performance benefit! The pixel shader will be run even for pixels that never end up being written to the back buffer.
What can cause “early-Z” and “early stencil” to be turned off? It gets pretty complicated, and probably differs a little between different brands of graphics card. For a very in-depth analysis, have a look at the great series of articles about all this on the “ryg blog”: A trip through the Graphics Pipeline.
In addition to early-Z and early stencil, there is hierarchical Z and hierarchical stencil (“hi-Z”, “hi-stencil”). These ideally allow the GPU to discard large blocks of pixels at once during rasterization, rather than needing to check each pixel individually.
There are a few well-known scenarios that result in hi-Z or early-Z being disabled. One is when you write the DEPTH semantic from your pixel shader. Another is when you’re performing alpha-testing (using the clip() intrinsic in your shader to abandon pixels), commonly done when rendering foliage.
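For reference, an alpha-tested foliage shader of the kind that can disable early-Z looks something like this (the sampler name and 0.5 threshold are just illustrative):

```hlsl
// Alpha-tested foliage: clip() discards pixels whose alpha falls below
// a threshold. Because the shader itself decides whether the pixel
// survives, the GPU may no longer be able to reject it before shading.
sampler DiffuseSampler;

float4 FoliagePS(float2 uv : TEXCOORD0) : COLOR0
{
    float4 c = tex2D(DiffuseSampler, uv);
    clip(c.a - 0.5);   // abandon mostly-transparent pixels
    return c;
}
```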
It gets more interesting. Apparently, on the Xbox, the only forms of early-Z and early-stencil are hi-Z and hi-stencil (at least, this is what I’ve gleaned from information publicly available on the internet). If you do something that turns off hi-Z or hi-stencil, you’ll end up incurring the shading cost for all hidden pixels.
There seems to be a lot of confusing information on the internet about early-Z and early-stencil on the Xbox. I’m sure the exact details are available to certified Xbox developers, but those are under NDA. Those of us using the XNA framework are out of luck. When early-Z is disabled by an alpha-testing pixel shader, for instance, when does it get re-enabled? At the next draw call? Only when we switch to a new render target?
I do a number of “bad things” in my game engine. Rendering alpha-tested foliage is one of them. Another happens during the lighting pass in my deferred renderer. I use the depth information from the G-buffer to recreate the Z-buffer by doing a full-screen pass and outputting the DEPTH semantic from the pixel shader (this is an alternative to re-rendering all the geometry). I do this mainly so I can use bounding volumes for point lights. But do I end up hurting performance? If I’m disabling early-Z by doing this, it might not be giving me any performance benefit.
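The Z-buffer restoration pass I’m describing boils down to something like the following sketch (sampler name is hypothetical, and I’m assuming depth is stored in the red channel of the G-buffer):

```hlsl
// Full-screen pass that rebuilds the Z-buffer from G-buffer depth by
// writing the DEPTH semantic -- the very thing suspected of killing
// early-Z on some hardware.
sampler GBufferDepthSampler;

void RestoreDepthPS(float2 uv    : TEXCOORD0,
                    out float4 color : COLOR0,
                    out float  depth : DEPTH)
{
    color = 0;                                  // color writes can be masked off
    depth = tex2D(GBufferDepthSampler, uv).r;   // replay stored depth into the Z-buffer
}
```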
When I recently added forward-shaded water to my lighting pass, I noticed that I incurred no performance penalty (on the Xbox) when the water was obscured by geometry. So clearly early-Z must still be working – this actually kind of surprised me.
Both the lighting phase and the forward-rendered water use expensive shaders, because they do 9-tap PCF shadow maps. After my lighting phase, I render the water. The water (despite what it looks like above) renders opaque and obscures the geometry rendered by the lighting phase. So theoretically, I don’t need to render any pixels from the lighting phase which end up underwater. I incur a huge performance penalty right now with a screen-full of water, since I’m essentially doing a 9-tap PCF shadow map comparison twice per pixel.
So I decided to do a water “pre-pass” where I set bits in the stencil buffer. Then, the lighting pass could avoid rendering those pixels. This resulted in a decent performance improvement on the PC, but had absolutely no benefit on the Xbox (in fact, it made things worse due to the extra cost incurred by the pre-pass).
No matter what I did, I couldn’t get the stencil buffer to help with avoiding shading obscured pixels on the Xbox. So I decided to code up a little test app to figure out exactly what the deal was, and also how outputting DEPTH from the pixel shader (or alpha-testing) affects early-Z rejection.
The test performs an expensive full-screen pass: the pixel shader makes 25 texture samples, fairly widely-spaced so as to thrash the texture cache.
The test implements 5 different ways to occlude the costly full-screen pass (and hopefully avoid the pixel-shading cost):
- Using the Z-buffer in a regular geometry pass
- Using the Z-buffer in a pass that wrote per-pixel depth
- Using the Z-buffer in a pass that used the clip intrinsic (alpha-testing pass)
- Using the stencil buffer
- Using dynamic branching in the shader
[Table: timing results for each test on the Xbox, PC (GeForce GT 240), PC (GeForce 8500 GT), and Macbook (Intel 3000)]
A couple of interesting numbers are highlighted. The Xbox’s hierarchical Z buffer seems to work very well. But even if we (apparently) disable it by writing per-pixel depth or using a clip() shader, we still get early-Z rejection. Wonderful! This kind of contradicts what I’ve read.
The Xbox doesn’t use early stencil rejection at all. Apparently you have to explicitly enable it, but I haven’t found any way to do so in XNA. You can try setting “HiStencilEnable = true;” in your technique in your shader file, and the effect compiler for the Xbox will at least recognize it, but won’t let you use it (try it out).
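For what it’s worth, this is where you would put that state if the effect compiler allowed it (the technique and pass names here are just placeholders):

```hlsl
technique Expensive
{
    pass P0
    {
        // The Xbox effect compiler recognizes this state, but XNA
        // won't let you use it.
        HiStencilEnable = TRUE;

        VertexShader = compile vs_3_0 FullScreenVS();
        PixelShader  = compile ps_3_0 ExpensivePS();
    }
}
```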
Early stencil buffer rejection appears to be slightly more efficient than per-pixel depth on the PC.
I originally just tried the Z-buffer and stencil buffer, but then realized that a dynamic branch could be used to accomplish the same thing. The dynamic branching works by first rendering an image to another render target, and then using a shader that samples from this image prior to making (and hopefully avoiding) its 25 other samples.
So dynamic branching could definitely be considered an alternative to stencil-rejection on the Xbox. In my test app the times include the additional cost of the preliminary render and render target resolve mentioned in the previous paragraph, so the 3.5ms is a bit inflated. In the scenario I’m trying to address in my game engine, I believe I already have the necessary information needed to bail from my lighting shader early.
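The dynamic-branching variant amounts to something like this sketch (sampler names, the mask threshold, and the Offsets array are illustrative; my actual test shader lives in Expensive.fx in the sample app):

```hlsl
// Sample a mask rendered in a preliminary pass; if the pixel is marked
// as obscured, bail out before doing the 25 expensive texture taps.
sampler MaskSampler;
sampler ExpensiveSampler;
float2  Offsets[25];   // widely-spaced taps to thrash the texture cache

float4 BranchingPS(float2 uv : TEXCOORD0) : COLOR0
{
    float mask = tex2D(MaskSampler, uv).r;

    [branch]
    if (mask > 0.5)
        return 0;      // obscured: skip the expensive work entirely

    float4 sum = 0;
    for (int i = 0; i < 25; i++)
        sum += tex2D(ExpensiveSampler, uv + Offsets[i]);
    return sum / 25;
}
```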
As an aside, it’s interesting to note how much the Xbox is affected by thrashing the texture cache. You don’t see it in the numbers above, but the performance varies greatly depending on how widely-spread my texture samples are in my shader (look for the SCALE constant in Expensive.fx in the sample app). On the PC it still makes a difference, but not nearly so much. You’ll also note that the numbers for running the full 25-tap shader are worse on the Xbox than on any of the PC GPUs listed, even the GeForce 8500, which is definitely a lower-performing part than the Xbox in general.
The test app is available here. If you run it on the Xbox, you’ll need a keyboard plugged in (or the chatpad) to turn the different mechanisms on/off.