The sensationalist statement: On the Xbox (using XNA), if you use the stencil buffer to avoid rendering pixels, the pixel shader still ends up being executed for those hidden pixels!
If you’re using a costly pixel shader (such as one applying a PCF shadow map), it’s beneficial to render only the pixels you need to. As they say, the quickest pixel to render is the one you don’t.
There are a few ways to avoid the pixel shading cost for hidden pixels. The first obvious way is the depth buffer. If the geometry you’re rendering lies behind other geometry, it won’t show up; and, if you’re lucky, the pixel shader won’t be run.
Another way is to use the stencil buffer. Basically you perform an inexpensive rendering pass that marks individual pixels in the current render target in a certain way. Then, when performing your expensive render pass, you set up the stencil state such that pixels marked in a certain way are skipped – thus ideally avoiding the costly pixel shader.
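As a rough sketch of the two-pass idea, here’s what it might look like using D3D9-style effect states in an .fx file (the shader names and stencil reference values are illustrative, not from my actual engine):

```hlsl
// Pass 1: cheaply mark pixels by writing 1 into the stencil buffer.
technique MarkPixels
{
    pass P0
    {
        StencilEnable    = TRUE;
        StencilFunc      = ALWAYS;
        StencilPass      = REPLACE;
        StencilRef       = 1;
        ColorWriteEnable = 0;    // don't touch the color buffer
        VertexShader     = compile vs_3_0 CheapVS();
        PixelShader      = compile ps_3_0 CheapPS();
    }
}

// Pass 2: only shade pixels whose stencil value is still 0.
technique ExpensivePass
{
    pass P0
    {
        StencilEnable = TRUE;
        StencilFunc   = EQUAL;
        StencilRef    = 0;
        VertexShader  = compile vs_3_0 ExpensiveVS();
        PixelShader   = compile ps_3_0 ExpensivePS();
    }
}
```

In XNA you’d typically set the equivalent stencil states from C# rather than in the effect file, but the logic is the same: mark cheaply, then test.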
When the GPU does these tests prior to the pixel shader being run, they are referred to as “early-Z” or “early stencil”. Unfortunately, sometimes the GPU can’t do these tests until after the pixel shader has been run. When this happens, you won’t get any performance benefit! The pixel shader will be run even for pixels that never end up being written to the back buffer.
What can cause “early-Z” and “early stencil” to be turned off? It gets pretty complicated, and probably differs a little between different brands of graphics card. For a very in-depth analysis, have a look at the great series of articles about all this on the “ryg blog”: A trip through the Graphics Pipeline.
In addition to early-Z and early stencil, there is hierarchical Z and hierarchical stencil (“hi-Z”, “hi-stencil”). These ideally allow the GPU to discard large blocks of pixels at once during rasterization, rather than needing to check each pixel individually.
There are a few well-known scenarios that result in hi-Z or early-Z being disabled. One is when you write the DEPTH semantic from your pixel shader. Another is when you’re performing alpha-testing (using the clip() intrinsic in your shader to abandon pixels), commonly done when rendering foliage.
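For reference, an alpha-tested foliage shader of the kind that can disable early-Z looks something like this (the sampler name and 0.5 threshold are just illustrative):

```hlsl
// Alpha-tested foliage: clip() discards pixels whose alpha falls below
// a threshold. Because the shader itself decides whether the pixel
// survives, the GPU may no longer be able to reject it before shading.
sampler DiffuseSampler;

float4 FoliagePS(float2 uv : TEXCOORD0) : COLOR0
{
    float4 c = tex2D(DiffuseSampler, uv);
    clip(c.a - 0.5);   // abandon mostly-transparent pixels
    return c;
}
```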
It gets more interesting. Apparently, on the Xbox, the only forms of early-Z and early-stencil are hi-Z and hi-stencil (at least, this is what I’ve gleaned from information publicly available on the internet). If you do something that turns off hi-Z or hi-stencil, you’ll end up incurring the shading cost for all hidden pixels.
There seems to be a lot of confusing information on the internet about early-Z and early-stencil on the Xbox. I’m sure the exact details are available to certified Xbox developers, but those are under NDA. Those of us using the XNA framework are out of luck. When early-Z is disabled by an alpha-testing pixel shader, for instance, when does it get re-enabled? At the next draw call? Only when we switch to a new render target?
I do a number of “bad things” in my game engine. Rendering alpha-tested foliage is one of them. Another happens during the lighting pass in my deferred renderer. I use the depth information from the G-buffer to recreate the Z-buffer by doing a full-screen pass and outputting the DEPTH semantic from the pixel shader (this is an alternative to re-rendering all the geometry). I do this mainly so I can use bounding volumes for point lights. But do I end up hurting performance? If I’m disabling early-Z by doing this, it might not be giving me any performance benefit.
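The Z-buffer restoration pass I’m describing boils down to something like the following sketch (sampler name is hypothetical, and I’m assuming depth is stored in the red channel of the G-buffer):

```hlsl
// Full-screen pass that rebuilds the Z-buffer from G-buffer depth by
// writing the DEPTH semantic -- the very thing suspected of killing
// early-Z on some hardware.
sampler GBufferDepthSampler;

void RestoreDepthPS(float2 uv    : TEXCOORD0,
                    out float4 color : COLOR0,
                    out float  depth : DEPTH)
{
    color = 0;                                  // color writes can be masked off
    depth = tex2D(GBufferDepthSampler, uv).r;   // replay stored depth into the Z-buffer
}
```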
When I recently added forward-shaded water to my lighting pass, I noticed that I incurred no performance penalty (on the Xbox) when the water was obscured by geometry. So clearly early-Z must still be working – this actually kind of surprised me.
Both the lighting phase and the forward-rendered water use expensive shaders, because they do 9-tap PCF shadow maps. After my lighting phase, I render the water. The water (despite what it looks like above) renders opaque and obscures the geometry rendered by the lighting phase. So theoretically, I don’t need to render any pixels from the lighting phase which end up underwater. I incur a huge performance penalty right now with a screen-full of water, since I’m essentially doing a 9-tap PCF shadow map comparison twice per pixel.
So I decided to do a water “pre-pass” where I set bits in the stencil buffer. Then, the lighting pass could avoid rendering those pixels. This resulted in a decent performance improvement on the PC, but had absolutely no benefit on the Xbox (in fact, it made things worse due to the extra cost incurred by the pre-pass).
No matter what I did, I couldn’t get the stencil buffer to help with avoiding shading obscured pixels on the Xbox. So I decided to code up a little test app to figure out exactly what the deal was, and also how outputting DEPTH from the pixel shader (or alpha-testing) affects early-Z rejection.
The test performs an expensive full-screen pass: the pixel shader makes 25 texture samples, fairly widely-spaced so as to thrash the texture cache.
The test implements 5 different ways to occlude the costly full-screen pass (and hopefully avoid the pixel-shading cost):
- Using the Z-buffer in a regular geometry pass
- Using the Z-buffer in a pass that wrote per-pixel depth
- Using the Z-buffer in a pass that used the clip intrinsic (alpha-testing pass)
- Using the stencil buffer
- Using dynamic branching in the shader
[Table: timing results for each test on the Xbox, PC (GeForce GT 240), PC (GeForce 8500 GT), and Macbook (Intel 3000)]
A couple of interesting numbers are highlighted. The Xbox’s hierarchical Z buffer seems to work very well. But even if we (apparently) disable it by writing per-pixel depth or using a clip() shader, we still get early-Z rejection. Wonderful! This kind of contradicts what I’ve read.
The Xbox doesn’t use early stencil rejection at all. Apparently you have to explicitly enable it, but I haven’t found any way to do so in XNA. You can try setting “HiStencilEnable = true;” in your technique in your shader file, and the effect compiler for the Xbox will at least recognize it, but won’t let you use it (try it out).
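For what it’s worth, this is where you would put that state if the effect compiler allowed it (the technique and pass names here are just placeholders):

```hlsl
technique Expensive
{
    pass P0
    {
        // The Xbox effect compiler recognizes this state, but XNA
        // won't let you use it.
        HiStencilEnable = TRUE;

        VertexShader = compile vs_3_0 FullScreenVS();
        PixelShader  = compile ps_3_0 ExpensivePS();
    }
}
```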
Early stencil buffer rejection appears to be slightly more efficient than per-pixel depth on the PC.
I originally just tried the Z-buffer and stencil buffer, but then realized that a dynamic branch could be used to accomplish the same thing. The dynamic branching works by first rendering an image to another render target, and then using a shader that samples from this image prior to making (and hopefully avoiding) its 25 other samples.
So dynamic branching could definitely be considered an alternative to stencil-rejection on the Xbox. In my test app the times include the additional cost of the preliminary render and render target resolve mentioned in the previous paragraph, so the 3.5ms is a bit inflated. In the scenario I’m trying to address in my game engine, I believe I already have the necessary information needed to bail from my lighting shader early.
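The dynamic-branching variant amounts to something like this sketch (sampler names, the mask threshold, and the Offsets array are illustrative; my actual test shader lives in Expensive.fx in the sample app):

```hlsl
// Sample a mask rendered in a preliminary pass; if the pixel is marked
// as obscured, bail out before doing the 25 expensive texture taps.
sampler MaskSampler;
sampler ExpensiveSampler;
float2  Offsets[25];   // widely-spaced taps to thrash the texture cache

float4 BranchingPS(float2 uv : TEXCOORD0) : COLOR0
{
    float mask = tex2D(MaskSampler, uv).r;

    [branch]
    if (mask > 0.5)
        return 0;      // obscured: skip the expensive work entirely

    float4 sum = 0;
    for (int i = 0; i < 25; i++)
        sum += tex2D(ExpensiveSampler, uv + Offsets[i]);
    return sum / 25;
}
```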
As an aside, it’s interesting to note how much the Xbox is affected by thrashing the texture cache. You don’t see it in the numbers above, but the performance varies greatly depending on how widely-spread my texture samples are in my shader (look for the SCALE constant in Expensive.fx in the sample app). On the PC it still makes a difference, but not nearly so much. You’ll also note that the numbers for running the full 25-tap shader are worse on the Xbox than on any of the PC GPUs listed, even the GeForce 8500, which is definitely a lower-performing part than the Xbox in general.
The test app is available here. If you run it on the Xbox, you’ll need a keyboard plugged in (or the chatpad) to turn the different mechanisms on/off.