I’ve always assumed that the term “dependent texture read” referred to the case when the texture coordinates used to fetch from one texture were calculated using the results of a previous texture fetch. This is a fairly obvious performance problem, as texture reads can be slower than ALU operations (especially if the texture is not in the cache). Thus the shader may “stall” waiting for the results it needs to proceed.
In a few of my game’s levels, I noticed that the frame rate was below 60 FPS on the iPad 2, and that moving objects’ motion looked chunky. I put some Stopwatch counters around portions of my Update and Draw methods and looked at the values in the debugger (Xamarin Studio). The numbers didn’t really make sense (like 60ms for a single Update cycle), so I assumed this was some artifact of debugging on an iOS device. Then I ported some performance measuring code I had in another project (which displays various metrics in the actual game) and had a look.
The Update and Draw cycles took about 6ms each in the worst case, so the performance bottleneck was not on the CPU (incidentally, this is about 20x slower than on my PC). I also noticed that the Draw cycle was frequently being skipped (XNA/MonoGame does this when the GPU can’t keep up).
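As a rough sanity check on that conclusion, the arithmetic can be written out. This is only a sketch: the 6ms figures are the worst-case measurements above, and the assumption that Update and Draw run back to back on one thread is mine. Note also that the Draw time measured here is the CPU cost of submitting work, not the time the GPU takes to render it.

```python
# Frame-budget check for a 60 FPS target.
TARGET_FPS = 60
frame_budget_ms = 1000.0 / TARGET_FPS  # about 16.67 ms per frame

update_ms = 6.0  # worst-case Update time measured in-game
draw_ms = 6.0    # worst-case Draw time measured in-game

# Assumption: Update and Draw run sequentially on the same thread.
cpu_ms = update_ms + draw_ms
headroom_ms = frame_budget_ms - cpu_ms

print(f"budget {frame_budget_ms:.2f} ms, CPU {cpu_ms:.2f} ms, headroom {headroom_ms:.2f} ms")
```

Even in the worst case the CPU work fits inside a 60 FPS frame with a few milliseconds to spare, which is consistent with the frames being lost somewhere else.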
So my bottleneck was definitely on the GPU. The problems seemed to occur on the levels with lots of “bricks”. So I assumed it was one of three things:
1. Simply too much overdraw (the bricks are drawn over a background)
2. The shader used to draw the bricks is too expensive
3. I was using too much bandwidth sending over the brick vertices every frame (they aren’t drawn from a vertex buffer)
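Making the bricks really small made the perf problem go away. Shrinking the bricks doesn’t change the number of vertices being sent, only the number of pixels being shaded, so that ruled out (3) and suggested that either (1) or (2) was the problem.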
The brick shader is special – it combines two textures: a bricky background and a moss foreground. I subvert the color channel to pass in extra information that allows me to muck around with the texture coordinates I pass to the moss texture fetch. Really, I’m just doing this because all the rendering in this game goes through XNA/MonoGame’s SpriteBatch, which uses a fixed Position/Color/TextureCoordinate vertex format. I am lazy and wanted to avoid creating a new rendering code path – thus this hack.
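To make the color-channel trick a bit more concrete, here is a hypothetical sketch of the idea. The post doesn’t spell out what is actually packed into the color, so storing a 2D moss offset in the red and green channels, and the names `pack_offset` and `unpack_offset`, are assumptions for illustration only.

```python
def pack_offset(u, v):
    """Pack a hypothetical moss offset (u, v in [0, 1]) into 8-bit red and green channels."""
    r = round(u * 255)
    g = round(v * 255)
    return (r, g, 255, 255)  # blue and alpha left at full intensity


def unpack_offset(color):
    """Recover the offset the way a shader would see it: channels normalized to [0, 1]."""
    r, g, _, _ = color
    return (r / 255.0, g / 255.0)


color = pack_offset(0.25, 0.8)
u, v = unpack_offset(color)

# 8-bit channels quantize the values, so the round trip is only approximate.
assert abs(u - 0.25) <= 0.5 / 255
assert abs(v - 0.8) <= 0.5 / 255
```

One cost of a hack like this is precision: each channel only has 256 steps, and the color can no longer be used for tinting.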
But basically my shader does:
1. fetch from brick texture
2. calculate moss coordinates
3. fetch from moss texture
4. multiply the two together
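Those four steps can be modeled on the CPU to show the data flow. This is a sketch, not the actual shader: the scale-and-offset moss calculation, the wrap-around addressing, and the names `sample` and `brick_pixel` are all assumptions, and textures are just nested lists of RGB tuples.

```python
def sample(texture, u, v):
    """Nearest-neighbor fetch from a texture stored as rows of RGB tuples."""
    h, w = len(texture), len(texture[0])
    x = min(int(u * w), w - 1)
    y = min(int(v * h), h - 1)
    return texture[y][x]


def brick_pixel(brick_tex, moss_tex, u, v, moss_scale, moss_offset):
    """CPU model of the brick pixel shader for one pixel at coordinates (u, v)."""
    brick = sample(brick_tex, u, v)                    # fetch from brick texture
    # Hypothetical moss calculation: scale, offset, then wrap into [0, 1).
    mu = (u * moss_scale + moss_offset[0]) % 1.0       # calculate moss coordinates
    mv = (v * moss_scale + moss_offset[1]) % 1.0
    moss = sample(moss_tex, mu, mv)                    # fetch from moss texture
    return tuple(b * m for b, m in zip(brick, moss))   # multiply the two together


brick_tex = [[(1.0, 0.5, 0.5)]]
moss_tex = [[(0.5, 1.0, 0.5), (1.0, 1.0, 1.0)]]
print(brick_pixel(brick_tex, moss_tex, 0.1, 0.1, 2.0, (0.0, 0.0)))
```

In this model the moss fetch uses coordinates computed in the pixel stage, but it does not use the result of the brick fetch.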
I tried removing step 2 from the shader, and suddenly the perf problem went away. It was just a handful of calculations, which should be no big deal (perhaps even hidden by the latency of the brick texture fetch), but it made a big difference performance-wise.
I was confused, so then I tried some of Xcode’s performance measuring tools. They were super easy to use (I was getting results within 30 seconds of opening the tool). One of them spits out potential performance problems.
It listed “dependent texture sampling” for the call that draws the background image. The background image is drawn with a shader very similar to the one used for the bricks: dual texture with some calculations for one of the texture coordinates. Further research showed that “dependent texture read” can also refer to any texture fetch whose coordinates are calculated in the pixel shader. On Apple’s website I found the following:
“Dependent texture reads are supported at no performance cost on OpenGL ES 3.0–capable hardware; on other devices, dependent texture reads can delay loading of texel data, reducing performance. When a shader has no dependent texture reads, the graphics hardware may prefetch texel data before the shader executes, hiding some of the latency of accessing memory.”
The iPad 2 has OpenGL ES 2.0 hardware, I believe. So I’m subject to this limitation. This isn’t really something I ever had to deal with when developing for PCs or the Xbox 360. I’m guessing this has something to do with a more primitive texture cache on less-advanced GPUs.
The fix was theoretically simple: just move the texture coordinate calculations to the vertex shader. Unfortunately this wasn’t possible in my scenario, since SpriteBatch’s fixed vertex format has no room for a second set of texture coordinates. So I ended up having to “do it properly”: basically re-implement a subset of XNA’s SpriteBatch and plumb through the extra pair of texture coordinates. With a proper set of texture coordinates passed into the shader, I no longer need to do extraneous calculations in the pixel shader. Performance is back up above 60 FPS.
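The reason moving the calculation works at all is that an affine calculation gives the same answer whether it is evaluated per pixel, or evaluated once per vertex and interpolated. Here is a small sketch of that equivalence. The affine moss transform (`scale`, `offset`) is hypothetical, not the game’s actual math, and the bilinear interpolation across a quad is a stand-in for what the rasterizer does for a flat 2D sprite.

```python
def moss_coords(u, v, scale=2.0, offset=(0.25, 0.5)):
    """Hypothetical affine moss transform (illustrative values only)."""
    return (u * scale + offset[0], v * scale + offset[1])


def lerp(a, b, t):
    return a + (b - a) * t


def interpolate(c00, c10, c01, c11, s, t):
    """Bilinearly interpolate per-vertex 2D values across a quad."""
    top = tuple(lerp(a, b, s) for a, b in zip(c00, c10))
    bottom = tuple(lerp(a, b, s) for a, b in zip(c01, c11))
    return tuple(lerp(a, b, t) for a, b in zip(top, bottom))


# "Vertex shader": evaluate the transform once at each corner of the quad.
corners = [moss_coords(u, v) for (u, v) in [(0, 0), (1, 0), (0, 1), (1, 1)]]

# "Pixel shader": the interpolated value matches a direct per-pixel evaluation.
s, t = 0.3, 0.7
per_vertex = interpolate(*corners, s, t)
per_pixel = moss_coords(s, t)
assert all(abs(a - b) < 1e-9 for a, b in zip(per_vertex, per_pixel))
```

This only holds for calculations that are linear in the interpolated inputs. Anything non-linear, or anything that depends on the result of another texture fetch, would not survive interpolation and has to stay in the pixel shader.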