I finally decided I need a real GPU profiling method on the PC. All performance tuning I’ve done so far is just by turning on and off features and seeing how it affects frame render time.
Given my desktop machine has an NVIDIA GPU, I first tried NVIDIA nSight. I’d already installed this before and quickly uninstalled it as it caused Visual Studio to crash whenever I looked at a shader file (it integrates with Visual Studio – no way around it). Only a fool expects different results when repeating the same thing, so I guess I am a fool.
After downloading the 32 bit version (since both XNA and Visual Studio are 32 bit apps), it told me I needed to use the 64 bit installer. Ok. There is no guidance on the download page. -1 point NVIDIA.
I downloaded the 64 bit version, which was more than twice the size. After installing it, it was clear it included all the CUDA stuff, even though I chose the version that did not have any CUDA stuff. -2 points NVIDIA.
And of course, just as when I had tried this before, Visual Studio crashed as soon as I opened a shader file. -3 points NVIDIA.
As before, I uninstalled it.
Next I tried NVIDIA PerfHUD, which has been deprecated in favor of nSight. But it looked like it might be useful. I downloaded it and installed it, but it was nowhere to be found. I didn’t get any errors, it just didn’t seem to install anything on my machine. -4 points NVIDIA.
I figured it might be time to ask questions about all this in the NVIDIA support forums – or at least poke around for an answer. Lo and behold, I need a password to even read the forums! -5 points NVIDIA.
I knew I’d signed up for an account here before, so I dug around on my machine and found my password. It didn’t work. I clicked on a “forgot my password” link, and the NVIDIA website said it had sent an email to me with instructions to reset my password. Well, several hours later, I still have no email from them. -6 points NVIDIA!
Seriously, what a completely awful developer experience. Everything I tried from NVIDIA seemed 100% broken at every step.
In contrast, most of my experience with Intel’s GPU profiling tool (Intel GPA) has been great. The website is clear, and documentation/tutorials on how to use the profiler seem to be abundant (although for some reason the download button doesn’t work in IE 9).
I was able to get it up and running on my machine in a matter of minutes, and probably within 10 minutes after installing I was looking at a trace of a frame capture from my game.
Unfortunately, given that I don’t have an Intel GPU on my desktop machine, the metrics it records are fairly limited. I was interested in how much my vertex shader texture samples were impacting performance – but all I could get was PS time and VS time per draw call. And I didn’t quite believe the numbers.
I then installed it on my MacBook (which has an Intel GPU), and within a few minutes had a much more detailed trace of a frame capture. There’s a lot of data here, and I’m not sure how to process it yet.
Here’s a quick overview of time spent rendering my frame (I added the red text and arrows).
I have more trust in the numbers here (on the Intel GPU) than I do in Intel’s tool on the NVIDIA GPU.
When I (in the game) “remove” the terrain vertex shader (by changing it to be a simple quad), it reduces frame time by roughly 1ms. And I see in the analysis that the vertex shader took 0.7ms for terrain rendering to the G-buffer, and 0.2ms for terrain rendering to the shadow map. So those numbers pretty much add up.
Why am I even bothering with this though? The pixel shader for terrain completely overwhelms the vertex shader time (21ms to 1ms). While that may be true on the Intel GPU, it definitely isn’t as true on my NVIDIA GPU (which presumably has better texture fetch bandwidth?). And anyway, 1ms is 1ms. If I can reduce it, I may as well.
I was wondering if I could somehow leverage better vertex cache performance. And the numbers here tell me a lot.
My scrolling texture grid is a 96 x 96 grid. That’s 9,409 vertices (97 x 97), and 18,432 triangles (two per grid cell).
18,432 triangles is 55,296 vertices, but of course the vertex cache is leveraged so that we actually process a lot less than that. The theoretical minimum we could process is 9,409 (the number of unique vertices), but of course we’re limited by the size of the vertex cache and the ways we can triangulate the terrain grid.
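Those counts are easy to sanity check. Here is a minimal Python sketch, assuming the grid is drawn as a plain indexed triangle list with two triangles per cell (the function name is made up for illustration):

```python
def grid_counts(quads_per_side):
    """Counts for a square grid drawn as an indexed triangle list."""
    # An N x N grid of cells has (N + 1) x (N + 1) corner vertices.
    unique_vertices = (quads_per_side + 1) ** 2
    # Each cell is split into two triangles.
    triangles = 2 * quads_per_side ** 2
    # With no vertex cache at all, every index costs one vertex shader run.
    indices = 3 * triangles
    return unique_vertices, triangles, indices


if __name__ == "__main__":
    print(grid_counts(96))  # (9409, 18432, 55296)
```

The first and last numbers bracket what the vertex cache can do: 9,409 runs if every vertex is shaded exactly once, 55,296 if nothing is ever reused.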
What’s the best I could hope for? Well I’m triangulating my terrain grid “row by row”. I can pretty much forget about the previous row’s processed vertices still being around, so there’s no sharing there. In one row, I have 192 primitives. The first one processes 3 vertices, and each one after that has 2 vertices in common with the previous primitive, so that’s only one new vertex per primitive. That comes to 194 vertices processed per row (the same as the two lines of 97 vertices that bound the row). At 96 rows, that’s 18,624 vertex shader runs.
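The "nothing survives from the previous row" model can be written down directly. This is only a sketch under that assumption: it treats the cache as perfect within a row and empty at the start of each row, and the vertex numbering and triangle order are my own guess at a typical layout, not taken from the actual index buffer.

```python
def shaded_vertices_row_by_row(quads_per_side):
    """Vertex shader runs if the cache is perfect within a row of cells
    and completely forgotten between rows (an assumed, simplified model)."""
    stride = quads_per_side + 1  # vertices per line of the grid
    runs = 0
    for row in range(quads_per_side):
        seen = set()  # stands in for the cache; reset for every row
        for col in range(quads_per_side):
            tl = row * stride + col          # top-left corner of the cell
            tr, bl, br = tl + 1, tl + stride, tl + stride + 1
            # Two triangles per cell: (tl, bl, tr) and (tr, bl, br).
            for v in (tl, bl, tr, tr, bl, br):
                if v not in seen:
                    seen.add(v)
                    runs += 1
    return runs


if __name__ == "__main__":
    print(shaded_vertices_row_by_row(96))  # 18624
```

Under this model each row costs 194 runs: 3 for the first triangle plus 1 for each of the other 191. Over 96 rows that is 18,624.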
Anyway, the Intel GPA tells you exactly how many times the vertex shader was executed for a draw call. And when rendering terrain to the GBuffer, it’s 14,848 times. That’s not too bad. If the numbers are to be believed, that’s one seriously big vertex cache!
Now, I render the same terrain grid when creating the shadow map. However, in this case the vertex shader is only run 11,166 times! That’s pretty close to the theoretical minimum of 9,409. Why is it different? Well I’m only guessing, but I suspect it’s because a lot more vertices will fit into the post-transform vertex cache in that case. The terrain vertex shader for shadows outputs 8 floats, while the one for actually rendering the terrain outputs 27 floats!
Just for kicks, I tried triangulating my terrain grid in a slightly different way. I alternate the direction of each row of primitives. That means there is more chance for triangles on different rows to share processed vertices from the previous row. And the result? I’m now down to 13,665 vertices processed for the G-buffer terrain (a decrease of 1,183). But up to 12,542 for the shadow map terrain – an increase of 1,376! I certainly didn’t expect that.
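One way to get a feel for how the two triangulations interact with cache size is to simulate them. The sketch below is a toy model, not a description of any real GPU: it assumes a simple FIFO post-transform cache with a fixed number of entries, and both index orders are my own guess at what "row by row" and "alternating direction" mean. The cache sizes in the loop are arbitrary, so don't expect it to reproduce the GPA numbers exactly.

```python
from collections import deque


def grid_indices(n, serpentine=False):
    """Triangle-list indices for an n x n grid of cells, two triangles each.
    With serpentine=True, every other row of cells is walked right to left."""
    stride = n + 1  # vertices per line of the grid
    indices = []
    for row in range(n):
        reverse = serpentine and row % 2 == 1
        cols = range(n - 1, -1, -1) if reverse else range(n)
        for col in cols:
            tl = row * stride + col
            tr, bl, br = tl + 1, tl + stride, tl + stride + 1
            first, second = (tl, bl, tr), (tr, bl, br)
            if reverse:
                # Same triangles and winding, emitted in the opposite order
                # so consecutive triangles still share an edge.
                first, second = second, first
            indices.extend(first + second)
    return indices


def count_misses(indices, cache_size):
    """Vertex shader runs for an assumed FIFO cache of cache_size entries."""
    cache = deque(maxlen=cache_size)  # oldest entry drops off when full
    misses = 0
    for i in indices:
        if i not in cache:  # a hit does not refresh the entry (FIFO, not LRU)
            misses += 1
            cache.append(i)
    return misses


if __name__ == "__main__":
    for size in (16, 32, 64, 128, 256):
        plain = count_misses(grid_indices(96), size)
        snake = count_misses(grid_indices(96, serpentine=True), size)
        print(f"cache={size:4d}  row-by-row={plain:6d}  alternating={snake:6d}")
```

Whether alternating rows helps or hurts in this model depends on the cache size relative to the length of a row, which is the kind of effect the two measurements seem to be showing.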
So I’m not too sure what to make of this information yet, other than I probably shouldn’t worry about optimizing this stuff too much at this point.
EDIT: actually, the increase does make sense, if the post-transform cache has room for at least a row of vertices (192 in my case).