This post goes over some things I noticed while optimizing my vegetation/wind vertex shader. I mentioned in a previous post that it was 62 instructions. Adding a few more features and instancing support brought it up to 85 instructions. That’s a fairly complex shader for something that’s run tens of thousands of times per frame. After a few hours of work, I got it back down to 66 instructions and made similar optimizations in a few other shaders.
Beware of %
In a number of places I was using the % operator with 1 as the second argument. The fmod intrinsic offers the same functionality, and in my scenarios using % 1 added 6 instructions per use. The compiler is not always able to optimize this (though sometimes it does). By switching to fmod, I brought my precipitation shader from 69 instructions down to 57.
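To make the operation concrete, here is a minimal Python sketch of what fmod with 1 computes. It assumes HLSL's fmod follows the same truncated-remainder rule as C's fmod (the result keeps the sign of the first argument), which is how the intrinsic is documented. The helper names fmod_one and floor_wrap are made up for illustration. Python's own % operator uses a different sign rule, so it is deliberately not used as a stand-in for the HLSL operator here.

```python
import math

def fmod_one(x: float) -> float:
    """Truncated remainder of x / 1; the result keeps the sign of x."""
    return math.fmod(x, 1.0)

def floor_wrap(x: float) -> float:
    """Floor-based wrap into [0, 1), shown only for comparison."""
    return x - math.floor(x)

for value in (2.75, -2.75):
    # Columns: input, truncated remainder, floor-based wrap
    print(value, fmod_one(value), floor_wrap(value))
```

For positive inputs the two helpers agree; for negative inputs they do not.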
Skip normalize… if you can
In my vertex shader I’m normalizing the normal after multiplying it by the World matrix: normalize(mul(input.Normal, World)). Removing the normalize would save four dp3 instructions, but you can only do this if you can ensure the matrix is orthonormal, which means no scaling (only rotation and translation). Unfortunately that isn’t currently the case in my scenario, since I do scale my vegetation. I may be able to get away with removing the scale from the World matrix and applying it separately, as long as I’m scaling equally in all directions (I haven’t tried this yet).
Note that you do need to re-normalize in the pixel shader since linear interpolations between two normals aren’t themselves guaranteed to be unit vectors.
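As a quick numeric check of both points, here is a small Python sketch (not shader code). It shows that a uniform scale changes a normal's length, and that the halfway interpolation between two unit normals is shorter than 1 until it is re-normalized. The helper names and the sample vectors are assumptions chosen for illustration.

```python
import math

def length(v):
    """Euclidean length of a 3-component vector."""
    return math.sqrt(sum(c * c for c in v))

def normalize(v):
    """Scale v to unit length."""
    inv = 1.0 / length(v)
    return tuple(c * inv for c in v)

def lerp(a, b, t):
    """Component-wise linear interpolation, like the rasterizer does per pixel."""
    return tuple(x + (y - x) * t for x, y in zip(a, b))

n0 = (1.0, 0.0, 0.0)
n1 = (0.0, 1.0, 0.0)

halfway = lerp(n0, n1, 0.5)           # interpolated normal between two vertices
scaled = tuple(2.0 * c for c in n0)   # normal after a uniform scale of 2

print(length(halfway))                # shorter than 1
print(length(normalize(halfway)))     # back to unit length
print(length(scaled))                 # no longer unit length
```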
Use dot to sum components
I was adding three components of a vector together in order to determine the phase for controlling per-leaf vegetation bending and coloring.
fObjPhase = objectPosition.x + objectPosition.y + objectPosition.z;
Instead, by using dot I can save 1 instruction:
fObjPhase = dot(objectPosition.xyz, 1);
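The two forms compute the same value because a dot product with a vector of ones is just the component sum. A tiny Python check, with a hypothetical dot3 helper and made-up sample coordinates:

```python
def dot3(a, b):
    """3-component dot product."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

object_position = (1.5, -2.25, 4.0)  # sample values, chosen arbitrarily

phase_sum = object_position[0] + object_position[1] + object_position[2]
phase_dot = dot3(object_position, (1.0, 1.0, 1.0))

print(phase_sum, phase_dot)
```

The instruction saving itself comes from the shader compiler emitting a single dp3 instead of two adds; the Python version only confirms the arithmetic.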
Beware of extraneous calculations
In addition to requiring wind direction (a 2-component vector), I also needed the scalar wind strength value. I was doing the following in my vertex shader:
float windStrength = length(instanceWindDirectionAndStrength);
This actually compiles to four instructions (it needs to do a square root, etc.). The wind data was instance data stored in a second vertex stream, so to save those 4 instructions I ended up calculating the strength once on the CPU and adding it to the vertex declaration. Note that this kind of “solution” could actually hurt performance: if increasing the size of the vertices hurt vertex cache performance, that could more than cancel out the gain from the shorter vertex shader. In my case though, these vertices represent instance data, so they are only fetched once per object drawn.
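The CPU-side step can be sketched roughly like this. The code is Python rather than my actual engine code, and the function name pack_instance_wind and the three-float layout are assumptions for illustration, not the real vertex declaration.

```python
import math

def pack_instance_wind(direction_x: float, direction_y: float):
    """Return per-instance wind data with the strength precomputed.

    The strength is the length of the 2-component wind vector, which the
    vertex shader would otherwise recompute for every vertex.
    """
    strength = math.hypot(direction_x, direction_y)
    return (direction_x, direction_y, strength)

# One entry per drawn object, written into the instance stream once.
instance = pack_instance_wind(3.0, 4.0)
print(instance)
```

The shader then reads the third component directly instead of calling length.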
Help the pre-shader
For values which are calculated from shader constants that remain the same across all vertices for a draw call, the pre-shader can pre-calculate these on the CPU. Sometimes the compiler isn’t very good at figuring these out.
When doing lighting calculations and reading from my G-buffer, I had the following line of code to reconstruct world-space depth from view-space depth:
return -helperRay * (depthValue * (NearPlane - FarPlane) - NearPlane) / FarPlane;
Looking at the assembly, it seemed this was turning into a large number of instructions for what it does. By “doing the algebra” myself, I came up with this equivalent:
float c1 = -NearPlane / FarPlane;
float c2 = (NearPlane - FarPlane) / FarPlane;
return -helperRay * (depthValue * c2 + c1);
This was enough to help the compiler realize that an additional value was constant, and this saved 1 instruction (which moved to the pre-shader).
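For anyone who wants to double-check the algebra, here is a small Python sketch comparing the two forms of the scalar factor (the common -helperRay multiplier is left out since it is the same in both). The function names and the sample near/far/depth values are assumptions for illustration.

```python
def depth_factor_original(depth_value, near_plane, far_plane):
    """Scalar factor as written in the original shader line."""
    return (depth_value * (near_plane - far_plane) - near_plane) / far_plane

def make_depth_constants(near_plane, far_plane):
    """The part that depends only on shader constants (pre-shader work)."""
    c1 = -near_plane / far_plane
    c2 = (near_plane - far_plane) / far_plane
    return c1, c2

def depth_factor_refactored(depth_value, c1, c2):
    """Per-pixel work after the constants are hoisted: one multiply-add."""
    return depth_value * c2 + c1

near_plane, far_plane = 0.1, 1000.0
c1, c2 = make_depth_constants(near_plane, far_plane)

for depth_value in (0.0, 0.25, 1.0):
    a = depth_factor_original(depth_value, near_plane, far_plane)
    b = depth_factor_refactored(depth_value, c1, c2)
    print(depth_value, a, b)
```

The two columns agree up to floating-point rounding, since dividing each term by FarPlane separately is the same expression distributed.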
A more dramatic example was in my vegetation shader where I do this:
float4 viewPosition = mul(worldPosition, View);
output.ViewSpaceDepth = viewPosition.z;
output.Position = mul(viewPosition, Projection);
There are two matrix multiplies here (8 dp4 instructions), but we really only need z from the intermediate value. Simply rewriting the code like so saved 4 instructions:
float4x4 vp = mul(View, Projection);
float4 viewPosition = mul(worldPosition, View);
output.ViewSpaceDepth = viewPosition.z;
output.Position = mul(worldPosition, vp);
Of course this added 28 instructions to the pre-shader (to multiply two full matrices), but these are only executed once.
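The rewrite relies on matrix multiplication being associative: (p · View) · Projection equals p · (View · Projection). Here is a Python sketch that checks this numerically using row vectors, matching the mul(vector, matrix) order in the shader. The helper names and the sample matrices are made up for illustration and are not my actual camera matrices.

```python
def mat_mul(a, b):
    """4x4 matrix product a * b (done once per draw call, like the pre-shader)."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def vec_mul(v, m):
    """Row vector times 4x4 matrix, like HLSL mul(vector, matrix)."""
    return [sum(v[i] * m[i][j] for i in range(4)) for j in range(4)]

def view_depth_only(v, view):
    """Only the z component of v * view: a single 4-component dot product."""
    return sum(v[i] * view[i][2] for i in range(4))

# Sample data: a rotation plus translation for View, and a perspective-like
# Projection. The numbers are arbitrary.
view = [[0.0, 0.0, -1.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
        [2.0, -3.0, -10.0, 1.0]]
projection = [[1.5, 0.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, -1.002, -1.0],
              [0.0, 0.0, -0.2, 0.0]]
world_position = [4.0, 5.0, 6.0, 1.0]

# Original path: two vector-matrix multiplies per vertex.
two_step = vec_mul(vec_mul(world_position, view), projection)

# Rewritten path: combine the matrices once, then one multiply per vertex.
vp = mat_mul(view, projection)
one_step = vec_mul(world_position, vp)

print(two_step)
print(one_step)
print(view_depth_only(world_position, view))
```

Both paths give the same clip-space position up to rounding, and the view-space depth still comes from a single dot product against one column of View.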