Some random HLSL optimizations

This post just goes over some things I noticed while optimizing my vegetation/wind vertex shader. I mentioned in a previous post that it was 62 instructions. The addition of a few more features and the incorporation of instancing support brought it up to 85 instructions. This is a fairly complex shader for something that’s run tens of thousands of times per frame. With a few hours spent, I got it back down to 66 instructions and made similar optimizations in a few other shaders.

Beware of %

In a number of places I was using the % operator with 1 as an argument. The fmod intrinsic offers the same functionality, and in my scenarios using % 1 added 6 instructions per use. The compiler is sometimes able to optimize this away, but not always. By switching to fmod, I brought my precipitation shader from 69 instructions down to 57.
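A minimal sketch of the swap, for illustration only. The function and parameter names are hypothetical, and the two forms are assumed to be fed non-negative values, where % and fmod agree:

```hlsl
// Hypothetical helpers: wrap a scrolling coordinate so it repeats every 1 unit.
// 'scroll' and 'time' are illustrative names, assumed non-negative here.

float WrapWithModulo(float scroll, float time)
{
    // The % operator applied to floats. This is the form that cost
    // extra instructions in the shaders described above.
    return (scroll + time) % 1;
}

float WrapWithFmod(float scroll, float time)
{
    // The same remainder via the fmod intrinsic.
    return fmod(scroll + time, 1);
}
```

Comparing the compiled assembly for both versions is the only reliable way to tell whether the compiler has already made this substitution in a given shader.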

Skip normalize… if you can

In my vertex shader I’m normalizing the normal after multiplying it by the World matrix: normalize(mul(input.Normal, World)). Removing the normalize would save four dp3 instructions, but you can only do this if you can ensure the rotation part of your matrix is orthonormal, which means no scaling (only translation and rotation). Unfortunately that isn’t currently the case in my scenario, since I do scale my vegetation. I may be able to get away with removing the scale from the World matrix and applying it separately, as long as I’m scaling equally in all directions (I haven’t tried this yet).

Note that you do need to re-normalize in the pixel shader since linear interpolations between two normals aren’t themselves guaranteed to be unit vectors.
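The split between the two stages might look roughly like the following sketch. The struct, semantics, and function names are assumptions made for illustration, not the shader discussed in this post:

```hlsl
// Sketch only: names and semantics below are assumptions.
float4x4 World;

struct VSOutput
{
    float4 Position : POSITION0;
    float3 Normal   : TEXCOORD0;
};

// Vertex shader side: rotate the normal by the upper 3x3 of World.
// If that 3x3 is orthonormal (rotation only, no scale), the result is
// already unit length and this normalize could be dropped.
float3 TransformNormal(float3 normal)
{
    return normalize(mul(normal, (float3x3)World));
}

// Pixel shader side: interpolation between unit vectors shortens them,
// so normalize again before lighting. For example, halfway between
// (1,0,0) and (0,1,0) is (0.5,0.5,0), whose length is about 0.707.
float3 ShadingNormal(VSOutput input)
{
    return normalize(input.Normal);
}
```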

Dot product

I was adding three components of a vector together in order to determine the phase for controlling per-leaf vegetation bending and coloring.

fObjPhase = objectPosition.x + objectPosition.y + objectPosition.z;

Instead, by using dot I can save 1 instruction:

fObjPhase = dot(objectPosition.xyz, 1);

Beware of extraneous calculations

In addition to requiring wind direction (a 2-component vector), I also needed the scalar wind strength value. I was doing the following in my vertex shader:

float windStrength = length(instanceWindDirectionAndStrength);

This actually compiles to four instructions (it needs to do a square root, etc.). The wind data was instance data stored in a second vertex stream, so to save those 4 instructions I just ended up calculating the strength once on the CPU and adding it to the vertex declaration. Note that this kind of “solution” could actually hurt performance: if increasing the size of the vertices hurt vertex cache performance, that cost could outweigh the savings from the shorter vertex shader. In my case though, these vertices represent instance data, so they are only fetched once per object drawn.
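On the shader side, the change could look something like this sketch. The struct layout, semantics, and names are hypothetical. The only idea carried over from the text is that the strength arrives as a precomputed per-instance value instead of being derived with length():

```hlsl
// Hypothetical instance-stream layout; semantics and names are assumptions.
struct InstanceInput
{
    float2 WindDirectionAndStrength : TEXCOORD1; // direction scaled by strength
    float  WindStrength             : TEXCOORD2; // its length, computed on the CPU
};

float GetWindStrength(InstanceInput instance)
{
    // Before: return length(instance.WindDirectionAndStrength);
    // After: read the value the CPU already computed for this instance.
    return instance.WindStrength;
}
```

The trade-off is one extra float per instance in exchange for the instructions saved per vertex.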

Help the pre-shader

For values which are calculated from shader constants that remain the same across all vertices for a draw call, the pre-shader can pre-calculate these on the CPU. Sometimes the compiler isn’t very good at figuring these out.

When doing lighting calculations and reading from my G-buffer, I had the following line of code to reconstruct world-space depth from view-space depth:

return -helperRay * (depthValue * (NearPlane - FarPlane) - NearPlane) / FarPlane;

Looking at the assembly, it seemed this was turning into a large number of instructions for what it does. By “doing the algebra” myself, I came up with this equivalent:

	float c1 = -NearPlane / FarPlane;
	float c2 = (NearPlane - FarPlane) / FarPlane;
	return -helperRay * (depthValue * c2 + c1);


This was enough to help the compiler realize that an additional value was constant, and this saved 1 instruction (which moved to the pre-shader).
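For reference, the algebra behind the rewrite can be written out as follows, using the shorthand (introduced here only for brevity) r for helperRay, d for depthValue, N for NearPlane and F for FarPlane:

```latex
\begin{aligned}
-\mathbf{r}\,\frac{d\,(N - F) - N}{F}
  &= -\mathbf{r}\left(d\,\frac{N - F}{F} - \frac{N}{F}\right) \\
  &= -\mathbf{r}\,\bigl(d\,c_2 + c_1\bigr),
  \qquad c_1 = -\frac{N}{F},\quad c_2 = \frac{N - F}{F}
\end{aligned}
```

Both c1 and c2 depend only on shader constants, which is what makes them candidates for the pre-shader.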

A more dramatic example was in my vegetation shader where I do this:

	float4 viewPosition = mul(worldPosition, View);
	output.ViewSpaceDepth = viewPosition.z;
	output.Position = mul(viewPosition, Projection);

There are two matrix multiplies here (8 dp3 instructions), but we really only need z from the intermediate value. Simply rewriting the code like so saved 4 instructions:

	float4x4 vp = mul(View, Projection);
	float4 viewPosition = mul(worldPosition, View);
	output.ViewSpaceDepth = viewPosition.z;
	output.Position = mul(worldPosition, vp);


Of course this added 28 instructions to the pre-shader (to multiply two full matrices), but these are only executed once.
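A variation on the same idea, offered only as a sketch, is to make the single dot product for the depth explicit rather than relying on the compiler to discard the unused components of viewPosition. It assumes the row-vector convention used above (mul(vector, matrix)). The function name and out parameters are hypothetical:

```hlsl
// Sketch only: assumes mul(vector, matrix) with row vectors, as in the post.
float4x4 View;
float4x4 Projection;

void TransformVertex(float4 worldPosition,
                     out float viewSpaceDepth,
                     out float4 position)
{
    // Depends only on constants, so it is eligible for the pre-shader.
    float4x4 vp = mul(View, Projection);

    // The z component of mul(worldPosition, View) is the dot product of
    // worldPosition with the third column of View.
    viewSpaceDepth = dot(worldPosition, View._13_23_33_43);

    position = mul(worldPosition, vp);
}
```

Whether this produces different assembly from the version above would need to be checked against the compiler output.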


One comment on “Some random HLSL optimizations”

  1. I should note that pre-shaders aren’t supported on the Xbox, so some of these optimizations may not apply!
