Shader Programs

Writing efficient shaders is critical to achieving good performance. Treat shaders like code that runs in the innermost loops on a CPU: littering them with conditionals or recomputing loop invariants carries a very high cost. Before optimizing an expensive vertex shader, make sure geometry that is entirely outside the view frustum is being culled on the CPU. Before optimizing an expensive fragment shader, make sure the application is not generating an excessive number of fragments with it.

Note:

When optimizing shaders, any source code that does not contribute to the output variables is optimized out by the compiler. This feature can be exploited to find out whether a shader is part of the current bottleneck: multiply the output variable by a zero vector to reduce the workload, then measure whether the frame rate improves. Conversely, at the final stages of optimization, one can quickly measure whether there is headroom for increasing the workload, either to offload computations to the shader units or to improve image quality, by adding meaningless but expensive ALU instructions, or texture sampling, to the output variables.
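
As a minimal sketch (variable names are illustrative), the lighting computation below becomes dead code once the output is multiplied by zero, so the compiler strips it; if the frame rate rises, this shader was part of the bottleneck:

precision mediump float;
varying vec3 v_normal;
uniform vec3 u_lightDir;

void main()
{
    float nDotL = max(dot(normalize(v_normal), u_lightDir), 0.0);
    // Multiplying by zero makes the code above dead; the compiler strips
    // it, temporarily removing this shader's ALU workload.
    gl_FragColor = vec4(nDotL) * 0.0;
}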

Move computations up the pipeline (P1)

As the rendering pipeline is traversed from the CPU to the vertex processor and then to the fragment processor, the required workload tends to increase by a few orders of magnitude at each stage. Computations that are constant per model, per primitive, or per vertex do not belong in the fragment processor and should be moved up to the vertex processor or earlier. Per-draw-call computations do not belong in the vertex processor and should be moved to the CPU. For instance, if lighting is done in eye space, the light vector should be transformed into eye space once and stored in a uniform rather than being recomputed for each vertex or, even worse, for each fragment. The light vector should naturally be stored pre-normalized. Since the light vector is usually constant for the whole draw call, these computations do not belong in any shader.
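
A minimal vertex-shader sketch (uniform and attribute names are illustrative): the light direction has been transformed to eye space and normalized once on the CPU, then uploaded as a uniform, rather than being recomputed per vertex:

attribute vec3 a_position;
attribute vec3 a_normal;
uniform mat4 u_modelViewProjection;
uniform mat3 u_normalMatrix;
uniform vec3 u_lightDirEye;   // transformed and normalized once on the CPU
varying float v_nDotL;

void main()
{
    // No per-vertex light transform or normalize needed here.
    v_nDotL = max(dot(u_normalMatrix * a_normal, u_lightDirEye), 0.0);
    gl_Position = u_modelViewProjection * vec4(a_position, 1.0);
}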

Do not write large or generalized shaders (P2)

It is critical to resist the temptation to write shader programs that take different code paths depending on whether one or more constant variables have a particular value. Uniforms are intended as constants for one (or hopefully many) primitives—they are not substitutes for calling UseProgram. Shaders should be minimal and specialized to the task they perform. It is much better to have many small shaders that run fast than a few large shaders that all run slowly. Code re-use (where source shaders are supported) should be handled at the application level using the ShaderSource function. When this advice conflicts with the goal of minimizing shader and state changes, smaller and more specialized shaders are generally preferred. Additionally, be careful when writing shader functions intended for concatenation into the final shader source code: shared functions tend to be overly generic and make it harder to exploit possible shortcuts.
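
As an illustration of the anti-pattern (names are hypothetical), the fragment shader below branches on a pseudo-constant uniform; two specialized programs selected with UseProgram are generally preferable:

precision mediump float;
uniform bool u_enableFog;     // pseudo-constant "mode" switch: avoid this
uniform vec4 u_color;
uniform vec4 u_fogColor;
varying float v_fogFactor;

void main()
{
    // Every fragment pays for this test even though the answer is fixed
    // for the whole draw call; split into two specialized shaders instead.
    if (u_enableFog)
        gl_FragColor = mix(u_fogColor, u_color, v_fogFactor);
    else
        gl_FragColor = u_color;
}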

Take advantage of application-specific knowledge (P3)

Application-specific knowledge can be used to simplify or avoid computations. Math shortcuts should be pursued everywhere because there are optimizations that the shader compiler or the GPU cannot make. For instance, rendering screen-aligned primitives is common for 2D user interfaces and post-processing effects. In this case, the entire modelview transformation can be avoided by defining the vertices in NDC (normalized device coordinates). A full-screen quad has its vertex coordinates in the [-1.0,1.0] range, so these can be passed directly from the vertex attribute to gl_Position. The types of matrix transformations applied in the application when creating the modelview matrix should be tracked and exploited when possible. For instance, if the matrix is orthonormal (i.e., it contains no scaling or shearing), its inverse-transpose is the matrix itself, so normals can be transformed with the upper 3x3 of the modelview matrix directly instead of with a separately computed inverse-transpose sub-matrix.
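
A minimal sketch of a full-screen quad vertex shader (attribute names are illustrative): positions are specified directly in NDC, so no matrix transform is needed:

attribute vec2 a_position;    // quad corners in [-1.0, 1.0]
varying vec2 v_texCoord;

void main()
{
    v_texCoord = a_position * 0.5 + 0.5;      // remap NDC to [0,1]
    gl_Position = vec4(a_position, 0.0, 1.0); // no modelview-projection
}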

Optimize for depth and stencil culling (P4)

The GPU can quickly reject fragments based on depth or stencil testing before the fragment shader is executed. The depth complexity of a scene is the number of times each pixel is touched by a fragment. Depth complexity can be measured by incrementing values in the stencil buffer, as sketched below. A high depth complexity in a 3D scene can be the result of rendering opaque objects in a non-optimal order. The worst case is rendering back-to-front (a.k.a. painter's algorithm) because it leads to a large number of fragments being overdrawn. An application with high depth complexity should ensure that opaque objects are rendered in sorted front-to-back order with depth testing enabled. Straightforward rendering of 2D user interfaces also leads to a high depth complexity, which can often be decreased with the same technique, but also by using the stencil buffer to mask fragments. Applications that are heavily fragment limited can be sped up significantly with clever use of these techniques—sometimes up to a factor of 10 or more.
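
A sketch of measuring depth complexity with the stencil buffer (C, standard GL entry points; drawScene is an assumed application function):

/* Increment the stencil value for every fragment that is rasterized. */
glEnable(GL_STENCIL_TEST);
glStencilFunc(GL_ALWAYS, 0, 0xff);
glStencilOp(GL_KEEP, GL_INCR, GL_INCR);
drawScene();
/* Reading back the stencil buffer now gives per-pixel overdraw counts. */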

If vertex processing is not a bottleneck, it is worthwhile to run experiments that prime the depth buffer in a first pass. Disable all color writes with ColorMask during this pass. The fragment depths laid down in the first pass can then serve as occluders in a second pass, when color writes are enabled and the expensive fragment shaders are executed. Disable depth writes with DepthMask in the second pass, since there is no point in writing the same depth values twice.
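
A sketch of the two passes in C (drawOpaqueGeometry is an assumed application function; the depth function for the second pass is an assumption—GL_EQUAL also works when both passes submit identical geometry):

/* Pass 1: lay down depth only, with a trivial fragment shader. */
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_TRUE);
glDepthFunc(GL_LESS);
drawOpaqueGeometry();

/* Pass 2: full shading; the depth buffer already holds final values. */
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_FALSE);               /* no point writing depth twice */
glDepthFunc(GL_LEQUAL);              /* only visible fragments get shaded */
drawOpaqueGeometry();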

Do not discard fragments, or modify depth, unless absolutely necessary (P5)

Some operations prevent the hardware from enabling its automatic optimization that rejects fragments early in the pipeline (early-Z). In particular, the discard operation, which throws away fragments based on some criterion, disables early-Z on some platforms. It is critical to limit the use of discard (e.g., for alpha testing) as much as possible, unless depth writing can be disabled. Another example is found in the GL_NV_fragdepth extension available on some platforms, which allows the depth value to be written from the fragment shader. This operation also forces the GPU to opt out of early-Z rejection in order to ensure correct rendering.
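
For illustration (names are illustrative), the alpha-test pattern below is the kind of discard usage to scrutinize; on some platforms this draw loses early-Z rejection:

precision mediump float;
uniform sampler2D u_texture;
varying vec2 v_texCoord;

void main()
{
    vec4 texel = texture2D(u_texture, v_texCoord);
    // discard may force the GPU to give up early-Z for this draw call;
    // consider pre-cut geometry or blending if depth writes can be off.
    if (texel.a < 0.5)
        discard;
    gl_FragColor = texel;
}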

Avoid conditionals in shaders when possible (P6)

Fragments are processed in chunks, and both branches of a conditional may need to be evaluated before the GPU can discard the results of the branch not taken. Be careful about assuming that conditionals skip computations and reduce the workload. This warning is particularly relevant to fragment shaders. Benchmark the shaders to determine whether conditionals in the vertex or fragment shaders actually end up decreasing the workload. Some conditional code can be rewritten in terms of algebra and/or built-in functions. For instance, the dot product between a normal and a light vector may be negative, in which case the result is not needed in a lighting equation. Instead of:

if (nDotL > 0.0) ...

the value can be clamped with:

clamp(nDotL, 0.0, 1.0)

and unconditionally used in the result (a negative value clamps to zero, which zeroes the lighting term). Clamp may be faster than max and/or min for the 0.0 and 1.0 cases, but as always, benchmarking will have the final say in the matter. Another reason to make an effort to avoid conditionals in fragment shaders is that mipmapped texture fetches return undefined results when executed inside a block whose condition depends on run-time values, because the implicit derivatives used to select the mipmap LOD are undefined there. Although the texture*Lod variants of the GLSL functions can be used to bias or specify the mipmap LOD, it is expensive to derive the mipmap LOD manually. In addition, these LOD-biasing samplers may not run as fast as the non-LOD samplers.
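
As a sketch of keeping texture fetches in uniform control flow (names are illustrative), sample before any run-time condition and fold the result in afterwards:

// The fetch executes unconditionally, so the implicit mipmap LOD
// derivatives remain well defined; the lighting term is folded in after.
vec4 texel = texture2D(u_diffuseMap, v_texCoord);
vec3 color = u_ambient + texel.rgb * clamp(nDotL, 0.0, 1.0);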

Use appropriate precision qualifiers (P7)

Recall that the default precision in vertex shaders is highp, and that fragment shaders have no default precision until it is explicitly set. Precision qualifiers are valuable hints to the compiler that reduce register pressure and may improve performance for several reasons; for instance, low precision may run twice as fast in hardware as high precision. These optimizations can be approached by initially using highp and gradually reducing the precision towards lowp until rendering artifacts appear; if it looks good enough, then it is good enough. As a rule of thumb, vertex positions and exponential and trigonometric functions need highp. Texture coordinates may need anything from lowp to highp depending on the texture width and height. Many application-defined uniform variables, interpolated colors, normals, and texture samples can usually be represented using lowp. However, floating-point texture samplers need more than low precision; this is one of several reasons to minimize the use of floating-point textures (T6).
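
A fragment-shader sketch of mixed precision qualifiers (the choices shown are starting points to validate visually, not rules):

precision mediump float;              // explicit default for this shader

uniform sampler2D u_texture;
uniform lowp vec4 u_materialColor;    // colors usually fit in lowp
varying lowp vec3 v_normal;           // unit-length vectors often do too
varying highp vec2 v_texCoord;        // large textures may need highp

void main()
{
    lowp vec4 texel = texture2D(u_texture, v_texCoord);
    gl_FragColor = texel * u_materialColor;
}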

Use the built-in functions and variables (P8)

The built-ins have a higher chance of compiling optimally and may even be implemented in hardware. For instance, do not write shader code to determine whether a primitive is front-facing or to compute the reflection vector in terms of dot products and algebra; instead, use the built-in variable gl_FrontFacing or the built-in function reflect, respectively.
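
For instance (a sketch; variable names are illustrative):

// Built-in reflect instead of hand-written 2.0 * dot(n, l) * n - l:
vec3 r = reflect(-lightDir, normal);
// Built-in gl_FrontFacing instead of computing facing from geometry:
vec4 color = gl_FrontFacing ? u_frontColor : u_backColor;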

Consider encoding complex functions in textures (P9)

Shaders normally contain both arithmetic (ALU) and texture operations. A batch of ALU operations may hide the latency of fetching samples from texture because the two occur in parallel. If a shader is the primary bottleneck, and the ALU operations significantly outnumber the texture operations, it is worthwhile to investigate whether some of these operations can be encoded in textures. Sub-expressions in shaders can sometimes be stored in look-up tables (LUTs), implemented as 1D or 2D textures with sufficient precision and accessed with NEAREST filtering.
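
A sketch of a LUT fetch replacing an ALU-heavy function (the 256x1 texture and its contents are assumptions):

uniform sampler2D u_lut;    // 256x1 texture storing f(x) for x in [0,1],
                            // created with NEAREST filtering

// One fetch replaces the ALU-heavy evaluation of f(x):
float fx = texture2D(u_lut, vec2(x, 0.5)).r;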

Note:

The old trick of using cubemaps to normalize vectors is most likely a performance loss on discrete GPUs. If you pursue this idea, then make sure to benchmark to determine if you have improved or worsened the performance!

Limit the amount of indirect texturing (P10)

Indirect texturing can sometimes be useful, but when the result of one texture operation determines the coordinates of another, the latency of texture sampling is difficult to hide. It also tends to lead to scattered reads that reduce the benefit of the texture cache. Indirect texturing can sometimes be reduced, or avoided, at the expense of memory. Whether that trade-off makes sense should of course be analyzed and benchmarked.
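
For illustration (names are illustrative), the dependent read below is the pattern to limit—the second fetch cannot be issued until the first one returns:

// Indirect texturing: the coordinates of the second fetch depend on the
// result of the first, serializing the two and scattering cache accesses.
vec2 offset = texture2D(u_indirection, v_texCoord).rg;
vec4 color  = texture2D(u_colorMap, v_texCoord + offset);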

Do not let GLSL syntax obscure math optimizations (P11)

The GLSL shading language conveniently overloads the arithmetic operators for vectors and matrices. Care must be taken not to miss optimization opportunities due to this syntactic convenience. For instance, a rotation matrix can, but should not, be defined as a homogeneous 4x4 matrix just because the other operand is a vec4. A general rotation is fully described by a 3x3 matrix and should be applied to vec3 vectors. A rotation around one of the principal axes can be done even more cheaply than mat3 * vec3 by accessing the relevant vector components directly with cos and sin. Take advantage of any application-specific knowledge to reduce the number of scalar instructions needed.
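
A vertex-shader sketch of rotating around the z axis directly with sin and cos (names are illustrative), which needs four multiplications instead of the nine in a full mat3 * vec3:

attribute vec3 a_position;
uniform float u_angle;
uniform mat4 u_projection;

void main()
{
    float s = sin(u_angle);
    float c = cos(u_angle);
    // Equivalent to multiplying by a z-axis rotation matrix, minus the
    // multiplications and additions with the matrix's zero terms:
    vec3 p = vec3(c * a_position.x - s * a_position.y,
                  s * a_position.x + c * a_position.y,
                  a_position.z);
    gl_Position = u_projection * vec4(p, 1.0);
}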

Only normalize vectors when it is necessary (P12)

Normalization of vectors is required to efficiently calculate the angle between them or, perhaps more typically, the diffuse component in many commonly used lighting equations. In theory, normalization is required for geometry normals after they have been transformed with the normal matrix. In practice, it may not matter, depending on the composition of that matrix. It is a common mistake to normalize vectors where it is not necessary, or at least not visually discernible, making the application run slower for no good reason. For instance, consider interpolating vertex normals in order to compute lighting per fragment (i.e., Phong shading). If the normal matrix only rotates, there is little reason to normalize the normal vectors before interpolation.

Note:

Barycentric interpolation does not preserve the unit length of these vectors, so normals interpolated in varying variables must be re-normalized in the fragment shader to ensure that the dot product with the light vector obeys the cosine emission law.
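
A minimal fragment-shader sketch (names are illustrative): the interpolated normal is re-normalized once per fragment before the lighting dot product:

precision mediump float;
uniform vec3 u_lightDir;     // pre-normalized in eye space
varying vec3 v_normal;       // interpolated, no longer unit length

void main()
{
    float nDotL = max(dot(normalize(v_normal), u_lightDir), 0.0);
    gl_FragColor = vec4(vec3(nDotL), 1.0);
}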