Geometry

The amount of geometry, as well as the way it is transferred to the GL, can have a very large impact on both CPU and GPU load. If the application sends geometry inefficiently or too frequently, that alone can create a bottleneck on the CPU side that starves the GPU of the data it needs to work efficiently. Geometry must be submitted in sizable chunks to realize the potential of the GPU. At the same time, geometry should be minimally defined, stored on the server side, and accessed in a way that gets the most out of the two GPU caches that exist before and after vertices are transformed.

Use indexed primitives (G1)

The vertex processing engine contains a cache, called the Post-TnL vertex cache, where previously transformed vertices are stored. Taking full advantage of this cache can yield a very large performance improvement when vertex processing is the bottleneck. To use it fully, the GPU must be able to recognize previously transformed vertices, which is only possible when the geometry is specified with indexed primitives. However, for any non-trivial geometry the optimal order of indices will not be obvious to the programmer. If the geometry is complex and the application bottleneck is vertex processing, look into computing a vertex order that maximizes the hit ratio of the Post-TnL cache. The topic has been studied thoroughly for years, and even simple greedy algorithms can provide a substantial performance boost. Good results have been reported with the algorithm described in the document below.


Document	URL to Latest
Linear-Speed Vertex Cache Optimisation, by Tom Forsyth, RAD Game Tools (28th September 2006)	URL

There is a free implementation of the algorithm in a library called vcacne.

Note:

The number of vertex attributes and the size of each attribute may determine the efficiency of this cache—it has storage for a fixed number of bytes or active attributes, not a fixed number of vertices. A lot of attribute data per vertex increases the risk of cache misses, resulting in potentially redundant transformations of the same vertices.
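
To judge whether an index order makes good use of the Post-TnL cache, it can help to estimate how many vertices would be transformed per triangle. The following is a minimal sketch, not a model of any particular GPU: it assumes a FIFO cache holding a fixed number of vertices (the note above explains that the real capacity depends on attribute size), and the function names and the 64-entry limit are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CACHE_ENTRIES 64  /* assumed upper bound for this sketch */

/* Counts how many vertices would have to be transformed when drawing
 * `index_count` indices through a FIFO cache of `cache_size` vertices.
 * Fewer misses means a better cache hit ratio. */
size_t count_cache_misses(const uint16_t *indices, size_t index_count,
                          size_t cache_size)
{
    uint16_t cache[MAX_CACHE_ENTRIES];
    size_t used = 0;    /* number of valid entries */
    size_t next = 0;    /* slot that is overwritten next (FIFO order) */
    size_t misses = 0;

    assert(cache_size > 0 && cache_size <= MAX_CACHE_ENTRIES);

    for (size_t i = 0; i < index_count; ++i) {
        int hit = 0;
        for (size_t j = 0; j < used; ++j) {
            if (cache[j] == indices[i]) {
                hit = 1;
                break;
            }
        }
        if (!hit) {
            cache[next] = indices[i];
            next = (next + 1) % cache_size;
            if (used < cache_size) {
                ++used;
            }
            ++misses;
        }
    }
    return misses;
}

/* Average number of transformed vertices per triangle for a triangle list. */
double average_cache_miss_ratio(const uint16_t *indices, size_t index_count,
                                size_t cache_size)
{
    assert(index_count >= 3 && index_count % 3 == 0);
    return (double)count_cache_misses(indices, index_count, cache_size)
         / (double)(index_count / 3);
}
```

For a quad drawn as the indexed triangles 0, 1, 2 and 2, 1, 3, this model reports 4 transformed vertices (2 per triangle) with a 4-entry cache, whereas the same quad without indices would need 6. Comparing the ratio before and after reordering gives a rough indication of whether a reordering pays off.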

Reduce vertex attribute size and components (G2)

It is important to use an appropriate attribute size and to minimize the number of components, both to avoid wasting memory bandwidth and to increase the efficiency of the cache that stores pre-transformed vertices. This cache is called the Pre-TnL vertex cache. For instance, you rarely need to specify attributes as 32-bit FLOATs. It might be possible to define the object-space geometry using 3 BYTEs per vertex for a simple object, or 3 SHORTs for a more complex or larger object. If the geometry requires a floating-point representation, half-floats (available in extension OES_vertex_half_float) may be sufficient. Per-vertex colors are accurately stored as 3 BYTEs with the normalize flag set in VertexAttribPointer. Texture coordinates can sometimes be represented with BYTEs or SHORTs with the normalize flag set (if not tiling).

Note:

The exception is that texture coordinates do not need to be normalized if they are used only to sample a cube map texture.

Vertex normals can often be represented with 3 SHORTs (in a few cases, such as for cuboids, even with 3 BYTEs) and these should be normalized. Normals can even be represented with 2 components if the direction (sign) of the remaining component is implicit, given that the length of the normal is known to be 1. The remaining coordinate can be derived in a vertex shader (e.g. z = SQRT(1 - (x * x + y * y))) if memory or bandwidth (rather than vertex processing) is a likely bottleneck.

An optimal OpenGL ES application will take advantage of any characteristics specific to the geometry. For instance, a smooth sphere can use the normalized vertex coordinates as the normal; these are trivially computed in a vertex shader. It is important to benchmark intermediate results to ensure the vertex processing engine is not already saturated. Finally, remember that if an attribute is constant for all primitives in a draw call, you should disable that vertex attribute index and set the constant value with VertexAttrib instead of replicating the data.

Pack vertex attributes (G3)

A vertex normally carries several sets of attributes that are completely unrelated. Unlike uniform and varying variables in shader programs, vertex attributes do not get packed automatically, and the number of vertex attributes is a limited resource. Failing to pack attributes together may cause the application to hit this limit sooner than expected. It is more efficient to pack the components into fewer attributes even though they may not be logically related. For instance, if each vertex comes with two sets of texture coordinates for multi-texturing, these can often be combined into one attribute with four components instead of two attributes with two components. Unpacking and swizzling components is rarely a performance consideration.
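
As an illustration of the texture coordinate case, the following sketch interleaves two 2-component sets into one 4-component attribute on the CPU. The function name and the float storage are assumptions made for the example; the same idea applies to smaller data types.

```c
#include <assert.h>
#include <stddef.h>

/* Packs two 2-component texture coordinate sets into a single 4-component
 * attribute laid out as (u0, v0, u1, v1) per vertex. */
void pack_texcoords(const float *uv0, const float *uv1, float *packed,
                    size_t vertex_count)
{
    assert(uv0 != NULL && uv1 != NULL && packed != NULL);

    for (size_t i = 0; i < vertex_count; ++i) {
        packed[4 * i + 0] = uv0[2 * i + 0];
        packed[4 * i + 1] = uv0[2 * i + 1];
        packed[4 * i + 2] = uv1[2 * i + 0];
        packed[4 * i + 3] = uv1[2 * i + 1];
    }
}
```

In the vertex shader, the two sets would then be read back with swizzles such as .xy and .zw on the single vec4 attribute.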

Choose an appropriate vertex attribute layout (G4)

There are two commonly used ways of storing vertex attributes:

Array of structures
Structure of arrays

An array of structures stores the attributes for a given vertex sequentially with an appropriate offset for each attribute and a non-zero stride. The stride is computed from the number of attribute components and their sizes. An array of structures is the preferred way of storing vertex attributes due to more efficient memory access. If the vertex attributes are constant (not updated in the render loop) there is no question that an array of structures is the preferred layout.

In contrast, a structure of arrays stores the vertex attributes in separate buffers using the same offset for each attribute and a stride of zero. This layout forces the GPU to jump around and fetch from different memory locations as it assembles the needed attributes for each vertex. The structure of arrays layout is therefore less efficient than an array of structures in most cases. The only time to consider a structure of arrays layout is when one or more attributes must be updated dynamically, because strided writes into an array of structures can be expensive relative to the number of bytes modified. In this scenario, the recommendation is to partition the attributes such that constant and dynamic attributes can be read and written sequentially, respectively. The attributes that remain constant should be stored in an array of structures. The attributes that are updated dynamically should be stored in smaller separate buffer objects (or perhaps just a single buffer if the attributes are updated with the same frequency).
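
For the array of structures case, the offset and stride of each attribute can be derived directly from a C struct. The sketch below uses a hypothetical vertex format (SHORT positions and normals, BYTE colors, SHORT texture coordinates); the type and function names are illustrative only. The resulting component count, stride and offset correspond to the size, stride and pointer arguments of VertexAttribPointer when a buffer object is bound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interleaved vertex: the constant attributes of one vertex
 * are stored next to each other in memory. */
typedef struct {
    int16_t  position[3];
    int16_t  normal[3];
    uint8_t  color[4];
    uint16_t texcoord[2];
} StaticVertex;

/* What the application needs to know to describe one attribute. */
typedef struct {
    int    components;  /* number of components in the attribute */
    size_t offset;      /* byte offset of the attribute inside a vertex */
    size_t stride;      /* byte distance between consecutive vertices */
} AttribLayout;

enum { ATTR_POSITION, ATTR_NORMAL, ATTR_COLOR, ATTR_TEXCOORD, ATTR_COUNT };

/* Fills `out` with the layout of every attribute in StaticVertex. */
void describe_static_layout(AttribLayout out[ATTR_COUNT])
{
    const size_t stride = sizeof(StaticVertex);

    assert(out != NULL);

    out[ATTR_POSITION] = (AttribLayout){3, offsetof(StaticVertex, position), stride};
    out[ATTR_NORMAL]   = (AttribLayout){3, offsetof(StaticVertex, normal),   stride};
    out[ATTR_COLOR]    = (AttribLayout){4, offsetof(StaticVertex, color),    stride};
    out[ATTR_TEXCOORD] = (AttribLayout){2, offsetof(StaticVertex, texcoord), stride};
}
```

Using offsetof and sizeof rather than hand-computed numbers keeps the attribute description in step with the struct if the format changes. A dynamically updated attribute would be left out of this struct and kept in its own tightly packed buffer.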

Use consistent winding (G5)

The geometry winding (clockwise or counter-clockwise) should be determined up front and defined in code. The winding that GL treats as front-facing, and therefore which faces are culled, can be changed with the FrontFace function, but switching back and forth between windings for different geometry batches during rendering is not optimal for performance and can be avoided in most cases.

Always use vertex and index buffer objects (G6)

Recall that the vertices for geometry can be sourced either from application memory every time they are rendered or from buffers in graphics memory where they have been stored previously. The same applies to vertex array indices. To achieve good performance, never continuously source the data from application memory with DrawArrays. Always use buffer objects to store both geometry and indices. Check that no code is calling DrawArrays, and that no code is calling DrawElements without a buffer bind.

The default buffer usage flag when allocating buffer objects is STATIC_DRAW. In many cases this will lead to the fastest access.

Note:

STATIC_DRAW does not mean one can never write to the buffer (although writing to a buffer should be avoided as much as possible). STATIC_DRAW may in fact be the appropriate usage flag even if the buffer contents are updated every few frames. Changing the usage flag to one of the alternatives (DYNAMIC_DRAW or STREAM_DRAW) should be considered only after careful benchmarking has produced conclusive results.

Batch geometry into fewer buffers and draw calls (G7)

There are only so many draw calls, or batches of geometry, that can be submitted to GL before the application becomes CPU bound. Each draw call has an overhead that is more or less fixed. Therefore, it is very important to increase the sizes of batches whenever possible. There does not need to be a one-to-one correspondence between a draw call and a buffer: a large vertex buffer can store geometry with a similar layout for multiple models. One or more index buffers can be used to select the subset of vertices needed from the vertex buffer. A common mistake is to have too many small buffers, leading to too many draw calls and thus high CPU load. If the number of draw calls issued in any given frame goes into many hundreds or thousands, then it is time to consider combining similar geometry in fewer buffers and using appropriate offsets when defining the attribute data and creating the index buffer.

Unconnected geometry can be stitched together with degenerate triangles (alternatively, by using extension NV_primitive_restart when available). Degenerate triangles are triangles where two or more vertices are coincident, leading to a null surface. These are trivially rejected and ignored by the GPU. The benefit of stitching geometry together with degenerate triangles, such that fewer and larger buffers are needed, tends to outweigh the minor overhead of sending degenerate triangles down the pipeline. If geometry batches are being broken up to bind different textures, then look at combining several images into fewer textures (T5).
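
Stitching is usually applied to indexed triangle strips. The following sketch joins two strips by repeating the last index of the first strip and the first index of the second, with one extra repeat when the first strip has an odd number of indices so that the second strip keeps its winding. The function name is hypothetical and 16-bit indices are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes strip `a` followed by strip `b` into `out`, joined by degenerate
 * triangles, and returns the number of indices written. `out` must have
 * room for a_count + b_count + 3 indices. */
size_t stitch_strips(const uint16_t *a, size_t a_count,
                     const uint16_t *b, size_t b_count,
                     uint16_t *out)
{
    size_t n = 0;

    assert(a_count >= 3 && b_count >= 3);

    for (size_t i = 0; i < a_count; ++i) {
        out[n++] = a[i];
    }

    /* Repeat the last index of `a` to start the degenerate bridge. */
    out[n++] = a[a_count - 1];

    /* With an odd-length first strip, one more repeat is needed so that
     * `b` starts at an even position and its winding is not flipped. */
    if (a_count % 2 != 0) {
        out[n++] = a[a_count - 1];
    }

    /* Repeat the first index of `b` to finish the bridge. */
    out[n++] = b[0];

    for (size_t i = 0; i < b_count; ++i) {
        out[n++] = b[i];
    }
    return n;
}
```

Joining the strips 0, 1, 2, 3 and 4, 5, 6, 7 this way yields 0, 1, 2, 3, 3, 4, 4, 5, 6, 7. Every triangle that spans the bridge contains a repeated index and therefore has no area.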

Use the smallest possible data type for indices (G8)

When the geometry uses relatively few vertices, an index buffer should specify vertices using only UNSIGNED_BYTE instead of UNSIGNED_SHORT (or an even larger integer type if the ES2 implementation supports it). Count the number of unique vertices per buffer and choose the right data type. When geometry for several unrelated models is batched into fewer buffer objects (G7), a larger data type for the indices may be required. This is not a concern compared to the larger performance benefits of batching.
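
The choice of index type follows directly from the number of unique vertices in the buffer. The sketch below is one way to express it; the enum and function names are hypothetical, and the enum values are simply the size of one index in bytes rather than GL constants. On ES2, 32-bit indices are only usable when the implementation exposes them (for example through the OES_element_index_uint extension).

```c
#include <assert.h>
#include <stddef.h>

/* Each value is the size of one index in bytes. */
typedef enum {
    INDEX_U8  = 1,  /* would map to UNSIGNED_BYTE */
    INDEX_U16 = 2,  /* would map to UNSIGNED_SHORT */
    INDEX_U32 = 4   /* needs 32-bit index support */
} IndexType;

/* Returns the smallest index type able to address every unique vertex. */
IndexType smallest_index_type(size_t unique_vertex_count)
{
    assert(unique_vertex_count > 0);

    if (unique_vertex_count <= 256u) {
        return INDEX_U8;    /* indices 0..255 */
    }
    if (unique_vertex_count <= 65536u) {
        return INDEX_U16;   /* indices 0..65535 */
    }
    return INDEX_U32;
}

/* Size in bytes of an index buffer holding `index_count` indices. */
size_t index_buffer_bytes(size_t index_count, size_t unique_vertex_count)
{
    return index_count * (size_t)smallest_index_type(unique_vertex_count);
}
```

A cube with 24 unique vertices and 36 indices, for example, fits in a 36-byte index buffer with byte indices instead of 72 bytes with shorts.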

Avoid allocating new buffers in the rendering loop (G9)

If the application frequently updates the geometry, then allocate a set of sufficiently large buffers when the application initializes. A BufferData call with a NULL data pointer will reserve the amount of memory you specify. This eliminates the time spent waiting for an allocation to complete in the rendering loop. Reusing pre-allocated buffers also helps to reduce memory fragmentation.

Note:

Writing to a buffer object that is being used by the GPU can introduce bubbles in the pipeline where no useful work is being done. To avoid reducing throughput when updating buffers, consider cycling between multiple buffers to minimize the possibility of updating the buffer from which content is currently being rendered.
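
Cycling between buffers can be as simple as a round-robin selector over buffer names that were created and sized once at startup. The sketch below shows only the bookkeeping; the ring size of three and all names are assumptions for illustration, and the stored values stand in for buffer object names obtained from GenBuffers.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 3  /* assumed; tune after benchmarking */

typedef struct {
    unsigned int buffers[RING_SIZE];  /* pre-allocated buffer object names */
    size_t       next;                /* slot handed out by the next call */
} BufferRing;

/* Stores the pre-allocated buffer names and starts at the first slot. */
void buffer_ring_init(BufferRing *ring, const unsigned int names[RING_SIZE])
{
    assert(ring != NULL && names != NULL);

    for (size_t i = 0; i < RING_SIZE; ++i) {
        ring->buffers[i] = names[i];
    }
    ring->next = 0;
}

/* Returns the buffer to write this frame and advances the ring, so the
 * same buffer is not written again until RING_SIZE updates later. */
unsigned int buffer_ring_acquire(BufferRing *ring)
{
    assert(ring != NULL);

    unsigned int name = ring->buffers[ring->next];
    ring->next = (ring->next + 1) % RING_SIZE;
    return name;
}
```

Each update would call buffer_ring_acquire, write the new data into the returned buffer, and draw from it, which leaves the buffers used by recent frames untouched.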

Cull early and often (G10)

The GPU will not rasterize a primitive when all of its vertices fall outside the viewport. It also avoids processing hidden fragments when the depth test or stencil test fails (P4). However, this does not mean that the GPU should do all the work in deciding what is visible. Vertices need to be loaded, assembled and processed in the vertex shader before the GPU can decide whether to cull or clip parts of the geometry. Testing a single, simple bounding volume that encloses the geometry against the current view frustum on the CPU side is a lot faster than testing hundreds of thousands of vertices on the GPU. If an application is limited by vertex processing, this is definitely the place to begin optimizing. Spheres are the most efficient volumes to test against and the volume of choice if the geometry is rotationally symmetrical. For some geometry, spheres tend to lead to overly conservative visibility acceptance. A rectangular cuboid (box) is only slightly more expensive to test but can be made to fit more tightly around most geometry. Hierarchical culling can often be employed to reduce the number of tests necessary on the CPU. Efficient and advanced culling algorithms have been heavily researched and published for many years. You can find a good introduction in the survey at the locations below.
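
A bounding sphere test against the view frustum can be sketched as follows. The example assumes the frustum is given as six planes with unit-length normals pointing into the frustum, so that a point p is inside a plane when n . p + d >= 0; the types and the function name are hypothetical, and extracting the planes from the projection matrix is not shown.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;

/* Plane n . p + d = 0, with unit-length n pointing into the frustum. */
typedef struct { Vec3 n; float d; } Plane;

typedef struct { Vec3 center; float radius; } Sphere;

/* Returns 1 when the sphere lies completely behind at least one plane and
 * can be skipped, 0 when it may be visible and should be drawn. */
int sphere_outside_frustum(const Sphere *s, const Plane planes[6])
{
    assert(s != NULL && s->radius >= 0.0f);

    for (size_t i = 0; i < 6; ++i) {
        /* Signed distance from the sphere center to the plane. */
        float dist = planes[i].n.x * s->center.x
                   + planes[i].n.y * s->center.y
                   + planes[i].n.z * s->center.z
                   + planes[i].d;
        if (dist < -s->radius) {
            return 1;
        }
    }
    return 0;
}
```

The test is conservative: a sphere near a frustum corner can pass all six plane checks while still being outside, which costs an unnecessary draw call but never hides visible geometry.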