NVIDIA recently unveiled its latest GPU architecture, called Turing. Turing plays an important role in the creation process of a 3D configurator. Although the headlining feature is hardware-accelerated raytracing, Turing also includes several other developments that look quite fascinating in their own right.
One of them is the new concept of mesh shaders, the details of which were released a few weeks ago – and the graphics programming community was abuzz, with lots of enthusiastic discussion on Twitter and elsewhere. What are mesh shaders (and task shaders), why are graphics programmers so enthusiastic about them, and what can we do with them?
The GPU geometry pipeline has become overloaded.
Submitting triangle geometry to the GPU to be drawn follows a simple underlying paradigm: you put your vertices in a buffer, point the GPU at it, and issue a draw call telling it how many primitives to render. The vertices are slurped linearly out of the buffer, each is processed by a vertex shader, and the triangles are rasterized and shaded.
But over the decades of GPU development, several additional features have been added to this basic pipeline in the name of higher performance and efficiency. Indexed triangles and vertex caches were introduced to take advantage of vertex reuse. Complex descriptions of the vertex stream format are needed to prepare data for shading. Instancing, and later multi-draw, made it possible to combine certain sets of draw calls. Indirect draws can be generated on the GPU itself. Then came additional shader stages: geometry shaders, to enable programmable operations on primitives and even insert and delete primitives on the fly, and then tessellation shaders, which let you submit a low-res mesh and dynamically subdivide it to a programmable level.
While these and other features were all added for good reasons (or at least for what were considered good reasons at the time), their combination has become unwieldy. Which subset of the many available options do you reach for in a particular situation? Will your choice be efficient across all the GPU architectures your software needs to run on?
Moreover, this complex pipeline is still not as flexible as we would sometimes like it to be – or, where it is flexible, it is not performant. Instancing can only draw copies of a single mesh at a time. Multi-draw can still be inefficient for a large number of small draws. The programming model of geometry shaders does not lend itself to an efficient implementation on the wide SIMD cores of GPUs, and buffering their inputs and outputs is also awkward. Hardware tessellation, although very practical for certain things, is often difficult to work with, because the granularity at which you can set tessellation factors is limited, the built-in tessellation modes are few, and it has performance problems on some GPU architectures.
Simplicity is worth its weight in gold.
Mesh shaders represent a radical simplification of the geometry pipeline. When a mesh shader is activated, all the shader stages and fixed-function features described above are swept away. Instead, we get a clean, straightforward pipeline with a compute-shader-like programming model. Importantly, this new pipeline is both flexible enough to handle the existing geometry tasks of a typical game and capable of enabling new techniques that are difficult to do on the GPU today – and it looks like it should be very performance-friendly, with no apparent architectural barriers to efficient GPU execution.
Like a compute shader, a mesh shader defines workgroups of parallel threads, which can communicate via on-chip shared memory and wave intrinsics. Instead of a draw call, the app launches a number of mesh shader workgroups. Each workgroup is responsible for creating a small, self-contained chunk of geometry called a “meshlet”, expressed as arrays of vertex attributes and corresponding indices. These meshlets are then handed directly to the rasterizer.
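To make this a bit more concrete, here is a minimal sketch of what a meshlet might look like on the CPU side, using the size limits mentioned later in this post (64 vertices, 126 triangles). The struct name and layout are ours, purely for illustration; only the Vulkan entry point `vkCmdDrawMeshTasksNV` from the VK_NV_mesh_shader extension is real.

```cpp
// Hypothetical CPU-side description of a meshlet, mirroring the limits
// NVIDIA recommends for Turing (64 vertices, 126 triangles per meshlet).
#include <cstdint>
#include <vector>

constexpr uint32_t kMaxMeshletVertices  = 64;
constexpr uint32_t kMaxMeshletTriangles = 126;

struct Meshlet {
    // Indices into the original vertex buffer for up to 64 unique vertices.
    uint32_t vertexIndices[kMaxMeshletVertices];
    // Triangle list, stored as local indices (0..63) into vertexIndices.
    uint8_t  triangleIndices[kMaxMeshletTriangles * 3];
    uint32_t vertexCount   = 0;
    uint32_t triangleCount = 0;
};

// With the VK_NV_mesh_shader extension, a "draw" is just a launch of mesh
// shader workgroups -- for example, one workgroup per meshlet:
//   vkCmdDrawMeshTasksNV(commandBuffer, meshletCount, /*firstTask=*/0);
```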
The attractive thing about this model is how data-driven and freeform it is. The mesh shader pipeline makes very few assumptions about the shape of your data or the kind of things you are doing. Everything is in the programmer’s hands: you can pull the vertex and index data from buffers, generate them algorithmically, or combine these approaches however you like.
At the same time, the mesh shader model sidesteps the problems that hampered geometry shaders by explicitly embracing SIMD execution (in the form of the compute-style “workgroup” abstraction). Instead of each shader thread generating its own geometry – which leads to divergence and variable output data sizes – the entire workgroup outputs a meshlet together. This means we can use compute-style tricks, such as first doing some work in parallel across the vertices, then placing a barrier, and then working in parallel across the triangles. It also means that the input/output bandwidth requirements are much more reasonable. And since meshlets are indexed triangle lists, they don’t break vertex reuse the way geometry shaders often do.
An upgrade path.
The other really nice thing about mesh shaders is that they don’t require you to drastically revise how your game engine handles geometry in order to use them. It looks like it should be pretty easy to convert the most common geometry types to mesh shaders, making them an accessible upgrade path for developers.
(You don’t have to convert everything to mesh shaders at once, either. It is possible to switch between the old geometry pipeline and the new mesh-shader-based one at different points in the frame.)
Suppose you have an ordinary authored mesh that you want to load and render. You need to split it into meshlets, which have a maximum static size declared in the shader – NVIDIA recommends 64 vertices and 126 triangles as defaults.
How do we do that?
Fortunately, most game engines currently perform some form of vertex cache optimization, which already organizes primitives by locality – triangles that share one or two vertices will tend to be close together in the index buffer. So a pretty workable strategy for creating meshlets is to simply scan the index buffer linearly, collecting the set of vertices used, until you reach either 64 vertices or 126 triangles; then reset and repeat until you have gone through the entire mesh (a sketch of this follows below). This could be done at art build time, or it is probably fast enough to do at level load time in the engine.
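Here is a rough sketch of that linear-scan strategy, reusing the hypothetical `Meshlet` struct from the earlier snippet. It is not a tuned implementation, just the greedy loop described above.

```cpp
// Sketch of the linear-scan meshlet builder described above: walk the
// (already cache-optimized) index buffer and start a new meshlet whenever
// the 64-vertex or 126-triangle limit would be exceeded.
#include <unordered_map>

std::vector<Meshlet> BuildMeshlets(const std::vector<uint32_t>& indices)
{
    std::vector<Meshlet> meshlets;
    Meshlet current{};
    std::unordered_map<uint32_t, uint8_t> localIndex;  // global index -> local slot

    auto flush = [&]() {
        if (current.triangleCount > 0)
            meshlets.push_back(current);
        current = Meshlet{};
        localIndex.clear();
    };

    for (size_t i = 0; i + 2 < indices.size(); i += 3) {
        uint32_t tri[3] = { indices[i], indices[i + 1], indices[i + 2] };

        // How many new unique vertices would this triangle add?
        uint32_t newVerts = 0;
        for (uint32_t v : tri)
            if (localIndex.find(v) == localIndex.end())
                ++newVerts;

        // Start a new meshlet if either limit would be exceeded.
        if (current.vertexCount + newVerts > kMaxMeshletVertices ||
            current.triangleCount + 1 > kMaxMeshletTriangles)
            flush();

        // Add the triangle, registering any vertices not seen yet.
        for (uint32_t k = 0; k < 3; ++k) {
            auto it = localIndex.find(tri[k]);
            if (it == localIndex.end()) {
                uint8_t slot = static_cast<uint8_t>(current.vertexCount);
                current.vertexIndices[current.vertexCount++] = tri[k];
                it = localIndex.emplace(tri[k], slot).first;
            }
            current.triangleIndices[current.triangleCount * 3 + k] = it->second;
        }
        ++current.triangleCount;
    }
    flush();
    return meshlets;
}
```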
Alternatively, vertex cache optimization algorithms could probably be modified to produce meshlets directly. For GPUs without mesh shader support, you can merge all the meshlet vertex buffers and quickly build a traditional index buffer by offsetting and concatenating all the meshlet index buffers. It’s pretty easy to go back and forth.
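The fallback path might look something like the following, again assuming the illustrative `Meshlet` layout from above, with the meshlet vertex data packed into one big vertex buffer in meshlet order so each meshlet starts at a known base vertex.

```cpp
// Fallback for GPUs without mesh shader support: rebuild a traditional
// index buffer by offsetting and concatenating the meshlet index lists.
std::vector<uint32_t> BuildFallbackIndexBuffer(const std::vector<Meshlet>& meshlets)
{
    std::vector<uint32_t> indices;
    uint32_t baseVertex = 0;
    for (const Meshlet& m : meshlets) {
        for (uint32_t t = 0; t < m.triangleCount * 3; ++t)
            indices.push_back(baseVertex + m.triangleIndices[t]);
        baseVertex += m.vertexCount;  // next meshlet's vertices follow in the packed buffer
    }
    return indices;
}
```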
In both cases, the mesh shader would usually just act like a vertex shader, with a bit of extra code to fetch vertex and index data from its buffers and copy it to the mesh outputs.
What about other types of geometry found in games?
Instanced draws are simple: multiply the number of meshlet workgroups and use some shader logic to fetch the instance parameters. A more interesting case is multi-draw, where we want to draw many meshes that are not all copies of the same thing. For this we can use task shaders – a secondary feature of the mesh shader pipeline. Task shaders add an additional layer of compute-style workgroups that run before the mesh shader and control how many mesh shader workgroups are launched. They can also write output variables to be consumed by the mesh shaders. A very efficient multi-draw should be possible by launching task shaders with one thread per draw, which in turn launch the mesh shaders for all the draws (see the sketch below).
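On the CPU side, such a task-shader-driven multi-draw could boil down to a single launch. The sketch below is only illustrative: the workgroup size of 32 and the function name are our assumptions, and the per-draw parameter lookup and mesh shader launches would happen in the (not shown) task shader itself.

```cpp
// Hypothetical CPU side of a task-shader-driven multi-draw: one task shader
// thread per sub-draw. Requires the VK_NV_mesh_shader extension; in practice
// vkCmdDrawMeshTasksNV is loaded via vkGetDeviceProcAddr.
#include <vulkan/vulkan.h>

void RecordMultiDraw(VkCommandBuffer cmd, uint32_t drawCount)
{
    const uint32_t kTaskWorkgroupSize = 32;  // assumed task shader local size
    const uint32_t taskCount =
        (drawCount + kTaskWorkgroupSize - 1) / kTaskWorkgroupSize;

    // Pipeline and descriptor sets (including the per-draw parameter buffer)
    // are assumed to be bound already.
    vkCmdDrawMeshTasksNV(cmd, taskCount, /*firstTask=*/0);
}
```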
If we need to draw a lot of very small meshes, like quads for particle/imposter/text/point-based rendering, or boxes for occlusion tests, projected decals and the like, we can pack a bunch of them into each mesh shader workgroup. The geometry can be generated entirely in the shader instead of relying on a pre-initialized index buffer from the CPU. This is one of the things people originally hoped to do with geometry shaders (e.g. by submitting point primitives and expanding them into quads in the GS). There is also a lot of flexibility for things with variable topology, like particle beams/strips/ribbons, that would otherwise have to be generated either on the CPU or in a separate compute pass.
(Incidentally, the other application people originally hoped to handle with geometry shaders was multi-view rendering: drawing the same geometry on multiple faces of a cubemap, or on several slices of a cascaded shadow map, within a single draw call. You could do this with mesh shaders as well – but Turing actually has a separate hardware multi-view capability for these applications.)
What about tessellated meshes?
The two-layer structure of task and mesh shaders is quite similar to that of tessellation hull and domain shaders. Mesh shaders don’t seem to have any access to the fixed-function tessellator unit, but it’s not too hard to imagine writing code in task/mesh shaders that reproduces the tessellation functionality. Working out the details would be a small research project (perhaps someone has already worked on it), and the performance would be an open question. However, we would gain the ability to change how tessellation works, instead of sticking with what Microsoft decided in the late 2000s.
New possibilities.
It’s great that mesh shaders can subsume our current geometry tasks and, in some cases, make them more efficient. But mesh shaders also open up possibilities for novel geometry processing that would not have been feasible on the GPU before, or would have required expensive compute pre-passes that write data out to memory and then read it back in through the traditional geometry pipeline.
Since our geometry is already organized into meshlets, we can do finer-grained culling at the per-meshlet level, and even at the per-triangle level within each meshlet. With task shaders, we should be able to perform mesh LOD selection on the GPU, and if we are feeling ambitious, we might even try dynamically packing together very small draws (from coarse LODs) to achieve better meshlet utilization.
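As one possible shape for that per-meshlet culling, a task shader could test some precomputed data before launching the mesh shader workgroup for each meshlet. A bounding sphere plus a normal cone is a common choice; the struct below is purely illustrative and not a fixed format.

```cpp
// Sketch of per-meshlet culling data a task shader could test before
// launching the corresponding mesh shader workgroup. The bounding sphere
// supports frustum culling; the normal cone allows backface-culling an
// entire meshlet when all of its triangles face away from the camera.
struct MeshletCullData {
    float boundingSphere[4];  // xyz = center, w = radius (object space)
    float normalCone[4];      // xyz = average normal, w = cos(cone half-angle)
};
```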
Instead of tile-based forward lighting, or as an extension to it, it might be useful to cull lights per meshlet, provided there is a good way to pass a variable-size light list from the mesh shader to the fragment shader.
Access to the topology in the mesh shader should allow us to compute dynamic normals, tangents, and curvatures for a mesh that deforms due to complex skinning, displacement mapping, or procedural vertex animation. We could also perform voxel meshing or isosurface extraction – marching cubes or marching tetrahedra, including generating the normals and so on for the isosurface – directly in a mesh shader, to render fluids and volumetric data.
Geometry for hair/fur, foliage, or other surface cover can be generated on the fly, with view-dependent detail.
3D modeling and CAD applications could use mesh shaders to dynamically triangulate quad or n-gon meshes, as well as for things like dynamically inserted geometry for visualization purposes.
For rendering displaced terrain, water, and the like, we could use mesh shaders to implement geometry clipmaps and geomorphing. They could also be interesting for progressive mesh schemes.
And last but not least, we might be able to render Catmull-Clark subdivision surfaces, or other subdivision schemes, more easily and efficiently than is possible on the GPU today.
Whether all of these things will work out well with the new mesh and task shader pipeline remains to be seen; there will certainly be algorithmic difficulties and architectural obstacles that come to light once graphics programmers get the chance to dig in. Still, we are very excited to see what people will do with this capability over the coming years, and we hope and expect that it will not remain an NVIDIA exclusive for too long.
That concludes our post on mesh shaders. If you have any questions or comments, please feel free to contact our experts in our forum.
Thank you very much for your visit.