With Unity 2017.2, stereo instancing support for XR devices on DX11 was released, giving developers access to even more performance optimizations for the HTC Vive, Oculus Rift, and the brand new Windows Mixed Reality headsets. In this article, we'll take the opportunity to tell you more about the evolution of XR rendering and how you can take advantage of it.
Short history.
One of the unique and most obvious aspects of XR rendering is the need to generate two views, one per eye. We need these two views to create the stereoscopic 3D effect for the viewer. But before we delve deeper into how we could present two viewpoints, let’s take a look at the classic Single Viewpoint case.
In a traditional rendering environment, we render our scene from a single view. We take our objects and transform them into a space suitable for rendering. We do this by applying a series of transformations to our objects where we bring them from a locally defined space into a space that we can draw on our screen.
The classic transformation pipeline starts with objects in their own local/object space. We then transform the objects with our Model (or World) matrix to bring them into world space, a common space for the initial, relative placement of objects. Next, we transform our objects from world space into view space with our View matrix, so that they are arranged relative to our point of view. Once we are in view space, we can project the objects with our Projection matrix into clip space. The perspective divide then takes us into NDC space, and finally the viewport transform is applied to bring us into screen space. Once we are in screen space, we can generate fragments for our render target. For the purposes of our discussion, we will render to a single render target only.
This series of transformations is sometimes referred to as the graphics transformation pipeline, and it is the classic way to render a single view.
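To make the stages concrete, here is a minimal HLSL vertex shader sketch of that pipeline. The matrix names are illustrative placeholders, not Unity built-ins (the engine provides its own equivalents for the object-to-world, view, and projection matrices):

// Minimal sketch of the graphics transformation pipeline in an HLSL vertex shader.
// The matrix names here are illustrative placeholders, not Unity built-ins.
cbuffer PerDrawTransforms
{
    float4x4 ObjectToWorld; // model/world matrix: local space -> world space
    float4x4 WorldToView;   // view matrix:        world space -> view space
    float4x4 ViewToClip;    // projection matrix:  view space  -> clip space
};

float4 VertMain(float3 positionLocal : POSITION) : SV_POSITION
{
    float4 positionWorld = mul(ObjectToWorld, float4(positionLocal, 1.0));
    float4 positionView  = mul(WorldToView, positionWorld);
    float4 positionClip  = mul(ViewToClip, positionView);
    // The GPU then performs the perspective divide (clip space -> NDC)
    // and the viewport transform (NDC -> screen space) before rasterization.
    return positionClip;
}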
Besides modern XR rendering, there have long been scenarios where we wanted to render simultaneous viewpoints. Maybe we had split-screen rendering for local multiplayer. We might have had a separate mini viewport used for an in-game map or a security camera. These alternative views can share scene data, but they often share little else besides the final render target.
At a minimum, each view has its own view and projection matrices. To assemble the final render target, we also need to manipulate other properties of the graphics transformation pipeline. In the early days, when we only had a single render target, we could use viewports to dictate the sub-region of the screen we rendered into. As GPUs and their APIs evolved, we became able to render into separate render targets and then composite them manually.
Enter XR.
Modern XR devices introduced the requirement to drive two views, one per eye, to create the stereoscopic 3D effect that provides a sense of depth to the headset wearer. While the two eyes view the same scene from similar vantage points, each view has its own set of view and projection matrices.
Before we continue, let's briefly define some terminology. These are not necessarily standard terms in the industry, as rendering engineers tend to use different terms and definitions across engines and use cases. Treat these terms as a local convenience.
Scene graph – A scene graph is a data structure that organizes the information needed to render our scene and is consumed by the renderer. It can refer either to the scene as a whole or to just the visible portion, which we will call the culled scene graph.
Render loop/pipeline – The render loop refers to the logical architecture of how we assemble the rendered frame. A high-level example of a render loop might look like this:
Culling > Shadows > Opaque > Transparent > Post Processing > Present
We go through these steps every frame to produce the image that is presented on the display. We also use the term render pipeline at Unity because it refers to some upcoming rendering features that offer control over how the frame is assembled (e.g. the Scriptable Render Pipeline). Render pipeline can be confused with other terms, such as the graphics pipeline, which refers to the GPU pipeline that processes drawing commands.
With these definitions, we can return to VR rendering.
Multi-camera.
To display the view for each eye, the simplest method is to run the render loop twice. Each eye will configure and run its own iteration of the render loop. At the end we will have two images that we can send to the display device. The underlying implementation uses two Unity cameras, one for each eye, and they go through the process of generating the stereo images. This was the first method of XR support in Unity and is still offered by third-party headset plug-ins.
Although this method certainly works, multi-camera rendering relies on brute force and is the least efficient in terms of both CPU and GPU. The CPU must iterate through the entire render loop twice, and the GPU is likely unable to take advantage of any caching for objects that are drawn twice, once per eye.
Multi-pass.
Multi-Pass was Unity’s first attempt to optimize the XR render loop. The core idea was to extract parts of the render loop that were view-independent. This means that any work that does not explicitly rely on XR views does not have to be done per eye.
The most obvious candidate for this optimization is shadow rendering. Shadows are not explicitly dependent on the viewer's position. Unity actually implements shadows in two steps: generate cascaded shadow maps, and then map the shadows into screen space. For multi-pass, we can generate a single set of cascaded shadow maps and then generate two screen-space shadow maps, since the screen-space shadow maps depend on the viewer's location. The two screen-space shadow maps are able to share the cascaded shadow maps because of how our shadow generation is architected, since the shadow map generation loop is relatively tightly coupled. This compares favorably with the rest of the render workload, which requires a full iteration over the render loop before returning to a similar stage (e.g., the eye-specific opaque passes are separated by the remaining render loop stages).
The other step that can be shared between the two eyes may not be obvious at first: we can perform a single cull for both eyes. In our first implementation, we used frustum culling to generate two lists of objects, one per eye. However, we can instead create a unified culling frustum that is shared between our two eyes. This means each eye renders a little more than it strictly needs, but we considered the benefits of a single cull to outweigh the cost of some extra vertex shading, clipping, and rasterization.
Multi-Pass gave us some nice savings over the multi-camera, but there is more to do.
Single pass.
Single-pass stereo rendering means that we make a single traversal of the entire render loop, instead of traversing it, or certain sections of it, twice.
To issue both eye draws back to back, we need to make sure that all the constant data is bound, along with an index, for each draw.
What about the draws themselves? How do we issue each draw? In multi-pass, each eye has its own render target, but we can't do that for single-pass because the cost of switching render targets for consecutive draw calls would be prohibitive. A similar option would be to use render target arrays, but on most platforms we would have to export the slice index from the geometry shader, which can be expensive on the GPU and invasive for existing shaders.
The solution we settled on was to use a double-wide render target and switch the viewport between draw calls, so that each eye renders to its half of the double-wide render target. Switching viewports has a cost, but it is lower than switching render targets and less invasive than using the geometry shader (although double-wide poses its own challenges, especially for post-processing). There is also the related option of using viewport arrays, but they have the same problem as render target arrays, since the index can only be exported from a geometry shader. There is another technique that uses dynamic clipping, but we won't explore it here.
Now that we have a solution for issuing two consecutive draws, one per eye, we need to configure our supporting infrastructure. In multi-pass, because each pass was similar to monoscopic rendering, we could use our existing view and projection matrix infrastructure; all we had to do was replace the view and projection matrices with those of the current eye. With single-pass, however, we don't want to unnecessarily toggle constant buffer bindings. Instead, we bind the view and projection matrices of both eyes together and index into them with unity_StereoEyeIndex, which we can flip between draws. This allows our shader infrastructure to choose, within the shader pass, which set of view and projection matrices to render with.
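As a hedged sketch of that setup (the buffer and variable names below are placeholders; Unity's actual built-ins, such as unity_StereoEyeIndex and its stereo matrix arrays, are declared in the engine's shader includes), the vertex shader simply indexes a two-element matrix array with the eye index:

// Illustrative sketch: both eyes' matrices bound at once, selected per draw.
// Names are placeholders; Unity exposes comparable built-ins.
cbuffer StereoMatrices
{
    float4x4 StereoViewProj[2]; // [0] = left eye, [1] = right eye
};

cbuffer StereoEyeIndex
{
    uint EyeIndex; // flipped between the two consecutive eye draws
};

// Input is assumed to already be in world space, to keep the sketch short.
float4 VertMain(float4 positionWorld : POSITION) : SV_POSITION
{
    // Pick the current eye's view-projection matrix without rebinding buffers.
    return mul(StereoViewProj[EyeIndex], positionWorld);
}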
An additional detail: to minimize the viewport and unity_StereoEyeIndex state changes, we can change our eye draw pattern. Instead of alternating left, right, left, right, and so on, we can draw with a left, right, right, left, left, and so on cadence. This allows us to halve the number of state updates compared to the alternating cadence.
This is not exactly twice as fast as multi-pass. That's because culling and shadows were already optimized to be shared, and because we are still issuing one draw per eye and switching viewports, which incurs some CPU and GPU cost.
Stereo Instancing (Single Pass Instanced).
Previously, we mentioned the possibility of using a render target array. Render target arrays are a natural solution for stereo rendering: the eye textures share format and size, which qualifies them for use in a render target array. But having to use the geometry shader to export the array slice is a big drawback. What we really want is the ability to export the render target array index from the vertex shader, for easier integration and better performance.
The ability to export the Render Target Array Index from the Vertex Shader actually exists with some GPUs and APIs and is becoming more common. On DX11, this functionality is provided as a feature option VPAndRTArrayIndexFromAnyShaderFeedingRasterizer.
Now that we can determine which slice of our render target array we will render to, how do we select that slice? We leverage the existing single-pass double-wide infrastructure: we can use unity_StereoEyeIndex to populate the SV_RenderTargetArrayIndex semantic in the shader. On the API side, we no longer need to switch the viewport, because the same viewport can be used for both slices of the render target array. And we have already configured our matrices so that they can be indexed from the vertex shader.
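A minimal sketch of this path, assuming hardware that allows the vertex shader to write SV_RenderTargetArrayIndex; the buffer and variable names are again placeholders standing in for unity_StereoEyeIndex and the stereo matrix pair:

// Sketch: routing a draw to one slice of a render target array from the vertex
// shader. Requires VPAndRTArrayIndexFromAnyShaderFeedingRasterizer-style support.
cbuffer StereoMatrices
{
    float4x4 StereoViewProj[2]; // [0] = left eye, [1] = right eye
};

cbuffer StereoEyeIndex
{
    uint EyeIndex; // plays the role of unity_StereoEyeIndex in this sketch
};

struct VertOutput
{
    float4 positionClip : SV_POSITION;
    uint   slice        : SV_RenderTargetArrayIndex; // selects the eye's slice
};

// Input is assumed to already be in world space, to keep the sketch short.
VertOutput VertMain(float4 positionWorld : POSITION)
{
    VertOutput o;
    o.positionClip = mul(StereoViewProj[EyeIndex], positionWorld);
    o.slice        = EyeIndex;
    return o;
}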
Although we could still use the existing technique of issuing two draws and toggling the unity_StereoEyeIndex value in the constant buffer before each draw, there is a more efficient technique. We can use GPU instancing to issue a single draw call and allow the GPU to multiplex our draw across both eyes. We simply double the existing instance count of a draw (if there is no instancing in use, we set the instance count to 2). Then, in the vertex shader, we decode the instance ID to determine which eye we are rendering to.
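Building on the previous sketch, here is a hedged illustration of the instanced variant; the decoding convention (low bit of the instance ID as the eye index) is one reasonable choice, not necessarily Unity's exact implementation:

// Sketch: one instanced draw covers both eyes. The engine doubles the instance
// count; the vertex shader decodes SV_InstanceID into an eye index.
cbuffer StereoMatrices
{
    float4x4 StereoViewProj[2]; // [0] = left eye, [1] = right eye
};

struct VertOutput
{
    float4 positionClip : SV_POSITION;
    uint   slice        : SV_RenderTargetArrayIndex;
};

// Input is assumed to already be in world space, to keep the sketch short.
VertOutput VertMain(float4 positionWorld : POSITION, uint instanceID : SV_InstanceID)
{
    // Even instances draw the left eye, odd instances the right eye.
    uint eyeIndex     = instanceID & 1;
    uint realInstance = instanceID >> 1; // the caller's original instance index,
                                         // available for per-instance data lookups
    VertOutput o;
    o.positionClip = mul(StereoViewProj[eyeIndex], positionWorld);
    o.slice        = eyeIndex;
    return o;
}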
The biggest benefit of this technique is that we literally halve the number of draw calls generated on the API side, saving a portion of CPU time. In addition, the GPU itself is able to process the draws more efficiently, even though the same amount of work is generated, because it does not have to process two individual draw calls. We also no longer have to change the viewport between draws, as we do with the traditional single-pass technique.
Please note: this is currently only available to users running their desktop VR experience on Windows 10, or on HoloLens.
Single Pass Multi View.
Multi-view is an extension available on certain OpenGL/OpenGL ES implementations in which the driver itself multiplexes individual draw calls across both eyes. Instead of explicitly instancing the draw call and decoding the instance ID into an eye index in the shader, the driver is responsible for duplicating the draws and providing the view index (via gl_ViewID) to the shader.
There is an underlying implementation detail that differs from stereo instancing: instead of the vertex shader explicitly selecting the render target array slice to be rasterized to, the driver itself determines the render target. gl_ViewID is used to compute view-dependent state, but not to select the render target. In practice this doesn't matter much to the developer, but it is an interesting detail.
Because of the way we use the multi-view extension, we can use the same infrastructure we built for single-pass instancing. Developers are able to use the same framework to support both Single Pass techniques.
High level performance overview.
At Unite Austin 2017, the XR Graphics team presented part of the XR graphics infrastructure and briefly discussed the performance impact of the different stereo rendering modes. A proper performance analysis will be covered in more detail in a later presentation.
Thank you for visiting.