With version 2017.2, Unity released support for stereo instancing for XR devices on DX11, giving developers access to even more performance optimizations for HTC Vive, Oculus Rift and the new Windows Mixed Reality immersive headsets. Stereo instancing is also an important consideration when building a 3D configurator. In this article, we take a closer look at this exciting evolution of rendering and show you how to take advantage of it.

Stereo rendering.

One of the unique and most obvious aspects of XR rendering is the need to generate two views, one per eye. We need these two views to create the stereoscopic 3D effect for the viewer. But before we dive into how we can present two viewpoints, let’s take a look at the classic Single Viewport case.

In a traditional rendering environment, we render our scene from a single view. We take our objects and transform them into a space suitable for rendering. We do this by applying a series of transformations that take the objects from a locally defined space to a space that we can draw on our screen.

The classic transformation pipeline starts with objects in their own local (object) space. We then transform the objects with our model, or world, matrix to bring them into world space. World space is a common space for the initial, relative placement of objects. Next, we transform our objects from world space into view space with our view matrix. Now our objects are arranged relative to our point of view. Once they are in view space, we can project them onto our 2D canvas with our projection matrix, which places the objects in clip space. The perspective divide follows, yielding normalized device coordinates (NDC), and finally the viewport transformation is applied, yielding screen space. Once we are in screen space, we can generate fragments for our render target. For the purposes of our discussion, we will render only to a single render target.
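The chain of spaces described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`mat_vec`, `translation`, `perspective`, `to_screen`) are made up for this example, the projection uses an OpenGL-style convention (camera looking down -Z, NDC depth in [-1, 1]), and the screen origin is assumed to be the top-left corner. Real engines, Unity included, differ in these conventions.

```python
import math


def mat_vec(m, v):
    """Multiply a 4x4 row-major matrix by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def translation(tx, ty, tz):
    """4x4 translation matrix (stands in for a model or view matrix)."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]


def perspective(fov_y_deg, aspect, near, far):
    """Assumed OpenGL-style perspective projection matrix."""
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    return [[f / aspect, 0, 0, 0],
            [0, f, 0, 0],
            [0, 0, (far + near) / (near - far), 2 * far * near / (near - far)],
            [0, 0, -1, 0]]


def to_screen(local_pos, model, view, proj, width, height):
    """Take a local-space point through world, view, clip, NDC and screen space."""
    world = mat_vec(model, list(local_pos) + [1.0])   # local -> world
    view_pos = mat_vec(view, world)                   # world -> view
    clip = mat_vec(proj, view_pos)                    # view -> clip
    ndc = [clip[i] / clip[3] for i in range(3)]       # perspective divide
    x = (ndc[0] * 0.5 + 0.5) * width                  # viewport transform
    y = (1.0 - (ndc[1] * 0.5 + 0.5)) * height         # flip y: top-left origin
    return x, y, ndc
```

A point at the local origin of an object placed five units in front of the camera lands in the center of the viewport, which is a quick sanity check for the whole chain.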

This series of transformations is sometimes called the “graphics transformation pipeline” and is a classic rendering technique.

Even before XR rendering, there were scenarios where we wanted to render several viewports simultaneously. Maybe we had split-screen rendering for local multiplayer. We might have had a separate mini viewport for an in-game map or a security camera feed. These alternative views can share scene data with each other, but they often share little else besides the final destination render target.

At a minimum, each view often has its own view and projection matrices. To assemble the final render target, we also need to manipulate other properties of the graphics transformation pipeline. In the “early” days when we only had one render target, we could use viewports to dictate the sub-rectangles of the screen into which we could render. As GPUs and their APIs evolved, we were able to render into separate render targets and manually composite them later.

Enter XR.

Modern XR devices introduced the requirement to drive two views to create the stereoscopic 3D effect that conveys depth to the device wearer. Each view represents one eye. While the two eyes view the same scene from a similar vantage point, each view has its own set of view and projection matrices.

Before continuing, let’s briefly define some terminology. These are not necessarily standard terms in the industry, as rendering engineers tend to use different terms and definitions across engines and use cases. Treat these terms as a local convenience.

Scene Graph – A scene graph is a data structure that organizes the information needed to render our scene and is consumed by the renderer. The scene graph can refer either to the scene as a whole or to the portion visible to the view, which we will call the “culled scene graph”.

Render Loop/Pipeline – The render loop refers to the logical architecture of how we assemble the rendered frame. This could be a high-level example of a render loop:

Culling > Shadows > Opaque > Transparent > Post Processing > Present

We go through these stages every frame to create the image that is presented to the display. We also use the term render pipeline in Unity because it relates to some of the upcoming rendering features we are exposing (e.g. the Scriptable Render Pipeline). Render pipeline can be confused with other terms, such as the graphics pipeline, which refers to the GPU pipeline that processes draw commands.

Ok, with these definitions we can return to VR rendering.

Multi-Camera.

To display the view for each eye, the simplest method is to run the render loop twice. Each eye configures and runs through its own iteration of the render loop. At the end we have two images that we can submit to the display device. The underlying implementation uses two Unity cameras, one for each eye, and each goes through the process of generating its stereo image. This was the first method of XR support in Unity and is still offered by third-party headset plug-ins.

Although this method certainly works, Multi-Camera relies on brute force and is the least efficient in terms of CPU and GPU. The CPU must iterate through the entire render loop twice, and the GPU is unlikely to benefit from any caching of objects that are drawn twice, once per eye.

Multi-Pass.

Multi-Pass was Unity’s first attempt to optimize the XR render loop. The core idea was to extract parts of the render loop that were view-independent. This means that any work that does not explicitly rely on XR views does not have to be done per eye.

The most obvious candidate for this optimization is shadow rendering. Shadows do not explicitly depend on the location of the camera viewer. Unity actually implements shadows in two steps: it generates cascaded shadow maps and then maps the shadows into screen space. For Multi-Pass, we can generate one set of cascaded shadow maps and then generate two screen-space shadow maps, since the screen-space shadow maps depend on the viewer’s location. Because of how our shadow generation is architected, the screen-space shadow maps benefit from locality, as the shadow map generation loop is relatively tightly coupled. Compare this with the rest of the render workload, which requires a full iteration over the render loop before returning to a similar stage (e.g., the eye-specific opaque passes are separated by the remaining render loop stages).

The other step that can be shared between the two eyes may not be obvious at first: we can perform a single cull for both eyes. In our first implementation we used frustum culling to generate two lists of objects, one per eye. However, we can instead create a unified culling frustum that is shared between the two eyes. This means that each eye renders a little more than it would with a single-eye culling frustum, but we considered the benefits of a single cull to outweigh the cost of some additional vertex shader work, clipping, and rasterization.

Multi-Pass gave us some nice savings over Multi-Camera, but there is more we can do.
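To make the difference between Multi-Camera and Multi-Pass concrete, here is a toy sketch that lists which render loop stages run for which eye. It is not Unity’s scheduler: the stage names and the split of the shadow stage into `cascaded_shadow_maps` and `screen_space_shadows` are hypothetical labels chosen to mirror the description above.

```python
EYES = ("left", "right")

# Hypothetical stage names, loosely following the render loop above.
VIEW_INDEPENDENT = ("culling", "cascaded_shadow_maps")
SCREEN_SPACE_SHADOWS = "screen_space_shadows"
PER_EYE = ("opaque", "transparent", "post_processing")


def multi_camera_schedule():
    """Every eye runs the whole render loop on its own."""
    schedule = []
    for eye in EYES:
        for stage in VIEW_INDEPENDENT + (SCREEN_SPACE_SHADOWS,) + PER_EYE:
            schedule.append((stage, eye))
    return schedule


def multi_pass_schedule():
    """View-independent work runs once; the rest repeats per eye."""
    schedule = [(stage, "shared") for stage in VIEW_INDEPENDENT]
    # Screen-space shadow maps are generated back to back for both eyes.
    schedule += [(SCREEN_SPACE_SHADOWS, eye) for eye in EYES]
    # The remaining stages still need one full iteration per eye.
    for eye in EYES:
        schedule += [(stage, eye) for stage in PER_EYE]
    return schedule
```

Comparing the two lists shows the culling and cascaded shadow map entries appearing once instead of twice, while the opaque, transparent and post-processing stages are still duplicated per eye.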

Single-Pass.

Single-Pass stereo rendering means we make a single traversal of the entire render loop, instead of traversing all of it, or portions of it, twice.

To perform both draws, we need to make sure we have all the constant data bound, along with an index.

What about the draws themselves? How can we issue each draw? In Multi-Pass, each eye has its own render target, but we can’t do that for Single-Pass because the cost of switching render targets between consecutive draw calls would be prohibitive. A similar option would be to use render target arrays, but on most platforms we would have to export the slice index from the geometry shader, which can be expensive on the GPU and invasive for existing shaders.

The solution we settled on was to use a double-wide render target and switch the viewport between draw calls so that each eye renders into its half of the double-wide render target. Switching viewports has a cost, but less so than switching render targets, and it is less invasive than using the geometry shader. There is also the related option of using viewport arrays, but they have the same problem as render target arrays because the index can only be exported from a geometry shader. There is another technique that uses dynamic clipping, which we won’t explore here.

Now that we have a way to issue two consecutive draws to render both eyes, we need to configure our supporting infrastructure. In Multi-Pass, because it was similar to monoscopic rendering, we could use our existing view and projection matrix infrastructure. All we had to do was replace the view and projection matrices with those of the current eye. With Single-Pass, however, we don’t want to switch constant buffer bindings unnecessarily. Instead, we bind the view and projection matrices of both eyes together and index into them with unity_StereoEyeIndex, which we can flip between the draws. This allows our shader infrastructure to choose, within the shader pass, which set of view and projection matrices to render with.
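The double-wide layout and the per-eye indexing can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: `eye_viewport`, `make_stereo_constants`, `view_space_x` and `single_pass_draws` are hypothetical helpers, and a single eye x-position stands in for the full per-eye view and projection matrices. The only idea taken from the text is that both eyes’ data are bound once and an eye index selects between them.

```python
LEFT, RIGHT = 0, 1


def eye_viewport(eye_index, eye_width, eye_height):
    """(x, y, width, height) of one eye's half of a double-wide target."""
    return (eye_index * eye_width, 0, eye_width, eye_height)


def make_stereo_constants(half_ipd):
    """Both eyes' data bound together; an x-position stands in for matrices."""
    return {"eye_position_x": [-half_ipd, +half_ipd]}


def view_space_x(world_x, constants, stereo_eye_index):
    """What a shader would do: pick the per-eye data by index."""
    return world_x - constants["eye_position_x"][stereo_eye_index]


def single_pass_draws(world_x, constants, eye_width, eye_height):
    """Issue one draw per eye, changing only the viewport and the eye index."""
    draws = []
    for eye in (LEFT, RIGHT):
        draws.append({
            "eye": eye,
            "viewport": eye_viewport(eye, eye_width, eye_height),
            "view_x": view_space_x(world_x, constants, eye),
        })
    return draws
```

Note that the constants are created once and never rebound between the two draws; only the viewport rectangle and the index change.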

An additional detail: to minimize our viewport and unity_StereoEyeIndex state changes, we can change our eye draw pattern. Instead of drawing left, right, left, right, and so on, we can use a left, right, right, left, left, right cadence. This allows us to roughly halve the number of state updates compared to the alternating cadence.
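The saving from the changed cadence is easy to count. The sketch below is a simplified model, not engine code: it assumes two draws per object (one per eye) and counts a state change, including the initial one, whenever the eye differs from the previous draw. The function names are made up for this example.

```python
def eye_sequence(num_objects, cadence):
    """Eye order for num_objects objects, two draws (one per eye) each."""
    seq = []
    for i in range(num_objects):
        if cadence == "alternating" or i % 2 == 0:
            seq += ["left", "right"]
        else:
            # Start with the eye that is already set from the previous object.
            seq += ["right", "left"]
    return seq


def count_state_changes(seq):
    """Count viewport/eye-index updates, including the initial one."""
    changes = 0
    current = None
    for eye in seq:
        if eye != current:
            changes += 1
            current = eye
    return changes
```

In this model the alternating cadence needs 2N updates for N objects, while the left, right, right, left cadence needs N + 1, which approaches half as N grows.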

Single-Pass is not exactly twice as fast as Multi-Pass. This is because we had already optimized culling and shadows in Multi-Pass, and because we are still issuing a draw per eye and switching viewports, which incurs some CPU and GPU cost.

Stereo Instancing (Single-Pass Instanced).

Previously, we mentioned the possibility of using a render target array. Render target arrays are a natural solution for stereo rendering. The eye textures share format and size, which qualifies them for use in a render target array. But having to use the geometry shader to export the array slice is a big drawback. What we really want is the ability to export the render target array index from the vertex shader, for easier integration and better performance.

The ability to export the render target array index from the vertex shader actually exists with some GPUs and APIs and is becoming more common. On DX11, this functionality is provided as a feature option VPAndRTArrayIndexFromAnyShaderFeedingRasterizer.

Now that we can choose which slice of our render target array we render to, how do we select the slice? We leverage the existing Single-Pass Double-Wide infrastructure. We can use unity_StereoEyeIndex to populate the SV_RenderTargetArrayIndex semantic in the shader. On the API side, we no longer need to switch the viewport, since the same viewport can be used for both slices of the render target array. And we have already configured our matrices to be indexed from the vertex shader.

Although we could continue to use the existing technique of issuing two draws and toggling the unity_StereoEyeIndex value in the constant buffer before each draw, there is a more efficient technique. We can use GPU instancing to issue a single draw call and let the GPU multiplex our draws across both eyes. To do this, we double the existing instance count of a draw. We can then decode the instance ID in the vertex shader to determine which eye we are rendering to.

The biggest impact of this technique is that we literally halve the number of draw calls we generate on the API side, saving a chunk of CPU time. In addition, the GPU itself is able to process the draws more efficiently, even though the same amount of work is generated, because it does not have to process two individual draw calls. We also minimize state updates by not having to change the viewport between draws, as we do with traditional Single-Pass.
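The instance doubling and decoding step can be modeled in Python. The text only says that the instance count is doubled and that the instance ID is decoded into an eye; the specific encoding used here (lowest bit selects the eye, remaining bits give the original instance) is an assumption for illustration, and the function names are hypothetical.

```python
def stereo_instance_count(instance_count):
    """Double the instance count so each instance is drawn once per eye."""
    return instance_count * 2


def decode_instance_id(instance_id):
    """Split a doubled instance ID into (eye_index, original_instance).

    Assumed encoding: lowest bit selects the eye, the rest is the
    original instance index.
    """
    eye_index = instance_id & 1
    original_instance = instance_id >> 1
    return eye_index, original_instance


def simulate_draw(instance_count):
    """One draw call; each GPU instance picks its render target slice."""
    results = []
    for instance_id in range(stereo_instance_count(instance_count)):
        eye_index, original_instance = decode_instance_id(instance_id)
        # The eye index doubles as the render target array slice.
        results.append({"slice": eye_index, "instance": original_instance})
    return results
```

With this encoding, a draw of three instances becomes a single draw of six, alternating between slice 0 and slice 1 for each original instance.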

Single-Pass Multi-View.

Multi-View is an extension available on certain OpenGL/OpenGL ES implementations where the driver itself multiplexes individual draw calls across both eyes. Instead of explicitly instancing the draw call and decoding the instance ID into an eye index in the shader, the driver is responsible for duplicating the draws and generating the array index (via gl_ViewID) in the shader.

There is one underlying implementation detail that differs from stereo instancing: instead of the vertex shader explicitly selecting the render target array slice to rasterize to, the driver itself determines the render target. gl_ViewID is used to compute view-dependent state, not to select the render target. In practice this makes little difference to the developer, but it is an interesting detail.

Because of the way we use the Multi-View extension, we can use the same infrastructure we built for single-pass instancing. Developers are able to use the same framework to support both single-pass techniques.

Performance at a high level.

At Unite Austin 2017, the XR Graphics team presented part of the XR graphics infrastructure and briefly discussed the performance impact of the different stereo rendering modes. A full performance analysis could fill a blog post of its own, so we will only summarize it briefly here.

  • Single-Pass and Single-Pass Instancing both offer a significant CPU advantage over Multi-Pass. The additional CPU gain of Single-Pass Instancing over Single-Pass is smaller, because most of the CPU overhead is already saved by switching to Single-Pass.
  • Single-Pass Instancing reduces the number of draw calls, but that cost is quite low compared to processing the scene graph.
  • Considering that most modern graphics drivers are multi-threaded, issuing draw calls can be quite fast on the dispatching CPU thread.

We hope we were able to give you a brief overview of the subject. If you have any further questions, please feel free to contact our experts in our forum.

Thank you very much for your visit.