Rendering 3D world images in a game inevitably wastes resources spent on processing elements that are invisible to the player. These resources could be better used to increase the visual complexity of the visible elements (for example 3D configurators) or reduce the time required to create a frame. To do this, we need to identify the objects that are not visible to the player. Determining the set of elements that are not visible from a certain point of view because they are hidden by elements in front of them is called occlusion culling.

ambient occlusion demo e

In interactive real-time applications, occlusion culling is traditionally used as a rendering optimization technique. It enables the production of images at a speed that allows the perception of continuous motion. However, there are a variety of uses for visibility information outside of pure rendering. Knowing whether an object is currently visible or not can be of great use.

  • Influence on AI behavior.
  • Simplification or deactivation of physical simulation and animation.
  • Reduce the amount of network traffic required to replicate player positions across the network.

When assessing the value of an occlusion culling system, there are some desirable properties to mention:

It works automatically and universally.

Ideally, an Occlusion Culling System will automatically work with any type of 3D content, from simple object viewers to huge virtual worlds, and does not require manual work by the artists who build and model the game world. In addition, the system should not restrict the artist’s creativity. After all, the system should not depend on specific hardware features, rendering conventions, authoring methods, or tools.

It works correctly.

A system that sometimes determines fully or partially visible objects to be fully enclosed must generate rendering artifacts, while a system that sometimes reports fully enclosed objects as visible can usually generate the correct visual output.

It creates added value.

For rendering purposes, occlusion culling must be judged by the reference solution, which simply displays everything in “view frustration”. For example, a sports game playing in an arena where only a small part of the total content at a given time is included is not a good candidate for an occlusion system. The effort to determine the occlusion is wasted because no benefit can be obtained. However, when visual complexity and much detail are required in complex 3D worlds, the benefits of an occlusion system begin to increase significantly. In this article we first introduce the problem domain and take a look at the common methods currently used for occlusion culling and the challenges they present in game development. We then describe a novel approach to occlusion culling designed to meet the needs of our partners and customers developing the next generation of game engines.

This approach is called the Umbra 3 Occlusion Culling System.

Background.

The roots of occlusion culling in 3D graphics lie in algorithms for determining hidden lines and hidden surfaces. These algorithms are necessary to produce visually correct images of 2D projections in a 3D space.

The problem is easy to grasp – from the rays of light that travel from surfaces in the world to their eyes, only those that do not encounter obstacles along the way contribute to the final image. This is an example of the visibility problem and the formulation readily suggests a possible solution for 3D rendering: We could simply retrace the light rays from the eye and find the first surface with which each beam intersects.

All modern polygon raster renderers, both soft and hardware, track the smallest distance value per sample and only update the sample if a distance value is less than the current minimum. This solution is called Z buffering or depth buffering for the additional buffer of the maintained Z values. Given the work already done on 2D projection and raster sampling, calculating the Z component is relatively inexpensive and guarantees the right visual experience. The calculation can be reduced by introducing primitives in a front-to-back order: rendering the nearest primitive for a given sample means that the contribution of all other primitives along the same beam can be rejected by the depth test, skipping any calculation to determine the final output of a hidden sample, such as interpolating node attributes or performing texture lookups.

Z-buffering with a primitive front-to-back rendering order comes pretty close to the ideal of calculating only a single value for each output pixel. Unfortunately, the culling takes place very late in the rendering pipeline of a 3D game application. At the point where the renderer rejects a covert sample, the sample has gone through several stages of the graphics pipeline, from feeding the input geometry and dependent resources to the GPU for per-vertex processing, triangulation, and rasterization. There are methods for early Z-culling in the rendering pipeline, but ultimately the greatest savings in calculation are achieved by selecting the largest renderable units of the engines before feeding them into the rendering subsystem.

Occluder Grid.

Most strategies for removing occlusion culling at runtime tell whether an object is visible by performing the equivalent of a sample-based depth test for the transformed boundaries of potentially occluded single objects or object hierarchies. The challenge is to create a depth buffer estimate of the view before the actual rendering takes place.

A common solution is to use a separate depth rendering pass or the depth results of a previous frame on the GPU and use occlusion queries. An Occlusion query returns the number of samples that may pass the Depth Test without actually processing or writing pixel values. The challenge with this approach is the synchronization between CPU and GPU required to transfer the test data and obtain occlusion information. In practice, it is virtually impossible to use this approach for anything other than rendering optimization.

The advantage is that there is no additional work involved in generating the depth buffer and it represents the exact final image for any type of occluder geometry. To enable asynchronous operation of the CPU and GPU and minimize data traffic between them, these systems typically use the previous results of the frame depth buffer and therefore cannot guarantee conservative sorting. To obtain conservative culling, the geometry of the grid coverage must not exceed the actual result. Normally, content artists create these occlusion models manually, side by side with the real models.

Potentially visible quantities (PVS).

If the runtime costs for early depth buffer generation and occlusion tests are not feasible and the occluder geometry is mostly static, it is a reasonable alternative to determine and store the visibility relationships between view cells and renderable objects in a pre-processing phase. The set of entities that are visibly determined from a view cell is called a potentially visible set. The run operation simply finds the view cell of the current camera location and searches for the set from memory. In simple cases, visibility relationships can be created manually by the level designer, but the usual method is to capture visibility from a visual cell either by beam delivery or rasterization in all directions. It is difficult to ensure conservative culling in both cases. By increasing the number of samples per view line, the error rate can be controlled at the expense of the time required for the calculation. In addition to the static target objects, volumetric visibility information can also be stored in the set in the form of target cells to sort out non-static objects.

The generation of potentially visible sets can be automated, but a large amount of data must be generated to achieve reasonable selection results. The sampling time in the pre-process phase slows down the iteration cycle of the content and the sheer amount of data to display the visibility relationships can quickly become manageable. This is all the more true as visibility relationships are of a global nature. A small change in Occluder geometry can lead to changes in visibility relationships far away and on all sides of the original change and therefore requires a recalculation of the potentially visible set of a large area.

Portals and Cells.

A third category of occlusion culling systems is based on dividing the static game world into cells and capturing visibility between adjacent cells with 2D portals. The runtime operation is to find the cell in which the camera is located and traverse the formed cell diagram, limiting visibility along the way by clipping the view frustrum onto the portals it traverses. Objects are assigned to cells in a pre-processing phase and their visibility is determined by visiting the cell they are in. This approach works best when there are obvious hotspots around the world where the level designer can place portals, such as doors or windows that connect indoor spaces.

The portal and cell data is a simplified occlusion model of the world that stores non-occluders and their connectivity instead of occluders. While accurate and conservative occlusion results can be achieved at relatively low runtime costs, the work of manually placing cells and portals is very labor-intensive, error-prone and dramatically increases the cost of content modification.

Umbra 3: the next generation of occlusion culling.

The Umbra 3 Occlusion Culling System is designed to meet the challenges described in the previous chapters.

Umbra 3 is based on the idea of transforming a 3D world from a detailed polygon representation created by the artist into a simplified one. The transformation focuses on capturing the essential occlusion properties while ignoring information about details that have little or no relevance in the occlusion culling process. This approach is similar to the manually created portal and cell diagram. However, the occlusion model used by Umbra 3 is an automatically generated portal and cell diagram.

Traditionally, the game worlds in which portal culling is successfully used have been designed so that a small number of portals can capture rough occlusion information. Empty spaces connected by narrow passages, with most connections taking place on or near the base plate. A general solution cannot make such assumptions about the world and must work with arbitrarily complicated topologies in all three dimensions. A much larger number of cells and portals is required to capture the closure of large external surfaces with individual closure characteristics.

Umbra 3 consists of two components:

The Optimizer.

The Optimizer is integrated into the game editor during the pre-processing phase. It takes as input any kind of polygon soup and discretely discriminates the entire scene in voxel to generate view cells. The view cells are then connected through portals. The diagram data of the portals and cells are stored in an optimized data structure that is accessed at runtime.

The runtime component.

The runtime component rasters a software depth buffer that is used to test visibility. It enables a large number of dynamic object tests per frame at very low computing costs. The accuracy of the algorithm is close to that of hardware occlusion queries, but as a pure software implementation it is free of all synchronization and portability problems typical of hardware-based occlusion culling.

The technical challenges for the automatically generated portal and cell diagram can be summarized as follows:

  • A random polygon soup must be transformed into a representative portal and cell diagram so that the size of the diagram and the granularity of the occlusion can be controlled.
  • An effective runtime culling method is required. It must be able to use the portal and cell chart and handle a relatively large number of portals. In addition, it must deliver conservative culling results in fixed or limited memory and do so quickly enough not to become a bottleneck in a minimal occlusion scenario.

The following two chapters describe solutions to these two challenges as implemented in the Umbra 3 Occlusion Culling System.

Generation of occlusion data.

At a high level, the diagram for a 3D scene is generated by creating a raw, evenly distributed source portal diagram. This diagram is simplified gradually by removing data that contributes little to the overall occlusion. The entered occlusion meshes are used only in the first phase of cell generation to determine whether a single voxel or an axis-aligned box should be considered fixed or empty.

Voxification allows the algorithm to be independent of geometric complexity. This is made possible by the fact that the result only has to be conservatively compensated for the real geometry.

Initial cell generation.

The initial diagram is defined by first dividing the scene geometry into an axis-oriented mesh. The size of the initial raster node is chosen according to the granularity at which we want to examine the scene shutter functions. Next, we further subdivide the geometry into voxels that are either fixed or empty by intersecting the voxel with the geometry.

The voxels are used to find unconnected areas within the cube by filling empty voxels with floods, i.e. recursively grouping adjacent empty voxels. The empty voxel sets are the first candidate cells for our graph. For each of these empty voxel groups we find the intersection point with the cube surfaces and save the axially parallel boundaries of the intersection point as a portal rectangle. In order to connect the initial cells to a diagram, the portal is adapted to a neighboring portal, which is generated on the other side of the cube surface.

The advantage of using a discrete volumetric representation of the geometry should be obvious. The intersection of any polygon data with axis-aligned boxes can be done numerically robustly and after the formation of the voxels the algorithm is independent of the complexity of the input data and only has to deal with a simple integer mesh. Since we are after a conservative occlusion representation, we do not have to follow the geometry exactly as long as we are on the conservative side.

The only exception to this method is the connection determination within the grid node. A hole in the mesh data that is smaller than the selected voxel size is not captured by a path of empty voxels and does not contribute to connectivity. Fortunately, small, isolated, transparent holes are relatively rare and often unintentional in games.

If we have information about static targets at this point, we can continue to voxel their geometry similarly and assign objects to our source cells by finding the cell to which the neighboring empty voxels belong. If the same geometry is to be used both as an occluder and as a static target object, we can of course save half the work.

Graphimplification.

If we find the portal with the smallest occlusion value in our portal and cell diagram, we can remove the portal by combining the cells that connect the portal to each other. This method forms a simplified diagram that minimizes the loss of all occlusion information. As we iteratively advance to a portal lock curve threshold or a fixed storage budget, we can control the cost-benefit ratio of the final graph.

Optimal simplification of the graph by grouping the two cells connected by the portal that have the worst global occlusion value would require unaffordable global graph analysis because visibility is essentially a global attribute. However, local heuristics can be used to quickly remove the most useless portals from the graph. Simplification can be done hierarchically by lowering the threshold for portal locks as we look up in the hierarchy to larger areas. At each new hierarchy level, we connect the external portals of neighboring volumes and perform the simplification for the newly formed combined graph.

The hierarchical simplification also offers the possibility to output detailed versions of occlusion data. The runtime algorithm can use these outputs to select the occlusion level used based on distance from point of view.

View Tree.

The graphs of portals and cells can become very complex. The last game cells, formed by dividing them into contiguous voxel groups and then grouping them, are no longer easily captured by simple geometric shapes, and saving the complete voxel structure of even a small scene is impractical. To use the portal and cell diagram for runtime culling, a different data structure is required to accurately locate the cell corresponding to the camera position on the diagram. This data structure is called the View Tree.

In summary, the View Tree is structured as follows:

  • Two adjacent empty voxels belonging to the same cell can be merged as their separate storage does not add any information. Large empty areas can be collapsed into axially parallel elements in the View Tree.
  • In gaming applications, the camera must not be placed arbitrarily close to the geometry. This ensures that the near clipping layer cannot overlap with the geometry, as this would lead to unwanted render results. Therefore, we can collapse adjacent voxels and even empty voxels belonging to separate cells as long as the size of the resulting element does not exceed the collision radius of the camera.
  • Finally, we can identify and simplify voxel groups that are located in volumes where the camera must not be present. This also includes cells that are formed inside walls and below the terrain.

Tiles.

Apart from adjusting the parameters for voxelisation, the portal generation process is automatic. However, it may take some time to finish. This is unpleasant from the content iteration point of view and wastes valuable artist and designer time. Memory consumption is also a problem. Uncomplicated, high-resolution voxelization can consume gigabytes of memory.

Umbra 3 avoids this problem by dividing the world into spatial arithmetic units, tiles, which can be calculated independently. The world can then be calculated tile by tile in parallel or distributed. Only the currently processed tile set needs memory for the high-resolution voxel representation.

Since the process of portal generation is deterministic, calculation results for a tile can also be saved by hash input. This approach supports incremental changes where only the updated part of a scene has to be recalculated. The results can also be exchanged over the network. At runtime, the visibility data can be streamed tile by tile.

Run-Time Portal Culling.

The other component of the Umbra 3 solution is a run-time portal culling algorithm. This algorithm uses the simplified occlusion model to determine visibility from the camera view.

The runtime culling algorithm must be fast – the entire process must be completed in the order of a few milliseconds – to meet the value creation requirement in a 60 frames per second game.

Since the amount of portal graphics data processed can be significant, a key consideration in algorithm design is data access and the caching pattern. This is especially true for hardware architectures that are optimized for streaming parallel data, such as the Cell processor in PlayStation 3.

Graph Run.

The automatically generated portal and cell diagram is not fundamentally different from a traditional hand-crafted diagram. Therefore, we could easily use the traditional algorithm of portal culling, where the diagram is recursively traversed while we tailor the view frustration to the entered portals. However, this type of traversal must visit each cell once for each visible path. In the worst case, this can lead to an exponential algorithm runtime and catastrophic poor performance.

To limit the re-entry of cells, the Umbra 3 Runtime Culling algorithm usually runs through the diagram in width first. Instead of a geometric portal solver, the software raster is used to collect the visibility information in a screen space lock buffer. This method almost completely eliminates excessive recursions. However, since the cells are built from the initial grid cube, a true processing order of the first cell may not even exist: There may be cycles between cells in the graph that leave no choice but to process a cell more than once.

To remedy this, the graph is saved as box-shaped tiles, each containing dozens to hundreds of cells. The tiles are created so that a simple front-to-background order can be defined between them. The Per-Tile traverse also makes access to the diagram data spatially coherent and cache-friendly.

The advantage of tiling is that we know that once all cells in a tile have been processed, we have collected the final visibility information for all and can therefore securely share any resources held to capture visibility per cell.

Portal grid.

Hierarchical tile diagram traversal helps to use the number of cells that are active at a given time. Another important memory optimization is the data layout of the occlusion buffer. Here, too, the fact that only conservatively correct results are required provides a cost-effective solution. The portal rectangles are rasterized into a low resolution 1 bit per sample coverage buffer. Each bit set in the buffer represents an unlocked path to the cell.

The concept of screening the empty space (the portals) instead of a representation of the occluding geometry may seem unorthodox at first. The rasterization of transformed Quads allows a simple and conservative handling of geometric edges. The Rasterizer performs half-space tests against the portal edge functions and saves computing resources in large indoor or outdoor areas by hierarchically subdividing the screen rectangle and testing against rectangular corner points. Conservative rasterization also allows the resolution to be scaled based on the performance of the runtime platform.

The total amount of memory required by the portal culling algorithm, including the traverse data structures of the output depth buffer and the coverage buffer for the active cell set, is fixed at 85 kilobytes. This makes it small enough to fit into the local memory of modern flow-oriented processing units.

Occlusion Testing.

Occlusion testing of static objects located in the cells can be performed during diagram traversal by determining whether there are visible pixels in the cover buffer that overlap with the objects’ screen space boundaries.

Assigning static objects to cells is easy in the pre-processing phase, where we can afford to work in voxel representation. However, all dynamic entities that we want to test require a different solution. For this purpose, the extraction algorithm builds a global depth buffer during the process. The depth buffer of each cell traversed is updated based on the current cover buffer, using the cell boundary frame spacing as the depth write value. In this way, only a single 16-bit 128×128 depth buffer needs to be held in memory for the duration of the movement.

The ability to output a single depth buffer from the traverse allows the dynamic object lock tests to be performed at any time after diagram execution. This can be particularly important if the exact position and circumference of the animating objects are to be determined parallel to the execution of the portal readout traverse.

Conclusions.

If you exceed the limit of total complexity that can be represented in a game, accurate visibility information is an invaluable advantage. Determining the elements that are not visible from a certain angle is called occlusion culling. Many approaches have been developed and used to solve the problem of occlusion culling. However, every traditional approach involves a number of compromises and therefore culling remains an active area of research.

The Umbra 3 Occlusion Culling System was developed to overcome the compromises of traditional occlusion culling solutions. Umbra 3 automatically processes any type of polygonal input scene into a compact runtime representation. This representation can be used for efficient and conservatively correct occlusion culling. The most important methods are:

  • The discretization of the input geometry during the voxelization process.
  • The runtime software rasterization, which quickly generates the list of visible objects together with a conservative depth buffer in the desired resolution.

The portal graphics and runtime resolution parameters allow the system to seamlessly scale from low to high-end devices. Therefore Umbra 3 is also a practicable solution for smartphone games.

In summary, Umbra 3 enables the creation of massive 3D game worlds from random polygon soups without manual marking for visibility and at the same time guarantees conservatively correct culling results. Umbra 3 also makes visibility and spatial argumentation information accessible to a broader category of game machine subsystems. For this reason, game developers around the world, such as Bungie and many others, use Umbra 3 in various productions on a variety of runtime platforms.