Simultaneous Localization and Mapping (SLAM) is an increasingly important topic in the computer vision community (especially in the creation of 3D configurators) and is of particular interest to the augmented and virtual reality industry. Since a large number of SLAM systems from academia and industry are available, it is worth exploring what SLAM is. This article gives a brief introduction to what SLAM is and what it is used for in computer vision research and development, in particular in augmented reality.

What is SLAM?

SLAM is not a particular algorithm or piece of software. It refers to the problem of simultaneously localizing a sensor, i.e. estimating its position and orientation relative to its environment, while mapping the structure of that environment. So a “SLAM system” means “a set of algorithms that solve the simultaneous localization and mapping problem”.

SLAM is not necessarily a computer vision problem and does not need to involve visual information at all. In fact, much of the early research was carried out with ground-based robots equipped with laser scanners. In this article, however, I will focus mainly on visual SLAM, where the primary mode of perception is a camera, because it is of major interest for augmented reality. Many of the topics discussed apply more generally.

The requirement to recover both the camera's pose and the map when neither is known beforehand distinguishes the SLAM problem from other tasks. For example, marker-based tracking is not SLAM because the marker image is known beforehand. 3D reconstruction with a fixed camera rig is not SLAM either: the map is being recovered, but the positions of the cameras are already known. The challenge in SLAM is to recover both the camera pose and the map structure when initially neither is known.

An important difference between SLAM and other, seemingly similar methods of recovering pose and structure is the requirement to work in real time. This is a somewhat fuzzy concept, but in general it means that the processing of each incoming camera image must be completed by the time the next one arrives, so that the camera pose is available immediately and not as the result of post-processing. This distinguishes SLAM from techniques such as structure from motion, where an unordered set of images is processed offline to recover the 3D structure of an environment in a potentially time-consuming process. This can produce impressive results, but the crucial point is that the camera pose is not known while the images are being captured.
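To make the real-time constraint concrete, here is a minimal Python sketch of the per-frame time budget. The function names and the 30 fps frame rate are assumptions for illustration; `track` stands in for whatever per-frame processing a real system performs.

```python
import time

def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps

def count_late_frames(frames, track, fps=30.0):
    """Run `track` on every frame and count how often it overruns the budget."""
    budget = frame_budget_ms(fps)
    late = 0
    for frame in frames:
        start = time.perf_counter()
        track(frame)  # hypothetical per-frame pose estimation
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if elapsed_ms > budget:
            late += 1  # the pose was not ready before the next frame arrived
    return late

budget = frame_budget_ms(30.0)  # about 33.3 ms per frame at 30 fps
late = count_late_frames(range(100), track=lambda frame: None)
```

An offline method such as structure from motion has no such budget, which is why it can afford far more expensive processing per image.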

A brief history of SLAM

Research into the SLAM problem began within the robotics community, usually with wheeled robots traversing a flat ground plane. Typically, the problem was tackled by combining sensor readings with information about the control input and the measured robot state. This may seem far removed from tracking a handheld camera moving freely in space, but it embodies many of SLAM's core problems, such as creating a consistent and accurate map and making optimal use of multiple unreliable sources of information.

More recently, the use of visual sensors has become an important aspect of SLAM research, not least because an image is a rich source of information about the structure of the environment. Much of the research on visual SLAM uses stereo cameras, or cameras alongside other sensors, but since about 2001 a number of studies have shown that SLAM can work successfully with a single camera (known as monocular visual SLAM). One example is the groundbreaking work of Andrew Davison at Oxford University.

This was crucial in making SLAM a much more useful technology, as devices equipped with a single camera, such as webcams and mobile phones, are far more common and accessible than specialized sensing equipment. Recent work has shown how monocular visual SLAM can be used to create large-scale maps, how maps can be automatically augmented with meaningful 3D structures, and how extremely detailed shapes can be recovered in real time. SLAM is an active field of computer vision research, and new and improved techniques are constantly emerging.

At its core, SLAM is an optimization problem: computing the configuration of camera poses and point positions that minimizes the reprojection error, i.e. the difference between the tracked image location of each point and the location where that point is expected to appear given the estimated camera pose, summed over all points. The method of choice for solving this problem is called bundle adjustment, a nonlinear least-squares algorithm that, when properly set up, iteratively approaches the minimum error for the entire system.
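The reprojection error can be illustrated with a short Python sketch. It assumes a simple pinhole camera whose pose is reduced to a position and a rotation about the vertical axis. The focal length, image center, and all function names are illustrative assumptions, not part of any particular SLAM system.

```python
import math

def project(point, cam_pos, yaw, focal=500.0, cx=320.0, cy=240.0):
    """Project a 3D world point into an assumed pinhole camera.

    The camera sits at cam_pos, is rotated by yaw about the vertical (y)
    axis, and looks along +z in its own frame.
    """
    # Translate the point into a camera-centred frame
    dx = point[0] - cam_pos[0]
    dy = point[1] - cam_pos[1]
    dz = point[2] - cam_pos[2]
    # Rotate from world axes to camera axes (inverse of the camera's yaw)
    c, s = math.cos(yaw), math.sin(yaw)
    x = c * dx - s * dz
    y = dy
    z = s * dx + c * dz
    if z <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division and shift to pixel coordinates
    return (focal * x / z + cx, focal * y / z + cy)

def reprojection_error(points, observations, cam_pos, yaw):
    """Sum of squared pixel distances between observed and predicted points."""
    total = 0.0
    for p, (u_obs, v_obs) in zip(points, observations):
        u, v = project(p, cam_pos, yaw)
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total

# Synthetic map points and the pixels where the "true" camera sees them
points = [(-1.0, 0.5, 4.0), (0.0, -0.5, 5.0), (1.0, 0.2, 6.0), (0.5, 1.0, 3.0)]
true_pos, true_yaw = (0.2, 0.0, -0.5), 0.05
observations = [project(p, true_pos, true_yaw) for p in points]

err_true = reprojection_error(points, observations, true_pos, true_yaw)
err_off = reprojection_error(points, observations, (0.3, 0.0, -0.5), 0.0)
```

The error is zero at the true pose and grows for the perturbed one. Bundle adjustment searches for the poses and point positions that minimize this quantity over all cameras and points at once, whereas this sketch only evaluates it for a single camera.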

The problem with bundle adjustment is that finding the best solution can be very time-consuming, and the time required grows rapidly with the size of the map. With the advent of multicore machines, this has been addressed by separating localization from mapping. Localization (tracking points to estimate the current camera pose) runs in real time in one thread, while a mapping thread performs bundle adjustment on the map in the background. When it finishes, the mapping thread updates the map used for tracking, and the tracker in turn adds new observations to extend the map. As a result, tracking is continuous, while the map is only optimized periodically.
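The split between tracking and mapping can be sketched in Python with the standard `threading` and `queue` modules. This is a structural sketch only: the class and function names are hypothetical, and the pose estimation and bundle adjustment steps are placeholders.

```python
import queue
import threading

class SharedMap:
    """Map shared between tracker and mapper, guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._points = []
        self.version = 0  # incremented on every background update

    def snapshot(self):
        with self._lock:
            return list(self._points), self.version

    def replace(self, points):
        with self._lock:
            self._points = list(points)
            self.version += 1

def mapping_thread(shared_map, keyframes):
    """Background thread: fold each new keyframe into the map."""
    points = []
    while True:
        keyframe = keyframes.get()
        if keyframe is None:  # sentinel: the tracker has finished
            break
        points.append(keyframe)
        # Placeholder for bundle adjustment over all keyframes and points
        optimized = list(points)
        shared_map.replace(optimized)

def tracking_loop(shared_map, keyframes, frames, keyframe_every=3):
    """Foreground loop: one cheap step per frame, never waits for the mapper."""
    poses = []
    for i, frame in enumerate(frames):
        points, version = shared_map.snapshot()
        poses.append((frame, version))  # placeholder for pose estimation
        if i % keyframe_every == 0:
            keyframes.put(frame)  # hand a new observation to the mapper
    keyframes.put(None)
    return poses

shared_map = SharedMap()
keyframes = queue.Queue()
mapper = threading.Thread(target=mapping_thread, args=(shared_map, keyframes))
mapper.start()
poses = tracking_loop(shared_map, keyframes, frames=range(10))
mapper.join()
final_points, final_version = shared_map.snapshot()
```

The tracker produces one pose per frame regardless of how long the mapper takes. It simply works with whichever map version is current when the frame arrives.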

Apart from details such as the method of point tracking, the initialization of the map, robustness against false matches, and strategies for optimizing the map more efficiently, this describes the basic functionality of a SLAM system. Further capabilities are often added to make SLAM more useful in practice. A critical feature for mapping larger areas, for example, is loop closure: the gradual accumulation of error over time can be reduced by linking the current position to an earlier part of the map, which adds constraints to the optimization problem and yields a more accurate map overall.
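The effect of loop closure can be shown with a deliberately simplified Python sketch. It assumes a 2D trajectory without rotation and spreads the accumulated error linearly along the path once the end is recognized as the starting point. Real systems add the loop constraint to a pose graph or bundle adjustment instead, and the function names here are hypothetical.

```python
def integrate(odometry, start=(0.0, 0.0)):
    """Chain relative 2D motions into absolute positions (dead reckoning)."""
    poses = [start]
    for dx, dy in odometry:
        x, y = poses[-1]
        poses.append((x + dx, y + dy))
    return poses

def close_loop(poses):
    """Spread the gap between last and first pose linearly over the path.

    Assumes the last pose was recognized as the same place as the first.
    """
    n = len(poses) - 1
    gap_x = poses[-1][0] - poses[0][0]
    gap_y = poses[-1][1] - poses[0][1]
    return [(x - gap_x * i / n, y - gap_y * i / n)
            for i, (x, y) in enumerate(poses)]

# A square path whose measured steps each carry a small constant bias (drift)
true_steps = [(1, 0)] * 4 + [(0, 1)] * 4 + [(-1, 0)] * 4 + [(0, -1)] * 4
drifting_steps = [(dx + 0.02, dy + 0.01) for dx, dy in true_steps]

raw = integrate(drifting_steps)  # ends away from the start because of drift
corrected = close_loop(raw)      # ends back at the start
```

Because the drift in this toy example is a constant bias, the linear correction recovers the true path. With realistic noise the correction would only reduce the error, not remove it.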

Another key technique is relocalization, which handles temporarily poor tracking that could otherwise lead to complete system failure. It allows the tracker to be restarted by determining which part of the previously visited map best matches the current camera image. Our SLAM system demonstrates this: tracking fails and is recovered as soon as the map is visible again. Relocalization also solves the “kidnapped camera” problem, where the camera is covered while it is moved to another location: tracking resumes successfully even though the camera's viewpoint has changed.
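A minimal Python sketch of the relocalization idea follows, assuming each keyframe in the map stores a handful of binary feature descriptors (represented here as 8-bit integers). The thresholds, keyframe names, and function names are illustrative assumptions. Real systems use much longer descriptors and faster lookup structures.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")

def match_score(frame_desc, keyframe_desc, max_dist=2):
    """Count frame descriptors that have a close match in the keyframe."""
    return sum(
        1 for d in frame_desc
        if any(hamming(d, k) <= max_dist for k in keyframe_desc)
    )

def relocalize(frame_desc, keyframes, min_matches=3):
    """Return the id of the best-matching keyframe, or None if still lost."""
    best_id, best_score = None, 0
    for kf_id, kf_desc in keyframes.items():
        score = match_score(frame_desc, kf_desc)
        if score > best_score:
            best_id, best_score = kf_id, score
    return best_id if best_score >= min_matches else None

# Hypothetical map: two keyframes with four descriptors each
keyframes = {
    "desk": [0b10110010, 0b01101100, 0b11100001, 0b00011110],
    "door": [0b00000111, 0b11111000, 0b01010101, 0b10101010],
}

# A new frame that sees the desk again, with slightly noisy descriptors
frame_near_desk = [0b10110011, 0b01101100, 0b11100101, 0b00011110]
# A frame of a place that has not been mapped
frame_unknown = [0b00000000, 0b11111111, 0b00110011, 0b11001100]

found = relocalize(frame_near_desk, keyframes)
lost = relocalize(frame_unknown, keyframes)
```

Once a matching keyframe is found, the tracker can be restarted from the pose stored with that keyframe. If nothing matches well enough, the system stays lost until a mapped part of the scene comes back into view.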


The above gives a quick overview of what the SLAM problem involves and how it can be solved. But why is this important, and when does a SLAM system make sense? As in the original robotics research, localizing a camera in an unknown environment is useful for exploring and navigating terrain for which no prior map is available. It is even used to explore the surface of Mars.

This ability to localize a camera accurately without prior reference is also critical for augmented reality. It allows virtual content to be fixed in place relative to the real world, because the real world is being mapped and the camera that captures the images to be augmented is localized within that map. Fast and accurate localization is therefore crucial (otherwise lag or drift would be visible in the rendered graphics). The difference between SLAM and something like our Marker Tracker is that no predefined image target is needed to create augmentations, nor does any particular object have to remain in view.

More sophisticated augmented reality applications are possible because a map is created as a necessary step in localization. This enables applications that go beyond simply placing virtual content in the world coordinate frame: virtual content can respond to real objects in the scene.

Instead of creating a map of an environment, a SLAM system can also be used to create a 3D reconstruction of an object by simply moving a camera around it. For example, the upcoming Kudan SLAM system has been used to demonstrate building a point-based 3D model of a complicated object and then tracking it while deliberately ignoring the background.

It is important to note that while SLAM enables a wide variety of applications, there are some things it is not the right tool for. SLAM systems assume that the camera moves through a static scene, which makes them unsuitable for tasks such as tracking people or recognizing gestures: both involve non-rigidly deforming objects and a non-static map. These tasks can be handled by computer vision, but they are not “mapping” problems. Similarly, vision tasks such as face recognition, image understanding, and classification have nothing to do with SLAM.

However, as shown above, a number of important tasks such as tracking moving cameras, augmented reality, map reconstruction, interactions between virtual and real objects, object tracking, and 3D modeling can be performed with a SLAM system. Moreover, the availability of such technology will lead to further developments and greater sophistication in augmented reality applications.