SLAM from an Algorithmic Point of View (Part 2)
As we discussed in Part 1, SLAM is the process of concurrently estimating a robotic vehicle's position in an uncharted area while gradually creating a map of it. SLAM algorithms can be classified into three primary techniques: filter-based SLAM, graph-based SLAM, and deep learning-based SLAM.
Filter-based SLAM treats SLAM as a state estimation problem. Here, a probabilistic filter, such as the Extended Kalman Filter (EKF) or the Unscented Kalman Filter (UKF), is commonly used to recursively estimate the state of the robot and update the map based on sensor measurements. The filter predicts the next state of the robot based on its motion model and then corrects this prediction using sensor measurements.
Graph-based SLAM, in contrast to filter-based SLAM, approaches the problem as a graph optimization task. Here, the SLAM problem is formulated as a graph whose nodes represent robot poses or landmarks in the environment and whose edges represent measurements or constraints between them. The goal of graph-based SLAM is to optimize the robot poses and landmark positions so that the measured constraints (e.g., distances between landmarks, relative poses between robot poses) are satisfied as accurately as possible.
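To make the idea concrete, the minimal sketch below optimizes a toy one-dimensional pose graph with SciPy: three poses connected by odometry edges plus one loop-closure edge, all with hypothetical values. Real back ends such as g2o or GTSAM solve the same kind of nonlinear least-squares problem over full 2D/3D poses.

```python
# Minimal 1-D pose-graph sketch (illustrative only, not a full SLAM back end).
# Three robot poses along a line; odometry edges measure the relative motion
# between consecutive poses, and one "loop closure" edge measures pose 2
# relative to pose 0. Optimization finds poses that best satisfy all
# constraints in a least-squares sense.
import numpy as np
from scipy.optimize import least_squares

# (i, j, measured relative displacement x_j - x_i) -- hypothetical values
edges = [(0, 1, 1.0), (1, 2, 1.1), (0, 2, 2.0)]

def residuals(x):
    # One residual per edge: predicted relative motion minus measurement
    res = [(x[j] - x[i]) - z for i, j, z in edges]
    res.append(x[0] - 0.0)   # prior: anchor the first pose at the origin
    return np.array(res)

x_init = np.array([0.0, 1.0, 2.1])       # initial guess from raw odometry
result = least_squares(residuals, x_init)
print(result.x)                           # optimized poses
```

The prior on the first pose removes the ambiguity of where the whole trajectory sits; without some anchor, only relative poses are constrained.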
Deep learning-based SLAM methods leverage neural networks to directly learn representations of the environment from sensor data without relying on handcrafted features or models. These methods can learn complex mappings between sensor measurements and the robot's pose or map, enabling end-to-end SLAM solutions.
The core functionalities of SLAM, mapping and localization, are intricately linked: the robot continuously updates its map based on sensor data and adjusts its position estimate accordingly. SLAM is also modular, so individual stages of the pipeline can be swapped or modified. As a result, several algorithms are often developed and used in tandem, which makes it cumbersome to generalize and explain SLAM as a single algorithm. The best way to understand SLAM is therefore to focus on one particular implementation. With that in mind, let's discuss filter-based visual SLAM (vSLAM) in detail.
Visual SLAM
As the name implies, vSLAM uses a visual sensor, a camera, as its primary sensor. Additionally, it may include encoders, an inertial measurement unit (IMU), and other sensors. A generic block diagram of the implementation can be seen in Figure 1.
Figure 1: Simplified generic block diagram of feature-based SLAM process. (Reproduced from kudan.io)
Camera Measurement
The camera captures images of the robot's surroundings, including features like landmarks, edges, and textures. The captured images, however, need to be distortion-corrected, as most camera lenses introduce some level of distortion. The cameras used could be stereo, monocular, or RGB-D cameras with Time-of-Flight (ToF) depth sensors. The advantage of stereo and RGB-D cameras is that depth information is readily available, whereas monocular cameras suffer from scale ambiguity: monocular SLAM cannot determine the length of translational movement (the scale factor) from feature correspondences alone. There are ways to mitigate this, but they are beyond the scope of this article.
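As a rough illustration of the distortion-correction step, the sketch below undistorts a single frame with OpenCV. The camera matrix, distortion coefficients, and file names are hypothetical placeholders; in practice, the intrinsics come from a calibration procedure such as cv2.calibrateCamera with a checkerboard target.

```python
# Sketch of lens-distortion correction with OpenCV (assumed intrinsics).
import cv2
import numpy as np

camera_matrix = np.array([[700.0,   0.0, 320.0],
                          [  0.0, 700.0, 240.0],
                          [  0.0,   0.0,   1.0]])        # fx, fy, cx, cy (hypothetical)
dist_coeffs = np.array([-0.25, 0.08, 0.0, 0.0, 0.0])     # k1, k2, p1, p2, k3 (hypothetical)

raw = cv2.imread("frame.png")                             # hypothetical input frame
undistorted = cv2.undistort(raw, camera_matrix, dist_coeffs)
cv2.imwrite("frame_undistorted.png", undistorted)
```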
Feature Extraction
After capturing the images using a camera sensor, we need to uniquely characterize each frame by finding its features for future reference. A feature in this context is a collection of pixels that is distinctive and can be consistently identified; in other words, features are distinctive points in an image that are invariant to rotation, scale, and distortion, making them easy to re-identify even after image manipulations. Since we are using a stereo camera as our primary sensor, we should see overlapping features between the stereo images it captures. These shared features can then be used to estimate their distance from the sensor. Before that, however, we need to identify the common features in the stereo image pair, which is done by feature detectors and matchers. Common feature detectors include the Scale-Invariant Feature Transform (SIFT), Oriented FAST and Rotated BRIEF (ORB), and Good Features to Track (GFTT). Figure 2 shows the features identified using some of these popular feature detectors. Once the features are identified, they are given a description (a descriptor) computed by the same algorithms, which makes it easy to re-identify these features later.
Figure 2: Key points detected on the XRP robot image using a) GFTT b) SIFT. (Source: SparkFun Electronics)
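A minimal sketch of the detection step using OpenCV is shown below; ORB and GFTT are used here, but SIFT could be substituted. File names and parameter values are hypothetical.

```python
# Sketch of feature detection and description with OpenCV.
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)     # hypothetical frame

# ORB detects key points and computes binary descriptors in one call
orb = cv2.ORB_create(nfeatures=1000)
keypoints, descriptors = orb.detectAndCompute(img, None)

# GFTT only detects corners; it does not produce descriptors by itself
gftt = cv2.GFTTDetector_create(maxCorners=500)
corners = gftt.detect(img, None)

vis = cv2.drawKeypoints(img, keypoints, None, color=(0, 255, 0))
cv2.imwrite("keypoints.png", vis)
```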
After finding the key points, we establish correspondences between them through matching. Feature-matching algorithms include the Brute-Force Matcher and the Fast Library for Approximate Nearest Neighbors (FLANN). Figure 3 shows a matching algorithm in action: lines connect the matches, and since we used mirror images, we should ideally get only horizontal (parallel) lines if the system were perfect. Unfortunately, feature-matching algorithms are rarely perfect and therefore produce false matches, some of which are indicated by slanted lines. This is why we need outlier rejection tools like Random Sample Consensus (RANSAC).
Figure 3: Lines indicating the feature matches on two inverted images using FLANN. (Source: SparkFun Electronics)
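The sketch below pairs ORB detection with matching across two frames. A brute-force Hamming matcher with Lowe's ratio test is shown because ORB produces binary descriptors; FLANN can be used in the same way (with an LSH index for binary descriptors). File names are hypothetical.

```python
# Sketch of feature matching between two frames with OpenCV.
import cv2

img1 = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)     # hypothetical stereo pair
img2 = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Hamming distance suits ORB's binary descriptors
matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
matches = matcher.knnMatch(des1, des2, k=2)

# Lowe's ratio test discards ambiguous matches before outlier rejection
good = []
for pair in matches:
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])

vis = cv2.drawMatches(img1, kp1, img2, kp2, good, None)
cv2.imwrite("matches.png", vis)
```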
RANSAC
Algorithms like RANSAC are used to filter out these incorrect matches, ensuring that only inliers (correct matches) are used for further processing. RANSAC works by building a model from a random subset of the provided data: a few points are tentatively treated as inliers, a model is fitted to them, and the remaining points are checked against that model. We then assess how well the model explains the entire dataset. This process is repeated until a model that describes the data with minimal error, as determined by a cost function, is found.
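A standalone sketch of RANSAC in action is shown below: it fits a homography to synthetic point correspondences, a quarter of which are deliberately corrupted, and reports how many are retained as inliers. The numbers are arbitrary; in the vSLAM pipeline, the input would be the matched key-point coordinates from the previous step.

```python
# Standalone RANSAC sketch: fit a planar homography to synthetic point
# correspondences with gross outliers mixed in. RANSAC repeatedly fits the
# model to random minimal subsets and keeps the hypothesis with the most
# agreeing points (inliers).
import cv2
import numpy as np

rng = np.random.default_rng(0)
pts1 = rng.uniform(0, 640, size=(100, 2)).astype(np.float32)
pts2 = pts1 + np.float32([5.0, -3.0])                               # true model: pure translation
pts2[:25] += rng.uniform(-80, 80, size=(25, 2)).astype(np.float32)  # corrupt 25% of the matches

H, inlier_mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 3.0)    # 3 px reprojection threshold
print(f"inliers kept: {int(inlier_mask.sum())} of {len(pts1)}")
```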
Feature and Data Association
In this step, we take the detected features and their estimated locations in space to create a map of these features. As the process progresses over subsequent frames, the system associates new features with known elements of the map and discards uncertain features.
As the camera's motion is tracked over subsequent frames, predictions can be made based on the known features and how they are expected to change with motion. However, computational resources and time constraints, especially in real-time applications, impose limitations on SLAM. As the system collects more measurements of features and updates the location/pose over time, it becomes essential to constrain and optimize the representation of the environment.
Location, Pose, and Map Update
Kalman Filter
As the SLAM process proceeds, noise accumulates and uncertainty develops between the images captured by the camera and the associated motion. By continuously generating predictions and updating and refining the model against the observed measurements, a Kalman filter reduces the effect of noise and uncertainty across the different measurements. The standard Kalman filter, however, assumes a linear system model, so practical SLAM implementations use the Extended Kalman Filter (EKF), which handles nonlinear systems by linearizing the prediction and measurement models around their mean. The EKF can also integrate data from multiple sensors (e.g., cameras, IMUs) through sensor fusion to improve the accuracy of the state and map estimates, which helps achieve more reliable SLAM results. The state vector in EKF-based SLAM includes the robot's pose (position and orientation) and the positions of the landmarks in the map.
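The sketch below shows a stripped-down EKF predict/update cycle for a planar robot pose observing a single landmark at an assumed known position via a range-and-bearing measurement. A full EKF-SLAM state vector would also stack the landmark positions, but the structure of the two steps is the same. All noise values and measurements are hypothetical.

```python
# Stripped-down EKF sketch for a planar robot pose (x, y, theta).
import numpy as np

def predict(x, P, v, w, dt, Q):
    """Propagate the state with a unicycle motion model and inflate covariance."""
    theta = x[2]
    x_pred = x + np.array([v * np.cos(theta) * dt,
                           v * np.sin(theta) * dt,
                           w * dt])
    F = np.array([[1, 0, -v * np.sin(theta) * dt],
                  [0, 1,  v * np.cos(theta) * dt],
                  [0, 0, 1]])                     # Jacobian of the motion model
    return x_pred, F @ P @ F.T + Q

def update(x, P, z, landmark, R):
    """Correct the prediction with a range/bearing measurement of a landmark."""
    dx, dy = landmark[0] - x[0], landmark[1] - x[1]
    q = dx**2 + dy**2
    z_hat = np.array([np.sqrt(q), np.arctan2(dy, dx) - x[2]])    # predicted measurement
    H = np.array([[-dx / np.sqrt(q), -dy / np.sqrt(q),  0],
                  [ dy / q,          -dx / q,          -1]])      # measurement Jacobian
    y = z - z_hat
    y[1] = np.arctan2(np.sin(y[1]), np.cos(y[1]))                 # wrap bearing residual
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                                # Kalman gain
    return x + K @ y, (np.eye(3) - K @ H) @ P

x, P = np.zeros(3), np.eye(3) * 0.1
Q, R = np.eye(3) * 0.01, np.diag([0.1, 0.02])
x, P = predict(x, P, v=1.0, w=0.1, dt=0.1, Q=Q)
x, P = update(x, P, z=np.array([5.0, 0.3]), landmark=np.array([5.0, 1.0]), R=R)
print(x)   # corrected pose estimate
```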
Keyframe Selection
Keyframes are chosen from among the captured images to avoid the heavy computation required to process every frame. Instead, we select frames that serve as a good representation of the environment and use only those for computation. This method is again a tradeoff between accuracy and efficiency.
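One simple, hypothetical way to implement this tradeoff is a threshold-based rule like the sketch below: a frame is promoted to a keyframe when the robot has moved or rotated far enough since the last keyframe, or when too few of the last keyframe's features are still being tracked. The thresholds are illustrative only.

```python
# Sketch of a simple keyframe-selection heuristic (illustrative thresholds).
import numpy as np

def is_new_keyframe(pose, tracked_ratio, last_kf_pose,
                    min_translation=0.2, min_rotation=0.15, min_overlap=0.6):
    """pose / last_kf_pose: (x, y, theta); tracked_ratio: share of the last
    keyframe's features still matched in the current frame."""
    translation = np.linalg.norm(np.asarray(pose[:2]) - np.asarray(last_kf_pose[:2]))
    rotation = abs(pose[2] - last_kf_pose[2])
    return (translation > min_translation or
            rotation > min_rotation or
            tracked_ratio < min_overlap)

print(is_new_keyframe((0.3, 0.0, 0.0), tracked_ratio=0.8, last_kf_pose=(0.0, 0.0, 0.0)))
```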
Error Correction Through Loop Closure and Relocalization
As we progress in building a model of the environment, measurement errors and sensor drift gradually accumulate, and this affects the generated map. This can be mitigated to an extent through loop closure. Loop closure happens when the system identifies that it is revisiting a previously mapped area. By re-aligning the current map with the previously established map of the same area, the SLAM system can correct the accumulated error.
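As a rough illustration, the sketch below checks for a possible revisit by matching the current frame's ORB descriptors against those stored for earlier keyframes. Production systems typically use a bag-of-words vocabulary (e.g., DBoW2) to make this search fast, but the underlying idea is the same. The function name and threshold are hypothetical.

```python
# Naive loop-closure candidate check: brute-force descriptor matching
# against stored keyframes, flagging a revisit when enough descriptors agree.
import cv2

def looks_like_revisit(des_current, keyframe_descriptors, min_matches=50):
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    for kf_id, des_kf in keyframe_descriptors.items():
        matches = matcher.match(des_current, des_kf)
        if len(matches) >= min_matches:
            return kf_id        # candidate loop closure against this keyframe
    return None
```

In a real pipeline, des_current would come from the current frame's detector and keyframe_descriptors from the descriptors stored alongside each keyframe; a geometric check (e.g., RANSAC) would then verify the candidate before the map is corrected.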
Relocalization
Relocalization is needed when the system loses track of its position and orientation (pose). The pose must then be re-estimated using the currently observable features. Once the system successfully matches these features against the existing map, the SLAM process can proceed as usual.
Conclusion
SLAM is the process of estimating a robotic vehicle's position while mapping an unknown area. SLAM techniques include filter-based, graph-based, and deep learning-based (using neural networks) methods. Visual SLAM employs cameras to capture images, extract features, match them, and filter out incorrect matches. The system creates a map by associating new features with known elements, updating the robot's location and pose using Kalman filters, selecting keyframes, and correcting errors through loop closure and relocalization.
References
- SLAM Overview: From Single Sensor to Heterogeneous Fusion, Remote Sensing (mdpi.com)
- Understanding how V-SLAM (Visual SLAM) works, Kudan (kudan.io)
- Feature-based visual simultaneous localization and mapping: a survey, Discover Applied Sciences (springer.com)
- Introduction to Visual SLAM, Chapter 1: Introduction to SLAM, Daniel Casado (medium.com)
- An Introduction to Key Algorithms Used in SLAM, Technical Articles (control.com)
- What Is SLAM (Simultaneous Localization and Mapping), MATLAB & Simulink (mathworks.com)
- A survey of state-of-the-art on visual SLAM (sciencedirect.com)

Have questions or comments? Continue the conversation on TechForum, DigiKey's online community and technical resource.