Demystifying Visual Odometry

The Path to Precise Indoor Robot Navigation

Welcome back to the World of Indoor Robots!

In our earlier blog (find it here), we delved into the fascinating technologies powering self-navigating indoor robots, focusing on the intricacies of SLAM (Simultaneous Localization and Mapping) and its variants. In this article, we will simplify the essence of Visual Odometry, exploring its role in path navigation and the challenges it faces in unpredictable indoor environments. Future posts will delve into smart methods to conquer these hurdles, building a robust, real-time solution for smooth indoor robot navigation!

Self Navigation and Odometry

Self-navigation refers to the ability of a robot or autonomous system to move and navigate through its environment without direct human intervention. In other words, a self-navigating robot can autonomously plan and execute its path, make decisions about where to go, and avoid obstacles without relying on continuous human input or remote control. Self-navigation involves a combination of sensing, perception, decision-making, and motion control capabilities. The robot uses various sensors, such as wheel encoders, cameras, lidars, or sonars, to perceive its surroundings and create a representation of the environment. Based on this perception, the robot plans a path to reach its desired destination while avoiding obstacles and adhering to safety constraints.

Odometry is a technique used in robotics and mobile systems to estimate the position and orientation (pose) of a moving entity, such as a robot or a vehicle, from the measured motion of its wheels or other motion sensors. It works by integrating the incremental motion (displacements) reported by sensors such as wheel encoders or inertial measurement units (IMUs) to track how the pose changes over time.

Wheel Encoders

Wheel encoders are sensors commonly used in robotics and vehicles to measure the rotation of wheels. They work on the principle of detecting the rotation of the wheel and converting this information into digital signals that can be processed by a control system. There are two main types of wheel encoders - optical encoders that use a patterned disk and light to detect changes, and magnetic encoders that use magnets and a magnetic sensor. 

Optical encoders use an infrared light-emitting diode (LED) and a photodetector to detect the wheel's rotation. The wheel is equipped with a patterned disk, known as the encoder disk, that has alternating transparent and opaque sections. As the wheel rotates, the encoder disk rotates with it, so the light alternately passes through the transparent sections and is blocked by the opaque ones. The photodetector detects these changes in light intensity and generates electrical pulses that correspond to the wheel's rotation.

Magnetic encoders use a magnetic sensor to detect changes in the magnetic field as the wheel rotates. The wheel is equipped with a magnetized disk or a series of magnets attached to its surface. As the wheel rotates, the magnetic field around the wheel changes, and the magnetic sensor detects these variations, producing electrical pulses that represent the wheel's rotation.

Each pulse corresponds to a fixed fraction of a wheel revolution, so the number of pulses generated by the wheel encoder is proportional to how far the wheel has rotated. By counting these pulses and scaling by the wheel's circumference, the control system can calculate the distance the wheel has traveled and estimate the robot or vehicle's position and movement.
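To make the pulse-counting idea concrete, here is a minimal sketch of encoder-based dead reckoning. It assumes a differential-drive robot (two independently driven wheels), which the discussion above does not require; the function names and the parameters `ticks_per_rev`, `wheel_radius_m`, and `wheel_base_m` are hypothetical and chosen only for illustration.

```python
import math


def ticks_to_distance(ticks, ticks_per_rev, wheel_radius_m):
    """Distance rolled by one wheel for a given number of encoder ticks."""
    revolutions = ticks / ticks_per_rev
    return 2.0 * math.pi * wheel_radius_m * revolutions


def update_pose(pose, left_ticks, right_ticks,
                ticks_per_rev, wheel_radius_m, wheel_base_m):
    """One dead-reckoning step for an assumed differential-drive robot.

    pose is (x, y, theta) in metres and radians. The update uses the
    common midpoint approximation: the robot is treated as moving in a
    straight line along its average heading during the step.
    """
    x, y, theta = pose
    d_left = ticks_to_distance(left_ticks, ticks_per_rev, wheel_radius_m)
    d_right = ticks_to_distance(right_ticks, ticks_per_rev, wheel_radius_m)
    d_center = (d_left + d_right) / 2.0          # forward travel
    d_theta = (d_right - d_left) / wheel_base_m  # change in heading
    x += d_center * math.cos(theta + d_theta / 2.0)
    y += d_center * math.sin(theta + d_theta / 2.0)
    return (x, y, theta + d_theta)


# Illustrative numbers only: 360 ticks per revolution, 5 cm wheels, 20 cm apart.
pose = (0.0, 0.0, 0.0)
pose = update_pose(pose, 360, 360, 360, 0.05, 0.2)  # both wheels: one full turn
print(pose)
```

Because each step adds onto the previous pose, any error in a single step (for example, from wheel slip) is carried into every later estimate.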

Wheel encoders provide crucial feedback for motion control and odometry calculations, enabling precise control of the robot's movement and navigation in various environments. However, they may be subject to some errors due to factors like slippage, uneven surfaces, or sensor noise, which may need to be considered and compensated for in the system design.

Inertial Measurement Units (IMU)

IMUs (Inertial Measurement Units) are electronic devices that combine multiple sensors to measure and provide information about an object's orientation, velocity, and acceleration in three-dimensional space. IMUs typically consist of three primary types of sensors: accelerometers, which measure linear acceleration along the X, Y, and Z axes; gyroscopes, which measure the rate of angular rotation about the X, Y, and Z axes; and magnetometers, which measure the strength and direction of the magnetic field to determine the object's orientation with respect to the Earth's magnetic field.

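By combining the data from these sensors, IMUs can track the motion and orientation of objects in real-time. However, because orientation and velocity are obtained by integrating the gyroscope and accelerometer readings, small sensor biases and noise add up, so IMU estimates drift over time.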
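A small numerical sketch shows how quickly this kind of error can build up. It integrates the readings of a gyroscope on a robot that is standing still, assuming a constant bias of 0.01 rad/s sampled at 100 Hz; the bias value, sample rate, and function name are illustrative assumptions, not measurements from a real device.

```python
def integrate_gyro(rates_rad_s, dt_s, initial_heading_rad=0.0):
    """Estimate heading by summing angular-rate samples multiplied by dt."""
    heading = initial_heading_rad
    for rate in rates_rad_s:
        heading += rate * dt_s
    return heading


# The robot is stationary, so the true rate is 0 rad/s. The gyro reports
# only its (assumed) constant bias for 60 seconds at 100 Hz.
bias_rad_s = 0.01
samples = [bias_rad_s] * 6000
heading_error = integrate_gyro(samples, dt_s=0.01)
print(heading_error)
```

Under these assumptions the estimated heading is off by roughly 0.6 rad (about 34 degrees) after one minute, even though the robot never moved.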

Thus, although odometry provides real-time estimates of the entity's trajectory, it accumulates errors over time. Hence, it is typically used as a starting point, or in conjunction with other sensors such as cameras, to provide more accurate and robust motion and orientation estimation. This fusion of multiple sensor inputs is a common approach in systems like visual-inertial odometry and SLAM, which leverage the strengths of different sensors to achieve more reliable localization and mapping results.

Visual Odometry: Tracking Motion through the Eyes of the Camera

Visual odometry is a technique used in robotics and computer vision to estimate the motion and position of a moving camera or robot by analyzing sequences of images taken from the camera. It relies on the visual features detected in the images, such as key points or landmarks, to track the camera's movement over time. By comparing the visual features between consecutive frames, visual odometry algorithms can calculate the camera's relative displacement and estimate its trajectory.

Visual odometry is particularly valuable in situations where wheel encoders or other motion sensors may not be reliable, such as in rough terrains or environments with limited sensor information.

Visual odometry involves the steps described below.

Feature Extraction

In this step, distinctive visual features are detected and extracted from consecutive images captured by the camera. These features can be keypoints, such as corners, edges, or blobs, that have distinct patterns and can be easily identified in different images. Popular feature extraction techniques include Harris corner detection, SIFT (Scale-Invariant Feature Transform), ORB (Oriented FAST and Rotated BRIEF), and AKAZE (Accelerated-KAZE).
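As a rough illustration of what a corner detector computes, the sketch below implements a bare-bones Harris response in plain Python on a tiny synthetic image. It is a toy under simplifying assumptions (central-difference gradients, an unweighted 3x3 window, no non-maximum suppression), and the function names are hypothetical; a real pipeline would normally rely on an optimized library implementation rather than code like this.

```python
def harris_response(img, k=0.04):
    """Harris corner response for a grayscale image given as a list of rows.

    Uses central-difference gradients and an unweighted 3x3 window.
    Pixels too close to the border keep a response of 0.
    """
    h, w = len(img), len(img[0])
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0

    resp = [[0.0] * w for _ in range(h)]
    for y in range(2, h - 2):
        for x in range(2, w - 2):
            sxx = syy = sxy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    gx = ix[y + dy][x + dx]
                    gy = iy[y + dy][x + dx]
                    sxx += gx * gx
                    syy += gy * gy
                    sxy += gx * gy
            # det(M) - k * trace(M)^2 for the 2x2 structure tensor M
            resp[y][x] = sxx * syy - sxy * sxy - k * (sxx + syy) ** 2
    return resp


def strongest_corner(resp):
    """Return (row, col) of the pixel with the largest response."""
    best = max((value, y, x)
               for y, row in enumerate(resp)
               for x, value in enumerate(row))
    return best[1], best[2]


# Synthetic 12x12 image: a bright block whose only corner sits at row 4, col 4.
image = [[1.0 if (y >= 4 and x >= 4) else 0.0 for x in range(12)]
         for y in range(12)]
response = harris_response(image)
print(strongest_corner(response))
```

The response is positive where the image changes in two directions (a corner), negative along a straight edge, and zero on flat regions, which is why corners make good features and plain walls do not.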

Feature Matching

Once the features are extracted from two consecutive images, the next step is to match the corresponding features between the frames. Feature matching aims to find the same visual feature points in both images. This is often achieved using algorithms like the nearest neighbor search, where the distances between feature descriptors are computed to find the best matches. Some matching methods may include additional filtering or robust estimation to handle outliers and false matches.
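The nearest neighbor search described above can be sketched in a few lines. This example assumes binary descriptors (as produced by detectors such as ORB) stored as Python integers and compared with the Hamming distance, and it uses a ratio test as one possible filter for ambiguous matches; the function names and the 0.75 threshold are illustrative assumptions.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


def match_features(desc_a, desc_b, ratio=0.75):
    """Brute-force nearest-neighbor matching with a ratio test.

    Returns (index_in_a, index_in_b, distance) tuples. A match is kept
    only if the best candidate is clearly closer than the second best.
    """
    matches = []
    for i, da in enumerate(desc_a):
        ranked = sorted((hamming(da, db), j) for j, db in enumerate(desc_b))
        if not ranked:
            continue
        best_dist, best_j = ranked[0]
        if len(ranked) > 1 and best_dist >= ratio * ranked[1][0]:
            continue  # ambiguous: two candidates are about equally close
        matches.append((i, best_j, best_dist))
    return matches


# Toy 8-bit descriptors; real binary descriptors are much longer.
frame1 = [0b11110000, 0b00001111, 0b10101010]
frame2 = [0b00001111, 0b11110001, 0b01100110, 0b01101001]
print(match_features(frame1, frame2))
```

In this toy data the third descriptor of the first frame is equally far from several candidates, so the ratio test discards it instead of guessing. Wrong matches that survive this stage are left to the robust estimation step.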

Motion Estimation

With the corresponding feature matches between the images, the motion estimation process begins. The goal is to calculate the relative camera motion between the two frames. This motion is typically represented as the camera's translation (change in position) and rotation (change in orientation) relative to the previous frame. The motion can be estimated with geometric solvers such as the 5-point or 8-point algorithms, usually run inside a RANSAC (Random Sample Consensus) loop to handle outliers and improve accuracy.
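The RANSAC idea is easier to see on a simplified problem. The sketch below is a 2D stand-in, not the 5-point or 8-point algorithm: it estimates a planar rigid motion (one rotation angle plus a translation) from matched 2D points, using a minimal two-point solver inside a RANSAC loop. In a real visual odometry system the same loop would wrap an essential-matrix solver. All names, the tolerance, and the iteration count are illustrative assumptions.

```python
import math
import random


def apply_rigid_2d(model, p):
    """Rotate point p by model's angle, then translate."""
    ang, tx, ty = model
    c, s = math.cos(ang), math.sin(ang)
    return (c * p[0] - s * p[1] + tx, s * p[0] + c * p[1] + ty)


def fit_rigid_2d(p1, p2, q1, q2):
    """Minimal solver: rigid motion suggested by two correspondences."""
    ang = (math.atan2(q2[1] - q1[1], q2[0] - q1[0])
           - math.atan2(p2[1] - p1[1], p2[0] - p1[0]))
    ang = math.atan2(math.sin(ang), math.cos(ang))  # wrap to (-pi, pi]
    c, s = math.cos(ang), math.sin(ang)
    tx = q1[0] - (c * p1[0] - s * p1[1])
    ty = q1[1] - (s * p1[0] + c * p1[1])
    return (ang, tx, ty)


def ransac_rigid_2d(src, dst, iters=200, tol=0.05, seed=0):
    """Keep the model that agrees with the largest number of matches."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    n = len(src)
    for _ in range(iters):
        i, j = rng.sample(range(n), 2)  # random minimal sample
        model = fit_rigid_2d(src[i], src[j], dst[i], dst[j])
        inliers = [k for k in range(n)
                   if math.dist(apply_rigid_2d(model, src[k]), dst[k]) < tol]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers


# Synthetic matches: a known motion plus two deliberately wrong matches.
true_motion = (0.3, 1.0, -0.5)
src = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0.5), (0.5, 2), (2, 2), (-1, 1)]
dst = [apply_rigid_2d(true_motion, p) for p in src]
dst[3] = (5.0, 5.0)   # false match
dst[6] = (-3.0, 4.0)  # false match
model, inliers = ransac_rigid_2d(src, dst)
print(model, inliers)
```

Samples that happen to include a false match produce a motion that few other matches agree with, so they lose to a sample drawn entirely from correct matches. The false matches are then simply excluded from the inlier set.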

Trajectory Estimation

To obtain the camera's trajectory over time, the relative motions estimated from successive image pairs are chained together. Each new relative motion is composed with the current pose, so the camera's position and orientation are updated frame by frame and the visual odometry system can infer the camera's path through the environment. This provides a relative trajectory that indicates how the camera has moved from its starting position, but it does not provide an absolute position without additional techniques like loop closure and global localization.
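The integration step can be sketched as repeated pose composition. For readability this example works in 2D, with a pose written as (x, y, heading), whereas a real system would compose 3D rotations and translations; the function names and the 0.02 rad error are illustrative assumptions.

```python
import math


def compose(pose, delta):
    """Apply a relative motion, expressed in the current frame, to a pose."""
    x, y, theta = pose
    dx, dy, dtheta = delta
    return (x + dx * math.cos(theta) - dy * math.sin(theta),
            y + dx * math.sin(theta) + dy * math.cos(theta),
            theta + dtheta)


def integrate(relative_motions, start=(0.0, 0.0, 0.0)):
    """Chain relative motions into a trajectory of poses."""
    trajectory = [start]
    for delta in relative_motions:
        trajectory.append(compose(trajectory[-1], delta))
    return trajectory


# Drive around a 1 m square: move 1 m forward, turn 90 degrees, four times.
ideal = [(1.0, 0.0, math.pi / 2)] * 4
# The same path, but every rotation estimate is off by an assumed 0.02 rad.
biased = [(1.0, 0.0, math.pi / 2 + 0.02)] * 4

end_ideal = integrate(ideal)[-1]
end_biased = integrate(biased)[-1]
print(math.hypot(end_ideal[0], end_ideal[1]))    # distance from the start
print(math.hypot(end_biased[0], end_biased[1]))  # distance from the start
```

With perfect relative motions the path closes on itself. With a small error in every step the end point no longer coincides with the start, and nothing in the chain can notice or correct that without extra information.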

Visual Odometry in Miko

The schematic below outlines the various modules that comprise the visual odometry (VO) system developed at Miko.

Testing our pure visual odometry (VO) system on sequence V2_01 of the well-known EuRoC dataset reveals that the estimated trajectories tend to drift away from the ground truth, as seen in the plot below.

This outcome is anticipated, given the inherent limitations of a traditional Visual Odometry (VO) system. We will explore these limitations in detail in the following section.

Challenges of Visual Odometry

Visual odometry, while a powerful technique, comes with several pitfalls and challenges that can affect its accuracy and robustness. In the context of companion robots, additional challenges are introduced due to the indoor settings (please refer to our previous blog for challenges due to indoor settings). We summarize some of the key challenges of visual odometry below.

Feature Ambiguity

In home environments or scenes with repetitive patterns or low-textured, homogeneous surfaces (e.g., plain walls), it can be challenging for visual odometry algorithms to extract distinctive and unique features. This feature ambiguity may lead to incorrect feature matching and, consequently, erroneous motion estimation.


Occlusions

When objects or obstacles, such as pieces of furniture, partially or completely occlude the visual features, the ability to match features between consecutive frames is hindered. Occlusions can cause gaps in the motion estimation, leading to incomplete or incorrect trajectories.

Motion Blur and Fast Movement

Rapid camera motion indoors can lead to blurring of images. This can make it difficult to accurately detect and match features, which can result in motion estimates with high uncertainty and increased error accumulation.

Lighting Variations

Changes in indoor lighting conditions, such as shadows, glare, or varying illumination, can affect the appearance of visual features in images, leading to difficulties in feature extraction and matching.

Limited Field of View 

Indoor cameras may have a restricted field of view, limiting the amount of information available for motion estimation and leading to drift.

Scale Ambiguity

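With a single camera, images alone do not reveal the absolute scale of the scene: a small motion close to nearby objects can look the same as a large motion far from distant ones. This scale ambiguity means that distances and the size of the trajectory can only be estimated up to an unknown factor unless extra information, such as a second camera, a depth sensor, or an IMU, is available.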
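For readers who want to see where the ambiguity comes from, here is a short derivation for the two-view case. The notation is our own and not part of the system described in this post: x_1 and x_2 are the normalized homogeneous image coordinates of the same scene point in two frames, R and t are the relative rotation and translation between the frames, and [t]_x is the skew-symmetric cross-product matrix of t.

```latex
% Epipolar constraint linking a matched point pair to the camera motion
\[
  x_2^{\top} E \, x_1 = 0, \qquad E = [t]_{\times} R .
\]
% Scaling the translation by any non-zero factor \lambda leaves it satisfied
\[
  x_2^{\top} \, [\lambda t]_{\times} R \, x_1
  \;=\; \lambda \, \bigl( x_2^{\top} [t]_{\times} R \, x_1 \bigr)
  \;=\; 0 .
\]
```

Since every matched point pair satisfies the constraint equally well for t and for any scaled copy of t, the matches determine only the direction of the translation, not its length.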

Drifting and Error Accumulation

Visual odometry is a relative localization technique, which means that errors in motion estimation can accumulate over time. This drift can result in the trajectory gradually deviating from the true path.

Lack of Loop Closure

Visual odometry typically provides a relative trajectory but lacks absolute localization information. Without loop closure techniques, which identify previously visited locations, accumulated errors cannot be corrected, leading to a less accurate map.

Real-Time Performance

Running visual odometry algorithms in real-time can be computationally intensive, especially when dealing with high-resolution images or complex scenes. This can limit its use in resource-constrained systems.

Addressing these pitfalls often requires integrating visual odometry with other localization and mapping techniques, such as loop closure detection and SLAM (Simultaneous Localization and Mapping), to improve accuracy, handle drift, and build more reliable maps of the environment. Additionally, robust feature extraction and matching algorithms, as well as sensor fusion with other sensors like IMUs or depth sensors, can help overcome some of these challenges and enhance the overall performance of visual odometry. 

Thank you for joining us in exploring the fundamentals of self-navigation with visual odometry. In our upcoming blogs, we will examine strategies to address the above challenges of visual odometry, leading to a robust and effective solution for self-navigating companion robots!
