Imperial College, London
Information Systems Engineering Year 2:
SURPRISE 1997

Oliver Henlich
27 May 1997


  • Introduction
  • Vision Sensor (Camera) Model and Localisation
  • Landmark-Based Positioning
  • Two-Dimensional Positioning Using a Single Camera
  • Two-Dimensional Positioning Using Stereo Cameras
  • Model-Based Approaches
  • Three-Dimensional Geometric Model-Based Positioning
  • Digital Elevation Map-Based Localisation
  • Feature-Based Visual Map Building
  • Conclusion
  • Relative and Absolute Positioning
  • Relative Position Measurements
  • Absolute Position Measurements
  • Summary of some of the current vision-based navigation systems (on the Local level)
  • Bibliography

    Introduction

    Vision-based positioning or localisation uses the same basic principles of landmark-based and map-based positioning but relies on optical sensors rather than ultrasound, dead-reckoning and inertial sensors. The advantage of such range sensors lies in their ability to provide directly the distance information needed for collision avoidance, but they have the important drawback that only vertical structures (i.e. mainly the shape of the free space surrounding the robot) can be recognised.

    Real-world applications envisaged in most current research projects, however, demand more detailed sensor information to provide the robot with better environment-interaction capabilities. Visual sensing can provide the robot with a vast amount of information about its environment; visual sensors are potentially the most powerful source of information among all the sensors used on robots to date. Hence, at present, high-resolution optical sensors seem to hold the greatest promise for mobile robot positioning and navigation.

    The most common optical sensors include laser-based range finders and photometric cameras using CCD arrays.

    However, due to the volume of information they provide, extraction of visual features for positioning is far from straightforward. Many techniques have been suggested for localisation using vision information; the main considerations common to them are outlined below.

    Most current localisation techniques provide relative or absolute position and/or orientation of the sensors (i.e. of the robot).

    The environment is perceived in the form of geometric information such as landmarks, object models and maps in two or three dimensions. Localisation then depends on two inter-related considerations: how the environment is represented, and how the sensed visual features are matched to that representation.

    When landmarks or maps are not available, landmark selection and map building should be part of a localisation method.

    Vision Sensor (Camera) Model and Localisation

    The most common model for photometric cameras is the pin-hole camera with perspective projection as shown in the figure below.

    Figure 1

    Photometric cameras using an optical lens can be modelled as a pin-hole camera. The co-ordinate system (X, Y, Z) is a three-dimensional camera co-ordinate system, and (x, y) is a sensor (image) co-ordinate system. A three-dimensional feature in an object is projected onto the image plane (x, y). The relationship for this projection is given by (where f is the focal length of the lens):

    Eq. 1:    x = f·X / Z ,    y = f·Y / Z

    The range information is lost in this projection, but the angle or orientation of the object point can be obtained if the focal length is known and the lens does not cause distortion.
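    The projection of Eq. 1, and the recovery of a ray direction from a known focal length, can be sketched as follows (the focal length and point values are purely illustrative):

```python
import math

def project(X, Y, Z, f):
    """Pin-hole perspective projection of a camera-frame point (X, Y, Z)
    onto the image plane (Eq. 1): x = f*X/Z, y = f*Y/Z."""
    if Z <= 0:
        raise ValueError("point must lie in front of the camera")
    return f * X / Z, f * Y / Z

def bearing(x, f):
    """Horizontal angle of the ray through image position x.
    Range is lost in the projection, but the direction of the
    object point is recoverable if f is known."""
    return math.atan2(x, f)

# A point 2 m ahead and 1 m to the right, with an 8 mm lens:
x, y = project(1.0, 0.0, 2.0, 0.008)   # x = 0.004, y = 0.0
theta = bearing(x, 0.008)              # about 26.6 degrees off-axis
```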

    The Intrinsic camera parameters include the effective focal length and the image scanning parameters and are used to estimate the physical size of the image plane.

    The six Extrinsic camera parameters are used to describe the orientation and position (three for each) of the camera co-ordinate system (X, Y, Z). They represent the relationship between the camera co-ordinates (X, Y, Z) and the real world or object co-ordinates (XW, YW, ZW). Landmarks and maps are usually represented in the real world co-ordinate system.

    The problem of localisation is to determine the position and orientation of a mobile robot by matching the sensed visual features in one or more image(s) to the object features provided by landmarks or maps. To obtain accurate estimates for position and orientation, multiple features are required. Depending on the type of sensors, sensing schemes, and representations of the environment, localisation techniques vary significantly.

    Landmark-Based Positioning

    The representation of the environment can be in the form of very simple features such as points and lines, more complex patterns, or three-dimensional models of objects in the environment. Here, the approaches based on simple landmark features are considered.

    Two-Dimensional Positioning Using a Single Camera

    Consider a camera mounted on a mobile robot such that its optical axis is parallel to the floor and vertical edges in the environment provide landmarks. Positioning then clearly becomes a two-dimensional problem. In other words, the vertical edges provide point features, and two-dimensional positioning requires identification of three unique features (refer to the Imperial College Beacon Navigation System for an example of positioning using three points).


    If it is possible to identify unique features and their positions then the position and orientation of the pin-hole camera can be determined as illustrated in the figure below.
    Figure 2

    However, it is not always easy (or possible) to uniquely identify simple features such as points and lines in an image.

    If there are only two landmark points, measuring the angle between the corresponding rays restricts the possible camera position to part of a circle, as shown in Figure 3.a.
    Figure 3

    Three landmark points uniquely determine the camera position which is one of the intersections of the two circles in Figure 3.b.

    The point location algorithm first establishes a correspondence between the three landmark points in the environment and three observed features in the image. Then, the algorithm measures the angles between rays.
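    The construction of Figure 3 can be sketched numerically: each measured angle between two landmark rays constrains the camera to a circle (by the inscribed-angle theorem), and intersecting two such circles, then checking the remaining angle, yields the position. A minimal sketch, with all function names hypothetical:

```python
import math
from itertools import product

def subtended_angle(p, a, b):
    """Angle at viewpoint p between the rays towards landmarks a and b."""
    v1 = (a[0] - p[0], a[1] - p[1])
    v2 = (b[0] - p[0], b[1] - p[1])
    c = (v1[0]*v2[0] + v1[1]*v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
    return math.acos(max(-1.0, min(1.0, c)))

def pair_circles(a, b, alpha):
    """The two circles of viewpoints from which segment ab subtends
    angle alpha (inscribed-angle theorem): R = |ab| / (2 sin alpha)."""
    L = math.dist(a, b)
    R = L / (2.0 * math.sin(alpha))
    mx, my = (a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0
    nx, ny = -(b[1] - a[1]) / L, (b[0] - a[0]) / L   # unit normal to ab
    h = R * math.cos(alpha)       # centre offset from the chord midpoint
    return [((mx + s*h*nx, my + s*h*ny), R) for s in (1.0, -1.0)]

def circle_intersections(c1, r1, c2, r2):
    """Intersection points of two circles (standard construction)."""
    d = math.dist(c1, c2)
    if d == 0 or d > r1 + r2 or d < abs(r1 - r2):
        return []
    a = (r1*r1 - r2*r2 + d*d) / (2.0 * d)
    h = math.sqrt(max(0.0, r1*r1 - a*a))
    ux, uy = (c2[0] - c1[0]) / d, (c2[1] - c1[1]) / d
    px, py = c1[0] + a*ux, c1[1] + a*uy
    return [(px - h*uy, py + h*ux), (px + h*uy, py - h*ux)]

def locate(A, B, C, alpha, beta, gamma):
    """Camera position from the three pairwise landmark angles: intersect
    the (A,B) and (B,C) circles, keeping the candidate that also
    reproduces the (A,C) angle."""
    best, best_err = None, float("inf")
    for (c1, r1), (c2, r2) in product(pair_circles(A, B, alpha),
                                      pair_circles(B, C, beta)):
        for p in circle_intersections(c1, r1, c2, r2):
            if min(math.dist(p, q) for q in (A, B, C)) < 1e-9:
                continue          # every circle passes through a landmark
            err = (abs(subtended_angle(p, A, B) - alpha)
                   + abs(subtended_angle(p, B, C) - beta)
                   + abs(subtended_angle(p, A, C) - gamma))
            if err < best_err:
                best, best_err = p, err
    return best
```

    With landmarks at (0, 0), (4, 0) and (0, 3) and a camera at (1, 1), feeding the three observed angles back into `locate` recovers the camera position.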

    Two-Dimensional Positioning Using Stereo Cameras

    Here a stereo pair of cameras is used to determine a correspondence between observed landmarks and a pre-loaded map, and to estimate the two-dimensional location of the sensor from that correspondence. Landmarks are usually derived from vertical edges. By using two cameras for stereo range imaging, the algorithm can determine the positions of observed points (rather than relying on ray angles as above).
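    For a rectified stereo pair, the position of a landmark follows from the disparity between its image positions in the two cameras. A minimal sketch, with illustrative focal length and baseline values:

```python
def stereo_point(x_left, x_right, f_px, baseline_m):
    """Two-dimensional position of a landmark (e.g. a vertical edge)
    from a rectified stereo pair. x_left and x_right are image columns
    measured from each camera's principal point, in pixels."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("landmark must be at finite positive depth")
    Z = f_px * baseline_m / disparity      # depth along the optical axis
    X = x_left * Z / f_px                  # lateral offset, left-camera frame
    return X, Z

# 700 px focal length, 12 cm baseline, edge seen at columns 350 and 308:
X, Z = stereo_point(350.0, 308.0, 700.0, 0.12)   # Z = 2.0 m, X = 1.0 m
```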

    Model-Based Approaches

    A priori information about an environment can be given in a more comprehensive form than simple features: for example, two-dimensional or three-dimensional models of environment structure, or digital elevation maps (DEM). The geometric models often include three-dimensional models of buildings, objects and floor maps. For localisation, the two-dimensional visual observations should capture the features of the environment that can be matched to the pre-loaded model with minimum uncertainty. The figure below illustrates this technique.
    Figure 4

    The main problem is that the two-dimensional observations and the three-dimensional world models are in different forms. This is essentially the general object-recognition problem of computer vision: reliable features must first be extracted from the image and then put into correspondence with the model.

    Three-Dimensional Geometric Model-Based Positioning

    A technique often used here is to match images to the map by first matching the two-dimensional projection of landmarks to lines extracted from the image. Matching is achieved by considering all possible sets of three landmarks. Once the correspondence between the model and the two-dimensional image is found, the relation of the robot to the real world co-ordinate system can be found. The relation is expressed as the Rotation and Translation that will match the robot- and world-systems.
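    The rotation-and-translation relation between the world and robot co-ordinate systems can be sketched as follows (two-dimensional case, hypothetical pose and landmark values):

```python
import math

def world_to_robot(p_world, heading, position):
    """Express a world-frame point in the robot's frame, given the
    robot's pose (heading in radians, position in world co-ordinates):
    p_robot = R(heading)^T * (p_world - position)."""
    dx = p_world[0] - position[0]
    dy = p_world[1] - position[1]
    c, s = math.cos(heading), math.sin(heading)
    # transpose (inverse) of the rotation matrix [[c, -s], [s, c]]
    return (c*dx + s*dy, -s*dx + c*dy)

# Robot at (2, 3) facing along the world y-axis (heading 90 degrees);
# a landmark at world (2, 4) is then 1 m straight ahead of the robot:
print(world_to_robot((2.0, 4.0), math.pi / 2, (2.0, 3.0)))  # ≈ (1.0, 0.0)
```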

    Another technique is to estimate the robot's position and heading using its other (conventional) sensors. The approximate position is then used to generate a two-dimensional scene from the stored three-dimensional world model. The features of this generated scene are then matched against those extracted from the observed image. Matching against a generated scene in this way speeds up the process of obtaining a position estimate.

    Digital Elevation Map-Based Localisation

    This is primarily useful for outdoor positioning. It essentially consists of a hierarchical system that compares features extracted from a visual scene to features extracted from a digital elevation map (DEM). Features such as peaks, saddles, junctions and endpoints can be identified and extracted from the observed scene. Similarly, features like contours and ridges are extracted from the DEM. The objective of the system is to match the features from the scene onto a location in the map.

    Figure 5

    In order to make the matching process simpler, configurations of distinctive and easily identifiable features are matched first. Using a group of features cuts down dramatically on the number of possible comparisons, and making use of rare, easily spotted features clearly leads to a more efficient match.
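    One simple way to match a configuration of features, sketched below under the assumption that features carry a type (peak, saddle, junction) and a position: candidate map subsets are pruned by feature type first, and the surviving configurations are compared through a rotation- and translation-invariant signature of pairwise distances. All names and values are hypothetical:

```python
import math
from itertools import combinations

def signature(points):
    """Rotation- and translation-invariant description of a feature
    configuration: its sorted pairwise distances."""
    return sorted(math.dist(p, q) for p, q in combinations(points, 2))

def match_configuration(scene_feats, map_feats, tol=0.1):
    """Locate an observed configuration in the map. Features are
    (type, (x, y)) pairs; type-based pruning cuts down the number of
    distance-signature comparisons."""
    types = sorted(t for t, _ in scene_feats)
    sig = signature([p for _, p in scene_feats])
    for combo in combinations(map_feats, len(scene_feats)):
        if sorted(t for t, _ in combo) != types:
            continue                      # cheap pruning by feature type
        cand = signature([p for _, p in combo])
        if all(abs(a - b) < tol for a, b in zip(sig, cand)):
            return combo
    return None

# A configuration of three features observed away from the map origin:
terrain_map = [("peak", (0.0, 0.0)), ("saddle", (5.0, 0.0)),
               ("peak", (0.0, 4.0)), ("junction", (9.0, 9.0)),
               ("peak", (20.0, 1.0))]
scene = [("peak", (10.0, 10.0)), ("saddle", (15.0, 10.0)),
         ("peak", (10.0, 14.0))]
match = match_configuration(scene, terrain_map)
```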

    Feature-Based Visual Map Building

    The positioning methods above use a priori information about the environment in the form of landmarks, object models or maps. These maps and absolute references for positions can be impractical since they only allow the robot to navigate in a known environment. When there is no a priori information, a robot can rely only on the information obtained by its sensors.

    For constructing the environment model, vision systems usually use image features detected at one or more robot positions. The object features detected in a sensor location become the relative reference for the subsequent sensor locations (ie. a form of dead-reckoning). When correspondences are correctly established, vision methods can provide higher accuracy in position estimation than odometry or inertial navigation systems. However, an important point to note is that odometry and inertial sensors can provide reliable position information up to a certain degree and this can be used to assist the establishment of correspondence by narrowing down the search space for feature matching. A map based on visual object features is clearly an inadequate description of environment structure.

    For mobile robots to be of more use in the future, they will need the ability to work in environments which are unknown at the design time of the robot and hence cannot be modelled in advance.

    A project which has attempted to implement a method for self-organising the visual perception of a mobile robot, adapting it to the surroundings without the need to define and model the relevant aspects of the environment in advance, seems to provide promising possibilities.

    The system is able to transform a continuous flow of images, by means of a self-organisation process, into a limited number of discrete perceptions which can be used for navigation purposes. Every image or scene is analysed to determine a set of features which characterise it. Defining a standard set of features which characterises a scene precisely enough, and which can at the same time be extracted from the image with reasonable effort in a natural (not specially prepared) environment, seems impossible. Instead, the various images are grouped into discrete perceptions by quantising the features of a scene. This is achieved by means of a Growing Neural Gas network (GNG network): the network selects certain features and groups them into classes which characterise the environment the robot is in.
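    The quantisation idea can be illustrated with a greatly simplified stand-in for a GNG network (the real algorithm also maintains edges between units and accumulated error): an online quantiser that grows a new prototype whenever an observation is far from every existing one, and otherwise nudges the nearest prototype towards the observation. All names and parameters are hypothetical:

```python
import math

class GrowingQuantiser:
    """Simplified stand-in for a Growing Neural Gas network: maps a
    stream of feature vectors onto a small set of discrete
    'perceptions' (prototype vectors)."""

    def __init__(self, radius, rate=0.2):
        self.protos = []        # learned perception prototypes
        self.radius = radius    # novelty threshold
        self.rate = rate        # adaptation rate for the winning prototype

    def observe(self, v):
        """Classify feature vector v, adapting or growing the prototype
        set; returns the index of the perception v was assigned to."""
        if not self.protos:
            self.protos.append(list(v))
            return 0
        i, d = min(((i, math.dist(p, v)) for i, p in enumerate(self.protos)),
                   key=lambda t: t[1])
        if d > self.radius:
            self.protos.append(list(v))     # novel scene: grow a prototype
            return len(self.protos) - 1
        self.protos[i] = [p + self.rate * (x - p)   # familiar: adapt winner
                          for p, x in zip(self.protos[i], v)]
        return i
```

    Feeding in feature vectors drawn from two well-separated scenes yields two stable perceptions, each new image being assigned to one of them.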

    This system provides a relatively new approach to vision-based navigation for mobile robots. It uses no models or a priori information about the environment and hence makes no restricting assumptions about its structure. This type of approach seems to hold the key to great possibilities for truly autonomous mobile robots of unusual flexibility.


    Conclusion

    Most of the material discussed describes techniques and methods that relate detected image features to object features in the environment. Although it seems a good idea to combine vision-based techniques with methods using dead-reckoning, inertial sensors, and ultrasonic and laser-based sensors, applications under realistic conditions are still scarce.

    Clearly, vision-based positioning is directly related to most computer-vision methods, especially object recognition; as research in this area progresses, the results can be applied to vision-based positioning. Other relevant areas include structure from stereo, motion and contour.

    Another approach not mentioned in the above discussion is the idea of global vision. This makes use of several cameras placed at fixed locations in the environment to extend the local sensing capabilities of a mobile robot. The cameras 'track' the robot (and objects moving in its environment) and relay this data to it, where it is analysed to provide appropriate position and navigation information. For a more detailed study of Global Multi-Perspective Perception for Autonomous Mobile Robots, please refer to Arun Katkere's report.

    Relative and Absolute Positioning

    Given all the research into the problem of mobile robot positioning, no truly elegant solution has emerged to date. The many partial solutions can usually be categorised into the following two groups:

    Relative Position Measurements

    Odometry: This method uses encoders to measure wheel rotation and/or steering orientation. It is always capable of providing the vehicle with an estimate of its position. 
    Inertial Navigation: This method uses gyroscopes to measure rate of rotation, and sometimes accelerometers to measure acceleration. 
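    The odometry method above can be sketched as the standard dead-reckoning pose update for a differential-drive vehicle, using hypothetical wheel-encoder increments:

```python
import math

def odometry_update(x, y, theta, d_left, d_right, wheel_base):
    """Dead-reckoning pose update from the distances travelled by the
    left and right wheels (as measured by the wheel encoders)."""
    d = (d_left + d_right) / 2.0             # distance moved by the centre
    dtheta = (d_right - d_left) / wheel_base # change in heading
    # first-order update: move along the old heading, then rotate
    x += d * math.cos(theta)
    y += d * math.sin(theta)
    return x, y, theta + dtheta

# A straight 1 m move, then an on-the-spot quarter turn (0.5 m wheel base):
pose = odometry_update(0.0, 0.0, 0.0, 1.0, 1.0, 0.5)
pose = odometry_update(*pose, -0.5 * math.pi / 4, 0.5 * math.pi / 4, 0.5)
# pose ≈ (1.0, 0.0, pi/2)
```

    Because each update is added to the last, encoder errors accumulate, which is why such relative measurements are usually combined with an absolute method from the next group.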

    Absolute Position Measurements

    Active Beacons: This method computes the absolute position of the robot by measuring the directions of incidence of three or more actively transmitted beacons, which must be located at known sites in the environment. 
    Artificial Landmarks: In this method distinctive artificial landmarks are placed at known locations in the environment. They are usually designed for optimal detectability even under adverse environmental conditions. 
    Natural Landmarks: Here the landmarks are distinctive features in the environment. There is no need to prepare the environment, but it has to be known in advance. 
    Model Matching: In this method information acquired from the robot's onboard sensors is compared to a map or world model of the environment. If the features from the sensor-based map and the world-model map match, the robot's absolute location can be estimated. 

    Summary of some of the current vision-based navigation systems (on the Local level)