Environmental Perception for Navigation

Learn Photo Editing

Learn Photo Editing

Get Instant Access

Traditionally, localisation has been identified as a principal perceptual component of the navigation system of a mobile robot [53]. This has driven continuous research and development on sensors providing direct localisation measurements.

There is a large variety of self-localisation solutions available [5] in the literature. However, in general they are characterised by a hard and limiting tradeoff between robustness and cost. As paradigmatic and extreme examples we can refer to solutions based on artificial landmarks (beacons) and those based on odometry. Solutions based on beacons are robust but expensive in terms of the materials, installation, maintenance or configuration to fit a specific new purpose. The solutions based on odometry are inexpensive, but since they rely on the integration of the robot's internal measurements, i.e. not grounded to the world, errors accumulate over time.

We use vision to sense the environment as it allows navigation to be regulated by the world. In particular, we have noted the advantages of omnidirectional vision for navigation, including its flexibility for building environmental representations. Our robot combines two main navigation modalities: Visual Path Following and Topological Navigation. In Visual Path Following, the short-distance / high-accuracy navigation modality, the orthographic view of the ground plane is a convenient world model as it makes simple representing / tracking ground plane features and computing the pose of the robot. Panoramic views are a complementary representation, which are useful in the identification and extraction of vertical line features. These types of views are easily obtained from omnidirectional cameras using image dewarpings.

In Topological Navigation, the large-distance low-precision navigation modality, omnidirectional images are used in their raw format to characterise the environment by its appearance. Omnidirectional images are advantageous as they are more robust to occlusions created e.g. by humans. Visual servoing is included in topological navigation as the means of providing local control.

This eliminates the need to built highly detailed environment representations, thus saving computational (memory) resources.

In summary, both Visual Path Following and Topological Navigation rely upon environmental perception (self-localisation) for regulating movement. The main point here is that perception is linked to internal representations of the world which are chosen according to the tasks at hand. We will now detail Geometrical Representations for precise self-localisation, necessary for Visual Path Following, and Topological Representations for global positioning leading, necessary for Topological Navigation.

3.1 Geometric Representations for Precise Self-Localisation

Robot navigation in cluttered or narrow areas, such as when negotiating a door traversal, requires precise self-localisation in order to be successful. In other words, the robot has to be equipped with precise environmental perception capabilities.

Vision-based self-localisation derives robot poses from images. It encompasses two principal stages: image processing and pose-computation. Image processing provides the tracking of features in the scene. Pose-computation is the geometrical calculation to determine the robot pose from feature observations, given the scene model.

Designing the image processing level involves modelling the environment. One way to inform a robot of an environment is to give it a CAD model, as in the work of Kosaka and Kak [52], recently reviewed in [24]. The CAD model usually comprises metric values that need to be scaled to match the images acquired by the robot. In our case, we overcome this need by defining geometric models composed of features of the environment directly extracted from images.

Omnidirectional cameras based on standard mirror profiles, image the environment features with significant geometrical distortion. For instance, a corridor appears as an image band of variable width and vertical lines are imaged radially. Omnidirectional images must therefore be dewarped in order to maintain the linearity of the imaged 3D straight lines.

Pose-computation, as the robot moves in a plane, consists of estimating a 2D position and an orientation. Assuming that the robot knows fixed points in the environment (landmarks) there are two main methods of self-localisation relative to the environment: trilateration and triangulation [5]. Trilateration is the determination of a vehicle's position based on distance measurements to the landmarks. Triangulation has a similar purpose but is based on bearing measurements.

In general, a single image taken by a calibrated camera provides only bearing measurements. Thus, triangulation is a more "natural" way to calculate self-localisation. However, some camera poses / geometries provide more information. For example, a bird's eye view image (detailed in the following subsection) provides an orthographic view of the ground plane, providing simultaneous observation of bearings and distances to floor landmarks. Given distances and bearings, the pose-computation is simplified to the calculation of a 2D rigid transformation.

The fact that the pose-computation is based on feature locations, implies that they contain errors, propagated from the feature tracking process. To overcome this, we propose a complimentary pose-computation optimisation step, based on a photometric criterium. We term this optimisation fine pose adjustment, as opposed to the pose-computation based on the features which is termed coarse pose computation. It is important to note that the pose-estimation based on features is important for providing an initial guess for the fine pose adjustment step.

Image Dewarpings for Scene Modelling

Images acquired with an omni-directional camera, e.g. based on a spherical or hyperbolic mirror, are naturally distorted. Knowing the image formation model, we can correct some distortions to obtain Panoramic or Bird's Eye Views.

The panoramic view groups together, in each scan line, the projections of all visible points, at a constant angle of elevation. The bird's eye view is a scaled orthographic projection of the ground plane. These views are advantageous e.g. for extracting and tracking vertical and ground plane lines.

Panoramic and Bird's Eye Views are directly obtained by designing custom shaped mirrors. An alternative approach, as described next, is to simply dewarp the omnidirectional images to the new views.

Panoramic View: 3D points at the same elevation angle from the axis of the catadioptric omnidirectional vision sensor, project to a 2D circle in the image. Therefore, the image dewarping is defined simply as a cartesian to polar coordinates change:

where (uo,vo) is the image centre, a and R are the angle and radial coordinates. The steps and range of a and R are chosen according to the resolution, and covering all the effective area, of the omnidirectional image. One rule for selecting the step of a is to make the number of columns of the panoramic image equal to the perimeter of the middle circle of the omnidirectional image. Hence inner circles are over-sampled and outer circles are sub-sampled. This rule gives a good tradeoff between data loss due to sub-sampling and memory consumption for storing the panoramic view.

Bird's Eye View: In general, 3D straight lines are imaged as curves in the omnidirectional image. For instance, the horizon line is imaged as a circle. Only 3D lines that belong to vertical planes containing camera and mirror axis project as straight (radial) lines.

In order to dewarp an omnidirectional image to a bird's eye view, notice that the azimuthal coordinate of a 3D point is not changed by the imaging geometry of the omnidirectional camera. Therefore, the dewarping of an omnidirectional image to a bird's eye view is a radial transformation. Hence, we can build a 1D look up table relating a number of points at different radial distances in the omnidirectional image and the respective real distances. The 1D look up table is the radial transformation to be performed for all directions on an omnidirectional image in order to obtain the bird's eye view.

However, the data for building the look up table is usually too sparse. In order to obtain a dense look up table we use the projection model of the omnidirectional camera. Firstly, we rewrite the projection operator, Pp in order to map radial distances, pgr0und measured on the ground plane, to radial distances, pimg, measured in the image:

Using this information, we build a look up table that maps densely sampled radial distances from the ground plane to the image coordinates. Since the inverse function cannot be expressed analytically, once we have an image point, we search the look up table to determine the corresponding radial distance on the ground plane.

Figure 6 illustrates the dewarpings of an omnidirectional image to obtain the Bird's Eye and Panoramic Views. Notice that the door frames are imaged as vertical lines in the Panoramic view and the corridor guidelines are imaged as straight lines in the Bird's Eye view, as desired.

As a final remark, notice that our process to obtain the look up table encoding the Bird's Eye View, is equivalent to performing calibration. However, for

Fig. 6. Image dewarping for bird's eye and panoramic views. (Top-left) original omnidirectional image, (top-right) bird's eye view and (bottom) panoramic view

our purposes a good dewarping is simply the one that makes straight lines on the ground plane appear straight in the Bird's Eye View.

As long as the mirror, camera and support (mobile platform) remain fixed to each other, the dewarpings for panoramic and bird's eye views are time invariant and can be programmed with 2D lookup tables. The dewarpings are done efficiently in this way.

Doing fixed image dewarpings is actually a way to do (or help) Scene Modelling. The image dewarpings make geometrical properties of the scene clearly evident and as such simplify scene modelling to collecting a number of features.

Geometric Scene Modelling and Model Tracking

Geometric models of the scene are collections of segments identified in Bird's Eye and Panoramic views5. Ground segments are rigidly interconnected in the Bird's Eye views while vertical segments will vary their locations according to the viewer location. Considering both types of segments, the models are "wire-frames" whose links change according to the viewpoint.

Each scene model must have a minimal number of features (line segments) in order to allow self-localisation. One line of the ground plane permits finding only the orientation of the robot and gives a single constraint on its localisation. Two concurrent ground lines, or one ground and one vertical, already determine robot position and orientation. Given three lines either all vertical, one on the ground, two on the ground (not parallel) or three on the ground (not all parallel), always permit us to compute the pose and therefore form valid models6.

Figure 7 shows one example of modelling the scene using line segments observed directly in the scene. The model is composed of three ground lines, two of which are corridor guidelines, and eight vertical segments essentially defined by the door frames. A single door frame (i.e. two vertical lines) and one corridor guideline would suffice but it is beneficial to take more lines than minimally required in order to improve the robustness of self-localisation.

In order to represent a certain scene area, and to meet visibility7 and quality criteria, a minimal number of segments are required. Models characterising different world regions are related by rigid 2D transformations. These transformations are firstly defined between every two neighbour models at locations where both models are (partially but with enough relevance) visible. Navigation is therefore possible in the area composed as a union of individual areas, provided by each individual model.

5 Despite the fact that localisation can be based on tracked image corners [74], more robust and stable results are obtained with line segments as noted for example by Spetsakis and Aloimonos in [78].

6 Assuming known the xy coordinates of the intersection of the vertical line(s) with the ground plane.

7 see Talluri and Aggarwal in [83] for a geometrical definition of visibility regions

Fig. 7. Geometric models for a door crossing experiment. The segments composing the model (bottom right) are illustrated in the panoramic and bird's eye view images, respectively (top and bottom left)

Assuming that the robot pose evolves smoothly over time, the model segments need to be detected only once - at the initialisation stage. From then on, we need only track them, which is much more efficient in computational terms. We track both edges lying on the ground plane and vertical edge segments, using respectively the bird's eye and panoramic views (details in [31]).

Pose Computation

The self-localisation procedure is based on the tracking of the geometric models. The tracking of the models requires rigidity of the world structure (but naturally not rigidity of the observed model segments themselves).

A simple method of calculating pose from the models arises when the segments of the model intersect at ground points (as in the model shown in Fig. 7). In this case, the model, despite encompassing ground and vertical segments, is simplified to the case of a set of ground points. This set of points moves rigidly in the Bird's Eye View, and therefore self-localisation is in essence the computation of the 2D transformation tracking the movement of the points. This method requires intersecting segments, which is similar to tracking corners but in a much more stable manner. This is especially true when dealing with long segments, as the noise in the orientation of small segments may become significant, affecting the computation of the intersections and the quality of corner estimates.

Alternatively, localisation is achieved through an optimisation procedure, namely minimizing the distance between model and observed line segments, directly at the pose parameters. Intuitively, the best pose estimate should align the scene model and the observed lines as well as possible. This is computationally more expensive, but more robust to direction errors on the observed line segments [34].

Defining pose as x = [x y 0] and the distance between the segments ab and cd as d (cd, ab) = f (c — a,b — a) + f (d — a,b — a) where a, b, c, d are the segment extremal points and f is the normalised internal product, f (v, v0) = |vT.vQ /|| vq ||, the problem of pose estimation based on the distance between model and observed segments can be expressed by the minimization of a cost functional:

i where si stands for observed vertical and ground line segments, and s0i indicates the model segments (known a priori). The minimization is performed with a generic gradient descent algorithm provided that the initialisation is close enough. For the initial guess of the pose there are also simple solutions such as using the pose at the previous time instant or, when available, an estimate provided by e.g. a 2D rigid transformation of ground points or by a triangulation method [4].

The self-localisation process as described by Eq. (17), relies exclusively on the observed segments, and looks for the best robot pose justifying those observations on the image plane. Despite the optimization performed for pose-computation, there are residual errors that result from the low-level image processing, segment tracking, and from the method itself. Some of these errors may be recovered through the global interpretation of the current image with the a priori geometric model. Since the model is composed of segments associated with image edges, we want to maximize the sum of gradients, VI at every point of the model wire-frame, {Pi}. Denoting the pose by x then the optimal pose x* is obtained as:

i where P is the projection operator and ^(x) represents the (matching) merit function. Given an initial solution to Eq. (17), the final solution can be found by a local search on the components of x.

Usually, there are model points that are non-visible during some time intervals while the robot moves. This is due, for example, to camera (platform) self-occlusion or to the finite dimensions of the image. In these cases, the merit matching merit function does not smoothly evolve with pose changes: it is maximized by considering the maximum number of points possible, instead of the true segment pose. Therefore, we include a smoothness prior to the function. One solution is to maintain the gradient values at control points of the model for the images when they are not visible.

Toward Robot Perception through Omnidirectional Vision 245 Visual Path Following

Visual Path Following can be described in a simple manner as a trajectory following problem, without having the trajectory explicitly represented in the scene. The trajectory is only a data structure learnt from example / experience or specified through a visual interface.

Visual Path Following combines the precise self-localisation (detailed in the preceding sections) with a controller that generates the control signals for moving the robot, such as that proposed by de Wit et al [21].

Experiments were conducted using an omnidirectional camera with a spherical mirror profile (shown in Fig. 3), mounted on a TRC labmate mobile robot. Figure 8 illustrates tracking and self-localization while traversing a door from the corridor into a room. The tracked features (shown as black circles) are defined by vertical and ground-plane segments, tracked in bird's eye view images.

Currently, the user initializes the relevant features to track. To detect the loss of tracking during operation, the process is continuously self-evaluated by the robot, based on gradient intensities obtained within specified areas around the landmark edges (Eq. 18). If these gradients decrease significantly compared to those expected, a recovery mechanism is launched.

Fig. 8. Feature tracking at three instants (a,b,c), scene-model tracking in the robot coordinate system (d) and the self-localisation result obtained by fixing the tracked scene-model (e)

The appropriate choice of the sensor and environmental representations, taking into account the task at hand, results in an efficient methodology that hardwires some tasks requiring precise navigation.

3.2 Topological Representations

A topological map is used to describe the robot's global environment and obtain its qualitative position when travelling long distances. It is represented as a graph: nodes in the graph correspond to landmarks, i.e. distinctive places such as corners. Links connect nodes and correspond to environmental structures that can be used to control the pose of the robot. In order to effectively use this graph the robot must be able to travel along a corridor, recognize the ends of a corridor, make turns, identify and count door frames. These behaviours are implemented through an appearance based system and a visual servoing strategy.

An appearance based system [62] is one in which a run-time image is compared to a database set for matching purposes. For example, in our corridor scene, the appearance based system provides qualitative estimates of robot position and recognizes distinctive places such as corner or door entrances.

Therefore, the topological map is simply a collection of inter-connected images. To go from one particular locale to another, we do not have to think in precise metric terms. For example, to move the robot from one corner to the opposite one we may indicate to the robot to follow one corridor until the first corner and then to follow the next corridor until the next corner, thus reaching the desired destination, or to complete more complex missions such as "go to the third office on the left-hand side of the second corridor

To control the robot's trajectory along a corridor, we detect the corridor guidelines and generate adequate control signals to keep the robot on the desired trajectory. This processing is performed on bird's eye views of the ground plane, computed in real-time.

When compared to geometric approaches, topological maps offer a parsimonious representation of the environment, are highly computationally efficient [85], scale easily and can explicitly represent uncertainties in the real world [7].

Image Eigenspaces as Topological Maps

In general, sizeable learning sets are required to map the environment and so matching using traditional techniques, such as correlation, would incur a very high computational cost. If one considers the images as points in space, it follows that they shall be scattered throughout this space, only if they differ significantly from one other. However, many real-world environments (offices, highways etc.) exhibit homogeneity of structure, leading to a large amount of redundant information within the image set. Consequently, the images are not scattered throughout a high dimensional space but - due to their similarity -lie in a lower dimensional subspace.

We implement dimensionality reduction using the classical procedure of Principal Component Analysis (PCA)8, as described by Murase and Nayar in [62], and detailed by Winters in [88] or Gaspar, Winters and Santos-Victor in [35]. Simply put, Principal Component Analysis reduces the dimensionality of a set of linearly independent input variables, while still accurately representing most of the original data. The reconstruction of this original data is optimal in the sense that the mean square error between it and the original data is minimized.

Imagine that we represent images as L-dimensional vectors in RL. Due to the similarity between images (data redundancy) these vectors will not span the entire space of RL but rather, they will be confined (or close, to be more precise) to a lower-dimensional subspace, RM where M << L. Hence, to save on computation, we can represent our images by their co-ordinates in such a lower-dimensional subspace, rather than using all of the pixel information. Each time a new image is acquired, its capture position can easily be determined by projecting it into the lower-dimensional subspace and finding its closest match from the a priori set of points (images).

A basis for such a linear subspace can be found through PCA, where the basis vectors are denominated Principal Components. They can be computed as the eigenvectors of the covariance matrix of the normalised set of images acquired by the robot. The number of eigenvectors that can be computed in such a way is the same as the number of images in the input data, and the eigenvectors are the same size as the images.

Each reference image is associated with a qualitative robot position (e.g. half way along the corridor). To find the robot position in the topological map, we have to determine the reference image that best matches the current view. The distance between the current view and the reference images can be computed directly using their projections (vectors) on the lower dimensional eigenspace. The distance is computed between M-dimensional coefficient vectors (typically 10 to 12), as opposed to image size vectors (128 x 128). The position of the robot is that associated with the reference image having the lowest distance.

When using intensity images, comparison of images is essentially a sum of squared differences of brightness (radiance) values and because of this the robot is prone to miscalculating its location where large non-uniform deviations in illumination occur. This can be overcome by using edge images, although these are not robust to edge-point position errors. The solution therefore, is to compare shapes instead of edge-points, or more specifically the (distance between shapes present in both images. There are several possible

8 It is sometimes known as the application of the Karhunen-Loeve transform [67, 86].

definitions of the distance between shapes. Two very well known are chamfer distance and the the Hausdorff distance.

Localisation Based on the Chamfer Distance

The chamfer distance is based on the correlation of a template edge-image with a distance transformed image. The distance transform of an edge-image is an image of the same size as the original, that indicates at each point the distance to the closest edge point [6, 36, 16].

The chamfer (distance transform9 is computed from an edge-image using the forward and backward masks shown in Fig. 9 [6, 36]. There are various possible values for the constants in the masks. We use the values according to Montanari's metric [16].

The constants shown in the masks are added to each of the local values and the resulting value of the mask computation is the minimum of the set. Both masks are applied along the rows of the initialised image.

Figure 10 shows the distance transform of the edges of an omnidirectional image. We remove the inner and outer parts of the omnidirectional image as they contain artifact edges, i.e. edges not related to the scene itself, created by the mirror rim and the robot plus camera self-occlusion.

Finally, given the distance transform, the chamfer distance of two shapes is then computed as the correlation:

+ 42





Was this article helpful?

0 0
Learn Photoshop Now

Learn Photoshop Now

This first volume will guide you through the basics of Photoshop. Well start at the beginning and slowly be working our way through to the more advanced stuff but dont worry its all aimed at the total newbie.

Get My Free Ebook

Post a comment