P. Remagnino (1), A. Baumberg (2), T. Grove (1), D. Hogg (2), T. Tan (1), A. Worrall (1), K. Baker (1)
(1) Department of Computer Science, The University of Reading, Berkshire RG6 6AY, UK
(2) School of Computer Studies, University of Leeds, LS2 9JT, UK
P.M.Remagnino@reading.ac.uk, email@example.com
The paper describes a novel integrated vision system in which two autonomous visual modules are combined to interpret a dynamic scene. The first module employs a 3D model-based scheme to track rigid objects such as vehicles. The second module uses a 2D deformable model to track non-rigid objects such as people. The principal contribution is a novel method for handling occlusion between objects within the context of this hybrid tracking system. The practical aim of the work is to derive a scene description that is sufficiently rich to be used in a range of surveillance tasks. The paper describes each of the modules in outline before detailing the method of integration and the handling of occlusion in particular. Experimental results are presented to illustrate the performance of the system in a dynamic outdoor scene involving cars and people.
The overall aim of the work described in this paper is to produce an autonomous vision system for performing a range of surveillance tasks in scenes containing both rigid objects (vehicles) and non-rigid objects (people). We describe a hybrid system for tracking objects in such scenes by combining two widely reported model-based tracking modules. The first module - developed as part of the European project VIEWS - uses 3D geometric models to track rigid objects, providing continually updated estimates of position and orientation. The second module uses a 2D statistical model to track non-rigid objects, providing an estimate of 2D position, either within the image plane or within a scene-based ground plane.
Within our intended application domain, objects are often heavily occluded by one another. For example, pedestrians walking through a car park may be only partly visible as they move between the lines of parked cars. Neither tracking module can perform reliably unless the effects of mutual occlusion are explicitly taken into account. Of particular interest is handling occlusions that occur between objects tracked by different modules. The main contribution of this paper is a novel method for dealing with such occlusions.
Before describing the integration of the two tracking modules and the handling of occlusion, we outline the individual tracking modules in sufficient detail to motivate and explain this work. Full details of the tracking modules can be found in the references.
There is a considerable body of literature dealing with object tracking using computer vision. The approaches can be divided broadly into those that use explicit models for the objects of interest (or their projections), and those that exploit only general properties of objects, such as spatial contiguity, surface smoothness and inertia. The advantage of the latter approach is that the range of objects that can be tracked is not restricted by the set of available models. However, the use of explicit object models can significantly improve the reliability and speed of tracking. The system described in the current paper employs model-based methods, although a pre-processing stage that finds moving objects operates without specific object models.
Many other groups have addressed the problem of tracking rigid objects such as vehicles. For example, Koller et al. describe a 2D feature-based system for tracking vehicles for the purpose of highway surveillance. More recently, there has been a resurgence of interest in tracking non-rigid objects. Typically, different approaches have been found to be appropriate within the two problem domains.
To our knowledge, the work described here is the first attempt to integrate specialised modules dealing with rigid and non-rigid motion into a single system and to handle the occlusions between such objects.
Tracking the objects in a scene is only part of the surveillance task. The interpretation of this motion in terms of events and actions remains to be carried out. For example, Buxton and Gong describe a system for the interpretation of motion using Bayesian belief nets. The need for a semantic layer as the integrating part of a scene description has been pointed out by Nagel.
Images are dealt with frame by frame, and moving objects are detected by a separate sub-system working outside the two tracking modules. When a new moving object is detected, the appropriate tracking module is invoked to follow the object. Section 2.1 describes this motion detection sub-system. Section 2.2 describes the vehicle tracking module and Section 2.3 describes the person tracking module. Integration details and the results of experiments are given in Sections 3 and 4. Conclusions and future work are described in the last section.
The next two sections introduce the motion detection and the tracking modules. The intention is to show how the two tracker modules perform independently.
Regions of significant motion that are not associated with already instantiated and tracked objects are examined. Each region represents a dynamic event detected by background subtraction. At present a region is assigned to the most suitable tracking module using the dimensions of the region: in this domain, large regions usually correspond to vehicles, and smaller, vertically elongated regions to pedestrians. Future implementations may use more complex schemes. For instance, tracking-module selection could be improved by allowing conflicting hypotheses and choosing between them at a later stage, or by using a common evaluation criterion based on the accumulation of spatio-temporal image evidence.
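By way of illustration, the following sketch classifies a motion region from its bounding-box dimensions. The threshold values are hypothetical; the paper specifies only the qualitative rule.

```python
# Hypothetical thresholds: the paper states only the qualitative rule
# (large regions -> vehicles; smaller, vertically elongated -> people).
MIN_VEHICLE_AREA = 2000   # pixels (assumed)
MIN_PERSON_ASPECT = 1.5   # height/width ratio (assumed)

def assign_region(width, height):
    """Assign a detected motion region to a tracking module."""
    if width * height >= MIN_VEHICLE_AREA:
        return "vehicle"
    if height >= MIN_PERSON_ASPECT * max(width, 1):
        return "pedestrian"
    return "unassigned"   # too small or ambiguous; left for later frames
```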
A background image is maintained to describe the stationary portion of the scene. Parked vehicles are classified as part of the background image. This is achieved easily by dynamically updating the background image using an approximation to a pixelwise median filter over time, as in McFarlane and Schofield.
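A minimal sketch of one such incremental update, assuming 8-bit grey-level images: each background pixel is nudged one step towards the current frame, which converges towards the temporal median.

```python
import numpy as np

def update_background(background, frame, step=1):
    """One incremental update of the approximate pixelwise median:
    each background pixel is nudged `step` grey levels towards the
    corresponding pixel of the current frame."""
    bg = background.astype(np.int16)
    f = frame.astype(np.int16)
    bg = bg + step * (f > bg) - step * (f < bg)
    return np.clip(bg, 0, 255).astype(np.uint8)
```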
Once a dynamic event has been assigned to it, the vehicle module analyses the event region of interest (ROI) using a hypothesise-and-test scheme. The major steps are outlined below.
In normal traffic scenes, the 3D pose of a road vehicle has only three degrees of freedom (DoF), as vehicles usually stand on the road (the so-called ground-plane constraint, or GPC). In our work, the initial pose of a vehicle is recovered by first determining its orientation and then its ground-plane (GP) location. In the following, we outline our solutions to the various sub-tasks in the vehicle module.
If the vehicle class is known, a pre-compiled 3D geometric model is available. Under the GPC, a match between a 2D image line and a 3D model line specifies two constraints on the three pose parameters θ, X and Y:

a1 cos θ + a2 sin θ = a3
a4 X + a5 Y = a6

where a1-a6 are known coefficients. The first constraint involves only the orientation parameter θ and can thus be used to determine vehicle orientation independently of the vehicle's location on the GP. Two orientation determination algorithms have been developed to provide different trade-offs between accuracy, computational cost and hardware requirements.
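For illustration, a constraint of the reconstructed form above can be solved for θ in closed form. The sketch below (an assumption about the exact constraint shape, not the paper's published algorithms) returns the up-to-two candidate orientations contributed by a single line match.

```python
import math

def orientation_candidates(a1, a2, a3):
    """Solve a1*cos(t) + a2*sin(t) = a3 for t.

    Writing the left side as R*cos(t - phi), with R = hypot(a1, a2)
    and phi = atan2(a2, a1), gives up to two solutions per match.
    """
    r = math.hypot(a1, a2)
    if r == 0.0:
        return []                     # degenerate match
    c = a3 / r
    if abs(c) > 1.0:
        return []                     # no consistent orientation
    phi = math.atan2(a2, a1)
    d = math.acos(c)
    return [(phi + d) % (2 * math.pi), (phi - d) % (2 * math.pi)]
```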
Two related algorithms have also been developed to recover the GP location. The most efficient of these searches for simultaneous matches in a set of three separate 1D correlations. The coordinate system of each 3D model defines three major directions along which edges occur. Edge data is collapsed onto 1D projections of these directions to yield three templates, which are correlated with the corresponding model templates. The intersection of the three peaks gives the projected model's location in the image plane (see the references for further discussion of the technique).
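The core of each 1D search can be sketched as follows; this is a simplified, single-direction version (the full method searches for simultaneous matches across all three directions).

```python
import numpy as np

def locate_1d(edge_hist, model_hist):
    """Correlate a 1D projection of image edge data with the
    corresponding 1D model template and return the peak offset."""
    scores = np.correlate(edge_hist, model_hist, mode="valid")
    return int(np.argmax(scores))

# One such search is run along each of the three major model
# directions; the intersection of the three peak offsets fixes the
# projected model's location in the image plane.
```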
In general, the pose recovery algorithms outlined above produce multiple candidate poses for the vehicle in the ROI. The quality of each candidate pose is measured by a scalar evaluation score obtained by comparing the model projection with the image data. Each candidate pose is then tested and refined to find the optimum pose.
The method used examines each projected model line independently. The normal to a model image line defines a direction along which large image derivative values provide evidence for the line. Points of local derivative maxima are found for normals spaced along the line, and used to obtain an evaluation score. The score is expressed as the estimated value weighted by a prior probability precomputed using a Monte Carlo scheme (see the references for a more thorough discussion of the technique).
The final evaluation score obtained proves to be reasonably independent of the pose, the object, and the scene. Starting from a seed pose (e.g. one obtained by the pose recovery algorithms above), the pose space may subsequently be searched using an optimisation algorithm to determine the local optimum pose. The evaluation score obtained gives an absolute measure of the quality of the hypothesis.
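A simplified sketch of the evidence gathering along line normals is given below; it omits the Monte Carlo prior weighting and simply averages the strongest gradient response found on each normal.

```python
import numpy as np

def line_score(grad_mag, samples, normal, search=5):
    """Score one projected model line: at each sample point, look for
    the largest image-gradient response along the line normal within
    +/- `search` pixels, and average the responses.

    grad_mag : 2D array of gradient magnitudes
    samples  : iterable of (x, y) sample points on the line
    normal   : unit normal (nx, ny) to the line
    """
    h, w = grad_mag.shape
    total, count = 0.0, 0
    for x, y in samples:
        best = 0.0
        for d in range(-search, search + 1):
            u = int(round(x + d * normal[0]))
            v = int(round(y + d * normal[1]))
            if 0 <= u < w and 0 <= v < h:
                best = max(best, grad_mag[v, u])
        total += best
        count += 1
    return total / max(count, 1)
```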
One way of determining the vehicle class is by trying all the class models in turn. The model associated with the highest evaluation score identifies the class of the vehicle. The cost of the identification process is proportional to the number of models, but the orientation can be recovered independently of specific models as discussed above.
The pose and class determination algorithms just described have proved effective in handling a variety of vehicles in various outdoor traffic scenes. Classification based solely on the evaluation score may not work in complex cases, either because of occlusions or because of the particular vehicle pose. In such cases the vehicle module can perform classification using the temporal consistency of the evaluation score.
The evaluation score is collected for a fixed number of frames for each vehicle class hypothesis. The class is determined by the hypothesis with the highest accumulated score. The graph in Figure 1 shows the results of one experiment in which 50 frames were used to collect the evaluation scores.
The hatchback vehicle class had the highest accumulated evaluation score. The saloon class was the only strong competitor during the first 30 frames. Occasional mismatches, leading to poor scores, are tolerated by the technique, which manages to resolve ambiguities as the track evolves.
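The accumulation scheme itself is straightforward; a minimal sketch follows (the function and variable names are illustrative, not taken from the system).

```python
def classify_by_accumulation(score_streams, n_frames=50):
    """Temporal-consistency classification: accumulate per-frame
    evaluation scores for each class hypothesis over a fixed window
    (50 frames in the experiment of Figure 1) and pick the best.

    score_streams : dict mapping class name -> list of per-frame scores
    """
    totals = {c: sum(s[:n_frames]) for c, s in score_streams.items()}
    return max(totals, key=totals.get)
```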
Once the pose and class of a vehicle have been identified, the vehicle is tracked through subsequent frames. The state of a moving vehicle at any time instant is described by a 6-tuple (x, y, θ, v, a, φ), where (x, y) is the GP location, θ the orientation, v and a the tangential velocity and acceleration, and φ the steering angle. The motion trajectory of a vehicle is strongly constrained by vehicle dynamics and driving behaviour, which are defined by the evolution of the steering angle φ (controlled by the steering wheel) and the tangential acceleration a (controlled by the accelerator and brake pedals). By regarding φ and a as two stochastic processes, a filter can be developed to predict the pose of the vehicle in future frames (see Maybank et al. for further details on the stochastic filter).
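A plausible discrete-time prediction step for such a state is sketched below, assuming a simple bicycle model; the model form and the wheelbase value are assumptions for illustration, not the filter of Maybank et al.

```python
import math

def predict_vehicle(state, dt, wheelbase=2.5):
    """One prediction step for the 6-tuple state (x, y, theta, v, a, phi).

    A simple bicycle model is assumed here; the wheelbase is a
    hypothetical value. In the full filter, phi and a are modelled
    as stochastic processes rather than held constant.
    """
    x, y, theta, v, a, phi = state
    x += v * math.cos(theta) * dt           # advance GP position
    y += v * math.sin(theta) * dt
    theta += (v / wheelbase) * math.tan(phi) * dt   # steering turns the heading
    v += a * dt                             # tangential acceleration
    return (x, y, theta, v, a, phi)
```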
If the dynamic event is assigned to the pedestrian visual module, a flexible 2D prior model of silhouette shape, based on B-spline contours, is used as an effective approach to recognising and tracking walking pedestrians. This is in contrast to more conventional 3D hand-crafted part-based models (e.g. Gavrila and Davis, Rohr, Hogg).
A prior shape model incorporates useful constraints on the apparent shape of a pedestrian silhouette that allow the system to cope with missing information due to image noise, background clutter and partial occlusion. A deformable model is required to capture the apparent change in shape due to pose (i.e. position of limbs etc.) and viewpoint relative to the camera. An important feature of this approach is that the model is learned automatically, using training image sequences containing the object of interest. This allows the system to be trained from a particular camera view to learn specific shape constraints appropriate to the given object in the given scene.
A further benefit of this approach is the very low dimensionality of the model and the linear nature of the transform from model space to image space. The consistent shape parametrisation offered by the B-spline representation allows the control points representing each training shape to be treated in a similar manner to the landmark features of Cootes et al. and Bookstein. The shapes are aligned to a reference frame and Principal Component Analysis is performed on the spline data (using an appropriate distance metric for this shape-vector space) to obtain a small set of modes of variation that capture a high proportion of the variability observed in the training data.
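A minimal sketch of this learning step, assuming the training shapes are already aligned and using a plain Euclidean metric for simplicity (the paper uses a metric adapted to the spline shape space):

```python
import numpy as np

def learn_eigenshapes(shapes, var_fraction=0.95):
    """Learn a small linear shape basis from aligned training shapes.

    shapes : (N, 2K) array; each row holds the flattened control
             points of one aligned B-spline training shape.
    Returns the mean shape, the leading eigenvectors capturing
    `var_fraction` of the observed variance, and their eigenvalues.
    """
    mean = shapes.mean(axis=0)
    cov = np.cov(shapes - mean, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1]          # largest variance first
    evals, evecs = evals[order], evecs[:, order]
    k = int(np.searchsorted(np.cumsum(evals) / evals.sum(),
                            var_fraction)) + 1
    return mean, evecs[:, :k], evals[:k]

# A shape is then reconstructed as mean + basis @ b, where b is the
# small vector of shape parameters tracked by the Kalman filter.
```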
A statistically based tracking system is used to estimate model parameters for each moving pedestrian in the scene. The method relies on a Kalman filter mechanism, which proves computationally efficient, allowing real-time tracking performance for several moving pedestrians.
The ROI assigned to the pedestrian module is thresholded to form a binary image. This is used to generate new hypotheses. For the purposes of hypothesis generation, a pedestrian is assumed to have an average height (180 cm) and to stand vertically on the ground. This allows two hypotheses to be generated for each motion region.
Either the top of the region bounding box corresponds to a person's head, or the bottom line of the region bounding box corresponds to a person's feet. Each hypothesis corresponds to a different ground plane location. Image evidence for both (possibly) conflicting hypotheses is examined and the less likely hypothesis is rejected.
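A minimal sketch of the two-hypothesis scheme is given below. `image_to_ground` is a hypothetical calibration routine mapping an image point, together with an assumed height above the ground plane, to a ground-plane position; image coordinates are assumed to have y increasing downwards.

```python
ASSUMED_HEIGHT_CM = 180.0   # average pedestrian height assumed above

def pedestrian_hypotheses(bbox, image_to_ground):
    """Generate the two ground-plane hypotheses for a motion region.

    bbox : (x_min, y_min, x_max, y_max) of the region bounding box.
    """
    x_mid = 0.5 * (bbox[0] + bbox[2])
    # Hypothesis 1: top of the box is the head, 180 cm above the GP.
    h1 = image_to_ground((x_mid, bbox[1]), height_cm=ASSUMED_HEIGHT_CM)
    # Hypothesis 2: bottom of the box is the feet, on the GP itself.
    h2 = image_to_ground((x_mid, bbox[3]), height_cm=0.0)
    return [h1, h2]   # the less well-supported hypothesis is rejected later
```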
The image fitting process described previously (Section 2.3) is used to provide a fitness measure for the current hypothesis. The fitness measure is the percentage of the contour that is locked onto a significant image feature. If the fitness for a hypothesis is consistently high over several frames then the hypothesis is accepted. Otherwise the hypothesis is rejected and no longer examined.
Once a hypothesis is accepted as a moving person, the person is tracked until the uncertainty in position becomes too large and the track is lost. This can occur when a person walks out of the image, enters a building or a car, or is totally occluded for a significant length of time. Each pedestrian hypothesis in the scene is represented by the centre image coordinates (x_c, y_c), the scale s, the orientation θ, and a small set of shape parameters b_i, which are the weights for the associated eigenshape basis vectors.
An iterated Kalman filter provides a useful framework for parameter estimation in a single static image. The method summarised here has been described in previous work, and is based on the earlier work of Blake et al. The object state parameters are grouped into the position parameters (and the velocity), the alignment parameters (rotation and scale), and each shape parameter. For computational speed, each group is filtered independently.
The Kalman filter maintained for each shape parameter b_i holds an estimate of the shape parameter and the associated variance of that estimate. The shape parameters are initialised to zero (i.e. the mean shape) and the associated variances are set to the variance of each parameter over the training set. The position, rotation and scale parameters are initialised using the information provided by the dynamic event detector.
At each iteration, regularly spaced sample points are calculated along the estimated B-spline contour. Measurements are made by searching for suitable features (such as large changes in intensity) along an optimal search line related to the contour normal direction. The search region for each line is determined by an uncertainty ellipse constructed from the positional covariance, and is chosen to detect features that lie within a Mahalanobis distance of 2 standard deviations from the expected position.
Each point measurement has an associated pointwise measurement covariance matrix.
A Kalman filter is used to combine the measurements in the shape parameter updating process. Between image frames, each shape parameter is assumed constant with a random perturbation term proportional to the variance of the shape parameter over the training set.
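Because each shape parameter is filtered independently with random-walk dynamics, the per-parameter update reduces to a scalar Kalman filter. A minimal sketch, assuming the measurement z is the observed shape projected onto the corresponding eigenshape mode:

```python
def update_shape_parameter(b, var, z, r, q):
    """One scalar Kalman filter cycle for a single shape parameter.

    b, var : current estimate and its variance
    z, r   : measurement (projection onto this mode) and its variance
    q      : process noise, proportional to the parameter's variance
             over the training set (random perturbation between frames)
    """
    var = var + q                  # predict: constant + random perturbation
    k = var / (var + r)            # Kalman gain
    b = b + k * (z - b)            # correct with the measurement
    var = (1.0 - k) * var
    return b, var
```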
Alternatively, a spatiotemporal model can be learned to give better predictions for walking pedestrians. The pedestrian is assumed to have a constant velocity with a random acceleration component. The GPC can be used to constrain the filtering process: the image height and position parameters can be mapped to ground-plane position and height parameters, which are filtered in a similar manner to the image-centred parameters and used to provide more accurate estimates of image position and image height. The associated covariance matrices can be projected between the world and image frames by linearising the camera projection transformation using the current estimate of object depth.
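The covariance projection amounts to a first-order propagation through the camera model. A sketch, assuming an arbitrary projection function `project` supplied by the calibration (the finite-difference Jacobian stands in for the analytic linearisation):

```python
import numpy as np

def numerical_jacobian(project, x, eps=1e-6):
    """Finite-difference Jacobian of the camera projection `project`
    (world point -> image point), evaluated at the current estimate x."""
    fx = np.asarray(project(x), dtype=float)
    x = np.asarray(x, dtype=float)
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (np.asarray(project(x + dx), dtype=float) - fx) / eps
    return J

def project_covariance(cov_world, J):
    """Propagate a world-frame covariance into the image frame via the
    linearised projection: cov_image = J cov_world J^T."""
    return J @ cov_world @ J.T
```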
The principal contribution of this paper is in the integration of the two visual modules described in the previous sections. This is achieved by transforming the 2D information provided by the person tracker into the 3D ground plane coordinate system, and then reasoning about occlusions in 3D.
Occlusion handling requires careful processing; the vision system uses two basic pieces of information: the ground-plane positions of the tracked objects and their image silhouettes.
Unknown objects are objects that have not been identified. Two unknown objects may occlude one another, but if both objects are stationary the integrated system will not see them as separate objects. However, if either or both objects are moving, the occlusion may be resolved at a later stage by the system. If one of the objects is moving it will be detected as an event, and the region will most likely be assigned to a tracker on the basis of its dimensions. As mentioned before, more effective selection by the motion detection module can be provided at this stage only if more complex techniques are used.
If the occluding and occluded objects are two vehicles, the system uses a hidden-line removal technique to cater for the occlusion. If one of the two objects is a pedestrian, the silhouettes and the ground-plane map are used for occlusion handling. If a pedestrian moves in front of a vehicle, the occluded area is too small to have a large effect on either the interpretation or the tracking of the vehicle; nevertheless, silhouettes can be used to improve the matching or tracking process. On the other hand, if a person is occluded by a vehicle, the person might be only partially visible for some time, but still detectable.
Partial occlusion has two effects on the person tracking process.
1. Distortion of measurements: The image fitting scheme is driven by image measurements such as local edge and motion based features. Partial occlusion (or near occlusion) by a second object will often distort these measurements because a false significant feature may be found that corresponds to the wrong object.
2. Hiding features: If a person walks behind a car, the image features may be occluded and no significant feature is found within a sample point's search region. This increases the total variance of the measurement process.
The increase in the variance is appropriate for tracking, because confidence in the measurements should decrease under partial occlusion. However, hidden features must be compensated for in the validation phase. This is handled in the pedestrian tracker by ignoring image data where occlusion is likely, using an occlusion mask. Figure 2(b) illustrates an example where a pedestrian is between a line of cars. In addition to ignoring possibly occluded measurements, the validation fitness measure is modified by considering the percentage of the unoccluded contour that is locked onto an image feature.
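A minimal sketch of this modified fitness measure (names are illustrative): only contour sample points outside the occlusion mask contribute to the validation score.

```python
def masked_fitness(locked, occluded):
    """Validation fitness under partial occlusion: the percentage of
    the *unoccluded* contour sample points locked onto an image feature.

    locked, occluded : per-sample-point boolean flags of equal length.
    """
    visible = [l for l, o in zip(locked, occluded) if not o]
    if not visible:
        return 0.0     # fully occluded: no evidence either way
    return 100.0 * sum(visible) / len(visible)
```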
The silhouettes of people and vehicles have been superimposed in Figure 2(b) to show the successful occlusion handling. In general, for each tracked object in front of a person, the associated silhouette region in the occlusion mask should be set. Note that a ground-plane map is needed to decide whether a person's ground-plane position (considered as a point) is in front of or behind each car's ground-plane bounding box.
A simple depth-ordering scheme such as that used by Koller et al. would not be sufficient to disambiguate all cases. An example of the problem with depth ordering is shown in Figures 2(a) and (b).
Figure 2(a) shows a top view of the ground plane with all the 3D objects (people are represented as cylinders), while Figure 2(b) shows that one person is occluded.
A car is considered in front of a person if the line from the person to the camera crosses the car bounding box on the ground plane.
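This in-front test reduces to a segment/rectangle intersection on the ground plane. A minimal sketch, assuming an axis-aligned ground-plane bounding box (the paper does not state the box's orientation; an oriented box would need a slightly more general test):

```python
def car_in_front(person, camera, car_box):
    """True if the ground-plane segment from the person to the camera
    crosses the car's ground-plane bounding box (x0, y0, x1, y1).
    Uses the standard slab-clipping (Liang-Barsky) test.
    """
    (px, py), (cx, cy) = person, camera
    dx, dy = cx - px, cy - py
    t0, t1 = 0.0, 1.0
    for p, d, lo, hi in ((px, dx, car_box[0], car_box[2]),
                         (py, dy, car_box[1], car_box[3])):
        if d == 0.0:
            if not (lo <= p <= hi):
                return False        # segment parallel to and outside slab
        else:
            ta, tb = (lo - p) / d, (hi - p) / d
            if ta > tb:
                ta, tb = tb, ta
            t0, t1 = max(t0, ta), min(t1, tb)
            if t0 > t1:
                return False        # slabs do not overlap on the segment
    return True
```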
Future approaches may use higher-level semantic scene descriptions to cope with more complex interactions, for instance where a person quickly approaches a vehicle and then abruptly changes course. The modelling and prediction of such a trajectory could make use of richer prior knowledge about the scene.
As mentioned before, occlusion handling is the crucial problem in the integration of the two tracking modules. Figure 3 shows the occlusion masks for two tracked pedestrians. Figure 3(a) shows how the occlusions have been resolved. The occlusion masks for the furthest and nearest pedestrians are shown in Figures 3(b) and (c) respectively. Note that in each case the second pedestrian is included in the occlusion mask; this is because moving objects mutually occlude one another when motion-based image features are used. The reader should be aware that the nearest car is also moving and occludes both pedestrians.
The remaining cars are stationary and have been incorporated into the background image. These stationary cars only occlude pedestrians and vehicles that are behind them.
A more difficult case of occlusion is shown in Figure 4. A driver gets out of the car he has just parked. As he emerges, he is occluded by the body of the car. However, since the car is already being tracked, the silhouette corresponding to the car is used to mask out the detected moving region. Figure 4(b) shows the image feature search lines that are unoccluded for a particular pedestrian hypothesis. The tracker is able to recognise and track this object even though it is initially partially occluded.
This paper has described the integration of two tracking modules into a single vision system capable of identifying and tracking vehicles and pedestrians in a complex dynamic scene. We have described how the integration was designed and presented examples to illustrate the performance of the system. The integration operates at the geometric level, demonstrating the pre-attentive capabilities of the system. New work is under way to design a system capable of producing a symbolic (semantic) description of the scene.
G. D. Sullivan and K. D. Baker, Model-based Vision for Road Traffic Understanding, 26th ISATA-93, Sept. 1993.
G. D. Sullivan, K. D. Baker, A. D. Worrall, C. I. Attwood and P. Remagnino, Model-based Vehicle Detection and Classification using Orthographic Approximations, Proc. 7th British Machine Vision Conference, Vol. 2, pp. 695-704, Sept. 1996.
T. N. Tan, G. D. Sullivan and K. D. Baker, Recognising Objects on the Ground Plane, Image and Vision Computing, Vol. 12, pp. 164-172, 1994.
G. D. Sullivan, Model-based Vision for Traffic Scenes using the Ground Plane Constraint, in D. Terzopoulos and C. Brown (Eds): Real-time Computer Vision, Cambridge University Press, 1994.
T. N. Tan, G. D. Sullivan and K. D. Baker, Linear Algorithms for Object Pose Estimation, Proc. 4th BMVC, pp. 125-134, 1993.
T. N. Tan, K. D. Baker and G. D. Sullivan, Model-independent Object Orientation Determination, IEEE Trans. Robotics and Automation (to appear).
T. N. Tan, G. D. Sullivan and K. D. Baker, Efficient Image Gradient Based Object Localisation and Recognition, Proc. CVPR96, pp. 397-402, 1996.
G. D. Sullivan, Visual Interpretation of Known Objects in Constrained Scenes, Phil. Trans. Roy. Soc. (B), Vol. 337, pp. 361-370, 1992.
S. J. Maybank, A. D. Worrall and G. D. Sullivan, A Filter for Visual Tracking Based on a Stochastic Model for Driver Behaviour, ECCV'96, pp. 540-549, April 1996.
D. M. Gavrila and L. S. Davis, 3-D Model-based Tracking of Humans in Action: a Multi-view Approach, IEEE Computer Vision and Pattern Recognition, 1996.
K. Rohr, Incremental Recognition of Pedestrians from Image Sequences, Computer Vision and Pattern Recognition, pp. 8-13, 1993.
D. Hogg, Model-Based Vision: A Program to See a Walking Person, Image and Vision Computing, Vol. 1(1), pp. 5-20, 1983.
A. Baumberg and D. Hogg, Learning Flexible Models from Image Sequences, European Conference on Computer Vision, Vol. 1, pp. 299-308, May 1994.
A. Baumberg and D. Hogg, Generating Spatiotemporal Models from Training Examples, Proc. 6th British Machine Vision Conference, Vol. 2, pp. 413-422, 1995.
A. Baumberg and D. Hogg, An Adaptive Eigenshape Model, Proc. 6th British Machine Vision Conference, Vol. 1, pp. 87-96, 1995.
A. Baumberg, Hierarchical Shape Fitting using an Iterated Linear Filter, Proc. 7th British Machine Vision Conference, Vol. 1, pp. 313-323, Sept. 1996.
A. Baumberg and D. Hogg, An Efficient Method for Contour Tracking using Active Shape Models, IEEE Workshop on Motion of Non-rigid and Articulated Objects, IEEE Computer Society Press, pp. 194-199, Nov. 1994.
N. J. B. McFarlane and C. P. Schofield, Segmentation and Tracking of Piglets in Images, Machine Vision and Applications, Vol. 8, pp. 187-193, 1995.
D. Koller, J. Weber and J. Malik, Robust Multiple Car Tracking with Occlusion Reasoning, European Conference on Computer Vision, Vol. 1, pp. 189-196, May 1994.
T. F. Cootes, C. J. Taylor, A. Lanitis, D. H. Cooper and J. Graham, Building and Using Flexible Models Incorporating Grey-Level Information, International Conference on Computer Vision, pp. 242-246, May 1993.
F. Bookstein, Morphometric Tools for Landmark Data, Cambridge University Press, 1991.
A. Blake, R. Curwen and A. Zisserman, A Framework for Spatio-temporal Control in the Tracking of Visual Contours, International Journal of Computer Vision, 1993.
Proceedings of the Image Understanding Workshop, Vol. 1, Feb. 1996.
H. Buxton and S. Gong, Visual Surveillance in a Dynamic and Uncertain World, Artificial Intelligence, Vol. 78, Issue 1-2, pp. 431-459, October 1995.
H.-H. Nagel, From Image Sequences towards Conceptual Descriptions, Image and Vision Computing, Vol. 6, 1988.