British Machine Vision Association and Society for Pattern Recognition

Understanding Visual Behaviour


Oral Presentations

Visual Models of Interaction

David Hogg
School of Computer Studies
University of Leeds, UK
It is widely believed that computers will become easier to use when we can
communicate with them as we communicate with each other.  Synthetic
talking-heads provide one medium for achieving this, allowing familiar
visual and auditory modalities of communication.  This raises many
challenging problems in cognition and synthesis. One such challenge is
generating a natural and appropriate flow of facial expressions as an
interaction develops. This may be achieved at different cognitive levels. At
one extreme, participants in a conversation are capable of reacting to the
facial expressions and intonation of an active speaker without fully
comprehending the content of what is being said (or even the language used).

We describe a method for generating facial expressions and other behaviours
using a hidden Markov model for the ways in which the appearance of
individuals in an interaction can jointly develop. The models are learnt
automatically by observing typical interactions between real people.
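
As a minimal illustration of the kind of model involved, the sketch below samples a sequence of joint expression states from a hand-set Markov transition matrix. This is only a toy: in the method described above the transitions are learnt from observed interactions and the states are hidden rather than fixed, and all state names here are invented.

```python
import numpy as np

# Hypothetical joint states: (speaker expression, listener expression).
states = [("smile", "smile"), ("smile", "neutral"),
          ("neutral", "smile"), ("neutral", "neutral")]
# Hand-set transition matrix standing in for one learnt from real interactions.
A = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.2, 0.5, 0.1, 0.2],
              [0.2, 0.1, 0.5, 0.2],
              [0.1, 0.2, 0.2, 0.5]])

def sample_sequence(A, start, length, rng):
    """Sample a Markov chain of joint expression-state indices."""
    seq = [start]
    for _ in range(length - 1):
        seq.append(rng.choice(len(states), p=A[seq[-1]]))
    return seq

rng = np.random.default_rng(0)
seq = sample_sequence(A, 0, 10, rng)
```

In the full model the generated state would drive the synthetic talking-head's expression at each step of the interaction.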

A Connectionist System for Visually Mediated Interaction

Hilary Buxton and Jonathan Howell
COGS - Cognitive Science Department
University of Sussex, UK

In this talk we introduce a set of effective connectionist techniques
for the various stages of computer-mediated visual communication, which
could be used, for example, in video-conferencing applications.

First, we present the background to this work in the recognition of identity,
expression and pose using Radial Basis Function (RBF) networks. Flexible
example-based learning methods allow a set of specialised networks to be
trained, each one recognising specific information from common test data.
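
As a rough sketch of the example-based RBF approach (not the authors' implementation; the toy feature vectors and labels below are invented), a network of Gaussian basis functions centred on training examples can be fitted with a linear least-squares output layer:

```python
import numpy as np

def rbf_features(X, centres, width):
    # Gaussian radial basis activations for each (sample, centre) pair.
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

# Toy data standing in for face feature vectors.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
y = (X[:, 0] > 0).astype(float)      # e.g. a binary pose label
centres = X[::4]                     # example-based centres
Phi = rbf_features(X, centres, width=1.5)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
pred = (Phi @ w > 0.5).astype(float)
train_acc = (pred == y).mean()
```

Several such networks, each with its own output weights over shared basis activations, could then specialise in identity, expression or pose from common test data.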

Second, we address the problem of gesture-based communication. Colour/motion
`blobs' are used to direct face detection and the capture of `attentional
frames', surrounding the upper torso and head of the subjects, which focus
the processing for visually mediated interaction.

Third, we present methods for gesture recognition and behaviour
(user-camera) coordination in the system. We take an appearance-based approach,
allowing specific phases of gestures to be represented as explicit time-delay
vectors of Gabor filter coefficients collected from the attentional frames.
These are used as input to RBF networks which extract gesture and head pose
information; this, in turn, is fed to a further network which analyses
group behaviour in order to control the camera systems in an integrated way.

Finally, we discuss how these techniques can be extended to `virtual groups'
of multiple people interacting at multiple sites.

A Probabilistic Sensor for the Perception and the Recognition of Activities

Jim Crowley, Olivier Chomat and Jerome Martin
INRIA Rhône-Alpes,
Grenoble, France

We present a new technique for the perception and recognition of activities using statistical
descriptions of their spatio-temporal properties.  A set of motion energy receptive fields is
designed in order to sample the power spectrum of a moving texture.  Their structure relates
to the spatio-temporal energy models of Adelson and Bergen where measures of local visual
motion information are extracted by comparing the outputs of a triad of Gabor energy filters.
Then the probability density function required for Bayes' rule is estimated for each class of
activity by computing multi-dimensional histograms from the outputs of the set of receptive
fields.  The perception of activities is achieved according to Bayes' rule.  The result at each instant
of time is a map of the conditional probabilities that each pixel belongs to each one of the
activities of the training set.  Since activities are perceived over a short integration time, a
temporal analysis of the outputs is performed using a Hidden Markov Model.  The approach is
validated with experiments in the perception and recognition of the activities of people walking
in visual surveillance scenes.

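The histogram-plus-Bayes step can be sketched in a few lines. This toy fakes the receptive-field outputs with quantised random responses for two activity classes; in the actual system they would come from the spatio-temporal Gabor energy filters described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins = 8

def histogram_pdf(responses, n_bins):
    """Joint histogram over quantised filter outputs -> class likelihood."""
    hist = np.zeros((n_bins,) * responses.shape[1])
    for r in responses:
        hist[tuple(r)] += 1
    return hist / hist.sum()

# Two activity classes, two filter channels, responses quantised to bins.
walk = rng.integers(0, 4, size=(500, 2))   # class biased to low bins
run  = rng.integers(3, 8, size=(500, 2))   # class biased to high bins
pdfs = [histogram_pdf(walk, n_bins), histogram_pdf(run, n_bins)]
prior = np.array([0.5, 0.5])

def posterior(r):
    # Bayes' rule: p(class | response) ~ p(response | class) p(class).
    like = np.array([p[tuple(r)] for p in pdfs])
    joint = like * prior
    return joint / joint.sum() if joint.sum() > 0 else prior

p = posterior(np.array([1, 1]))   # a low-bin response
```

Evaluating `posterior` at every pixel yields the per-pixel conditional probability maps mentioned in the abstract.
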
A New Model for Visual Attention

Fred Stentiford
BTexaCT Research
Adastral Park, Martlesham, UK

The ability of higher animals to direct attention towards anomalies in
their environment has enormous survival value and it is not therefore
surprising that we humans possess an uncanny capacity to spot the
unusual without prior knowledge.  Existing models of visual attention
have provided plausible explanations for many of the standard percepts
and illusions, and yet none has yielded an implementation leading to
generic applications.  This paper describes a new measure of visual
attention and how it is applied to a range of images.

Simultaneous Perception and Classification of Visual Motion

Andrew Blake
Microsoft Research
Cambridge, UK

The importance in computer vision of learning shapes, patterns and
motions is becoming increasingly apparent. Here we report on some
theory and experiment concerning the learning of dynamics. The theory
is based on Expectation Maximisation (EM) and the Condensation
filter --- dubbed EM-C learning. The experiments concern
learning of complex motion in the domain of juggling. The learning
algorithm is computationally intensive, but the reward is that a
meaningful dynamical model is seen to emerge largely unsupervised from
training data. The model assists with accurate machine perception of
the motion and also delivers motion classification on the fly.
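
The EM half of the idea can be sketched in miniature. The code below fits a mixture of two AR(1) dynamical regimes to a synthetic switching sequence; it is a heavily reduced stand-in (parameters invented, and the Condensation filtering step of EM-C is omitted entirely):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic sequence that switches between two dynamics, a=0.9 and a=-0.5.
x = [1.0]
for t in range(400):
    a = 0.9 if (t // 100) % 2 == 0 else -0.5
    x.append(a * x[-1] + 0.05 * rng.normal())
x = np.array(x)
prev, cur = x[:-1], x[1:]

a_hat = np.array([0.5, -0.1])   # initial guesses for the two regimes
sigma = 0.05                    # assumed known process noise
for _ in range(50):
    # E-step: soft responsibility of each regime for each transition.
    err = cur[None, :] - a_hat[:, None] * prev[None, :]
    logp = -0.5 * (err / sigma) ** 2
    logp -= logp.max(axis=0)
    r = np.exp(logp)
    r /= r.sum(axis=0)
    # M-step: weighted least-squares refit of each AR coefficient.
    a_hat = (r * prev * cur).sum(axis=1) / (r * prev ** 2).sum(axis=1)
```

The responsibilities computed in the E-step are what deliver motion classification "on the fly" once the dynamics have been learnt.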

Visual Interaction Using Gesture and Behaviour

Shaogang Gong and Jamie Sherrah
Queen Mary College, University of London

In this work we address some underlying issues in developing a
real-time computer vision system that learns to model discontinuous
dynamics exhibited in the parameter space of live user actions and
behaviour. The parameters include head and hand positions, 3D head
pose and body gesture temporal structures. The system is required to
be robust (1) under low-bandwidth video input (i.e. low resolution
and discontinuous dynamics), (2) using only view-based visual data
without camera calibration and (3) in uncontrolled environments with
cluttered backgrounds.

Action Reaction Learning for Predicting Interactive Behavior

Tony Jebara and Alex Pentland
MIT Media Lab
Massachusetts Institute of Technology
Cambridge, MA 02139,  USA

We propose Action-Reaction Learning (ARL) for modeling and acquiring
interactive human behavior autonomously. An ARL system starts by
observing the interactions of two real agents (i.e. two humans) using
audio-visual modalities. Humans are tracked in real-time and each
produces a time series of perceptual measurements. This temporal
series data is used to train a predictive model. Effectively, a
conditional distribution for regressing the future given a short term
memory of the past is formed. One possible choice for the
distribution is a mixture of experts model which we estimate via
maximum conditional likelihood (CEM algorithm).

Once a model is trained, a single human user interacts with the system
alone. The predictive model infers the most likely reactions to the
single user's actions and forecasts the missing agent's time
series. This is rendered as synthetic perceptual measurements. By
looping the process, we generate a real-time synthetic animation which
interactively responds to the user based on the behavior it has been
shown previously. In an unsupervised way, the system learns imitative
behavior and is demonstrated to respond appropriately to visual
gestures such as waving, clapping, etc.
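
The prediction loop can be sketched with a single linear regressor standing in for the mixture of experts (a deliberate simplification; the signals below are synthetic and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
T, win = 300, 5
a = np.sin(np.linspace(0, 20, T))                     # agent A's "action"
b = 0.8 * np.roll(a, 3) + 0.01 * rng.normal(size=T)   # B reacts with a lag

# Short-term memory of both agents -> next measurement of the other agent.
X = np.array([np.concatenate([a[t - win:t], b[t - win:t]])
              for t in range(win, T)])
y = b[win:]
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Closed-loop use: forecast the missing agent's reaction from recent history.
t = 200
pred = np.concatenate([a[t - win:t], b[t - win:t]]) @ w
err = abs(pred - b[t])
```

Feeding each forecast back into the history window, as ARL does, turns the one-step predictor into an interactive synthetic agent.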

In earlier work, the system tracked heads and hands using a mixture of
Gaussians for skin color and blob shapes. More recent work involves
ARL in a wearable computer platform which can be worn for extended
periods.  On board, two cameras and two microphones track the face and
words of the wearer as well as people in his field of view. The system
then learns to predict the wearer's audio-visual responses (reactions)
to stimuli that others present. Currently, we are considering
other probabilistic models and novel discriminative criteria that
emphasize prediction accuracy. One criterion is Maximum Entropy
Discrimination (MED), the probabilistic analog of Support Vector
Machines, which helps improve predictive power.

Markerless Motion Capture for Studio Production

Adrian Hilton
Centre for Vision, Speech and Signal Processing
University of Surrey, UK

Studio production of 3D content requires real-time capture of actor
movements and photo-realistic rendering of animated actor models. The
problem of markerless reconstruction of human motion is formulated as
the solution of inverse kinematics over multiple views. Initial
results demonstrate that this enables real-time reconstruction of
complex 3D movements. This presentation will discuss the limitations
of markerless motion capture and critical problems which must be
resolved to achieve reliable performance.
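
As a single-view toy of this kind of kinematic fitting (the system itself solves inverse kinematics over multiple views; segment lengths and target here are invented), the joint angles of a planar two-link limb can be recovered by Jacobian-transpose descent on the position error:

```python
import numpy as np

L1, L2 = 1.0, 0.8   # assumed limb segment lengths

def end_point(th):
    """Planar 2-link forward kinematics."""
    x = L1 * np.cos(th[0]) + L2 * np.cos(th[0] + th[1])
    y = L1 * np.sin(th[0]) + L2 * np.sin(th[0] + th[1])
    return np.array([x, y])

target = end_point(np.array([0.7, 0.4]))   # synthetic 2D observation
th = np.array([0.0, 0.0])
for _ in range(2000):
    # Numerical Jacobian of the end point w.r.t. the joint angles.
    J = np.zeros((2, 2))
    for j in range(2):
        d = np.zeros(2)
        d[j] = 1e-5
        J[:, j] = (end_point(th + d) - end_point(th)) / 1e-5
    # Jacobian-transpose step down the reprojection error.
    th = th + 0.1 * J.T @ (target - end_point(th))
residual = np.linalg.norm(end_point(th) - target)
```

With multiple calibrated views, the same residual would be summed over the reprojection error in every camera, which is what resolves the depth ambiguities of the single-view case.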

Poster Presentations

Automatic Gait Recognition via the Generalised Symmetry Operator

James B. Hayfron-Acquah, Mark S. Nixon and John N. Carter
Department of Electronics and Computer Science,
University of Southampton, UK

We describe a new method for automatic gait recognition based on analysing the
symmetry of human motion, by using the Generalised Symmetry Operator. This operator,
rather than relying on the borders of a shape or on its general appearance, is able to locate
features by their symmetrical properties. Essentially, it accumulates the symmetries
between image points to give a symmetry map. This approach is reinforced by the
psychologists' view that gait is a symmetrical pattern of motion. It is also supported by
work that suggests pendular motion is an appropriate model for automatic gait recognition.
The Fourier transform is used to derive the gait signatures from the symmetry map, in
view of its invariance properties and its descriptive capability. We applied our new
method to two databases, of four sequences of four subjects and of seven sequences of
six subjects respectively. For both databases, we derived gait signatures for silhouette and optic flow
information. The results show that the symmetry properties of individuals' gait appear to
be unique and can indeed be used for recognition. We have so far achieved a Correct
Classification Rate exceeding 95% using the nearest-neighbour rule with k=1 and k=3,
a very promising start. The performance analysis also suggests that symmetry
enjoys practical advantages in recognition such as relative immunity to noise and missing
frames, with capability to handle occlusion.
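
A simplified sketch of the symmetry accumulation is given below: pairs of edge points vote at their midpoint, weighted by a Gaussian of their distance and by a phase term that favours gradients facing each other. The weighting functions follow the common distance-and-phase form and are illustrative rather than the exact operator used.

```python
import numpy as np

def symmetry_map(points, thetas, shape, sigma=4.0):
    """Accumulate pairwise symmetry votes at midpoints of edge-point pairs."""
    sym = np.zeros(shape)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            (ri, ci), (rj, cj) = points[i], points[j]
            d = np.hypot(rj - ri, cj - ci)
            alpha = np.arctan2(rj - ri, cj - ci)       # pair direction
            # Phase term: zero for parallel gradients, maximal for facing ones.
            phase = ((1 - np.cos(thetas[i] + thetas[j] - 2 * alpha))
                     * (1 - np.cos(thetas[i] - thetas[j])))
            w = phase * np.exp(-d ** 2 / (2 * sigma ** 2))
            r, c = round((ri + rj) / 2), round((ci + cj) / 2)
            if 0 <= r < shape[0] and 0 <= c < shape[1]:
                sym[r, c] += w
    return sym

# Two facing vertical contours (gradients pointing towards each other),
# a crude stand-in for the inner edges of a walker's legs.
pts = [(r, 10) for r in range(5, 25)] + [(r, 16) for r in range(5, 25)]
ths = [0.0] * 20 + [np.pi] * 20
sym = symmetry_map(pts, ths, (30, 30))
peak = np.unravel_index(sym.argmax(), sym.shape)
```

The votes concentrate on the axis midway between the two contours, which is the kind of structure the symmetry map makes available for gait signatures.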

New Area Measures for Automatic Gait Recognition

Jeff P. Foster, Adam Prugel-Bennett and Mark S. Nixon
Department of Electronics and Computer Science,
University of Southampton, UK

A biometric is a measure of some human characteristic that can be used to distinguish
between individuals. Gait is a new biometric aimed at recognising someone by the way
they walk.  It is not immediately apparent that gait is a unique biometric; however, even
in Shakespeare's time there was reference to gait recognition. In The Tempest
[Act 4, Scene 1], Ceres observes "High'st Queen of state, Great Juno comes; I know her
by her gait."  Gait has several notable advantages over other biometrics such as
fingerprints.  It is a non-invasive technique, meaning that the subject need not even know
they are being recognised.  Gait also allows recognition from a great distance where other
biometrics such as face recognition might fail.  A bank robber can disguise their gait less
easily than their face, in fact disguising one's gait only has the effect of making oneself
look more suspicious!

We describe a new technique for recognising gait, which we call gait masks.
Essentially, gait masks are used to derive information from a sequence of silhouettes.
This information is directly related to the gait of the subject.  The table below shows
example gait masks and silhouettes from the sequences to which they are applied. The
gait masks measure how the silhouette changes over time in a chosen region of the body.
These area changes are intimately related to the nature of gait.  Applying a vertical
line mask shows the sinusoidal nature of the output.  The peaks in the graph
correspond to when the legs are closest together, and the dips represent when the legs are
at furthest flexion.  Using Canonical Analysis it is possible to use this output for
recognition purposes.  Initial results are promising with a correct recognition rate of over
80% on a small database.  Future work will concentrate on combining results from
different gait masks with the aim of increasing the overall recognition rate.  In addition,
the performance of the system on a much larger database of subjects will be evaluated.
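
A minimal sketch of the gait-mask idea on synthetic silhouettes (the mask shape, image sizes and leg model are all invented for illustration):

```python
import numpy as np

def masked_area(silhouette, mask):
    """Number of silhouette pixels falling inside the mask region."""
    return int(np.logical_and(silhouette, mask).sum())

# Synthetic 'silhouettes': two legs whose separation oscillates over a cycle.
H, W, T = 32, 32, 16
mask = np.zeros((H, W), dtype=bool)
mask[:, W // 2 - 1 : W // 2 + 2] = True        # vertical line mask

signal = []
for t in range(T):
    spread = int(4 + 3 * np.sin(2 * np.pi * t / T))  # leg separation
    sil = np.zeros((H, W), dtype=bool)
    sil[16:, W // 2 - spread] = True    # left leg
    sil[16:, W // 2 + spread] = True    # right leg
    signal.append(masked_area(sil, mask))
```

The signal peaks when the legs pass closest together and drops when they are far apart, giving the periodic trace that Canonical Analysis then reduces for recognition.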

A Hierarchical Model of Dynamics for Tracking People with a Single Video Camera

I. Karaulova, P. Hall and D. Marshall
Department of Computer Science, University of Cardiff, UK
Department of Mathematical Sciences, University of Bath, UK

We are addressing the problem of tracking human motion in monocular video
sequences. Tracking humans in video has applications in many areas including
surveillance, computer games, films, and biodynamics. We anticipate the results
will ease the process of extracting 3D human motion from video and make it less expensive.

We propose a novel hierarchical model of human dynamics for view
independent tracking of the human body in monocular video
sequences. The model is trained using real data from a collection of
people. Kinematics are encoded using Hierarchical Principal
Component Analysis, and dynamics are encoded using Hidden Markov
Models.  The top of the hierarchy contains information about the
whole body. The lower levels of the hierarchy contain more detailed
information about possible poses of some subpart of the body. When
tracking, the lower levels of the hierarchy are shown to improve
accuracy. The model can be trained on either 2D or 3D data. It allows us
to recover 3D skeleton poses from 2D image sequences, to track unknown people,
and to improve tracking accuracy.

We trained and tested our model on real-life data obtained using different
optical marker-based systems. The data describes walking motion of a number
of people in 2D or 3D. The experiments in 3D show that we are able to
recover the pose of a 3D skeleton on the basis of previously unseen 2D
image data, invariant of the camera view. The 2D experiments show that the system is
able to track both people it has been trained on and people it
has not been trained on. The main
contribution of our hierarchical model is the representation of minor
variations of a data set in a useful and compact manner, which
allows greater specificity while tracking.

Learning some geometric properties of a 3D scene

Dimitrios Makris and Tim Ellis
City University,
London UK

The work to be presented is part of research on surveillance and monitoring
systems consisting of a network of CCTV cameras. Such a system requires an
initialisation task that sets some geometrical properties of the observed area. Although
these properties can be set manually or through specific calibration tasks, the idea of
learning them simply from common observations of the scene is attractive. In addition,
such an approach gives the system the ability to adapt to changes. Geometrical properties
that a multi-camera surveillance system may require include camera models, the ground plane,
static occluding areas, the normal size and shape of the observed targets, normal paths, and
the entry and exit points of the targets. Visual learning from observations should exploit
target motion and the interaction between the targets and the static environment, using
object correspondences between frames from the same camera and between cameras. This
presentation will focus on learning the static occluding areas of the scene. The learning
process uses the derived motion-tracking information to produce activity maps, from which
occluding regions can be estimated, as well as common paths and entry and exit points.
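
A minimal sketch of building such an activity map from tracked positions (the trajectories below are invented): every observed target position increments a cell, so frequently used paths show up as high-count regions and never-visited cells hint at occlusions or obstacles.

```python
import numpy as np

H, W = 20, 20
activity = np.zeros((H, W), dtype=int)

# Hypothetical tracked trajectories: lists of (row, col) positions.
trajectories = [
    [(r, 5) for r in range(20)],          # a vertical path
    [(10, c) for c in range(20)],         # a horizontal path
]
for traj in trajectories:
    for r, c in traj:
        activity[r, c] += 1

# Trajectory endpoints are candidate entry and exit points of the scene.
entry_exit = [traj[0] for traj in trajectories] + [traj[-1] for traj in trajectories]
```

Accumulated over long observation periods, zero-activity regions adjacent to busy paths are candidates for static occluding areas.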

Tracking Drivers' Hand Movements

Gordon McAllister, Stephen McKenna and Ian Ricketts,
Department of Applied Computing, University of Dundee,
Scotland, UK

We present a computer vision system that detects and tracks a driver's
hands and forearms inside a vehicle without the use of coloured gloves,
markers or constrained backgrounds. This opens up a number of in-vehicle
applications including vehicle manoeuvre prediction, driver training,
driver fatigue detection and advanced driver-vehicle interfaces. We
focus on an interface in which the driver points at an in-vehicle
display. The tracker described here is used to localise regions for
detection of pointing fingers. Hands are detected by probabilistically
combining adaptive models of the driver's hands and arms (foreground)
and the vehicle interior (background). Adaptation allows illumination
changes and slight camera movement to be accounted for over time. The
background model holds statistics for each pixel except those inside the
steering wheel region. Foreground and steering wheel regions have their
colour distributions modelled. The system can handle changes of clothing
(gloves, long or short sleeves). In particular, a priori skin colour
models are not used. The foreground model is bootstrapped online using
initial segmentation results generated using the background model. The
combined models yield foreground membership probabilities. After
connected components analysis and size filtering, a Chamfer distance
transform is computed based on the connected component contours. This
distance transform is used to (i) decide how each pixel should be
weighted when used to update the adaptive foreground and background
models and (ii) provide 'smooth' image evidence for fitting 2D geometric
models of the hands and forearms. Geometric model fitting involves
maximising an objective function in order to simultaneously estimate
forearm orientation, hand position and scale. Model parameters are
tracked using Kalman filters. This 2D tracking process handles mutual
occlusion, resolving potentially ambiguous situations in which the hands
and forearms cross and uncross. The system currently runs at 15Hz on a
standard PC.
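
As an illustration of the distance transform step, a standard two-pass 3-4 chamfer transform (a common approximation to the Euclidean distance; not necessarily the exact variant used in the system) can be computed from a contour image as follows:

```python
import numpy as np

def chamfer(contour, big=10**6):
    """Two-pass 3-4 chamfer distance transform of a boolean contour image."""
    H, W = contour.shape
    d = np.where(contour, 0, big).astype(np.int64)
    for r in range(H):                     # forward pass
        for c in range(W):
            if r > 0:
                d[r, c] = min(d[r, c], d[r - 1, c] + 3)
                if c > 0:
                    d[r, c] = min(d[r, c], d[r - 1, c - 1] + 4)
                if c < W - 1:
                    d[r, c] = min(d[r, c], d[r - 1, c + 1] + 4)
            if c > 0:
                d[r, c] = min(d[r, c], d[r, c - 1] + 3)
    for r in range(H - 1, -1, -1):         # backward pass
        for c in range(W - 1, -1, -1):
            if r < H - 1:
                d[r, c] = min(d[r, c], d[r + 1, c] + 3)
                if c > 0:
                    d[r, c] = min(d[r, c], d[r + 1, c - 1] + 4)
                if c < W - 1:
                    d[r, c] = min(d[r, c], d[r + 1, c + 1] + 4)
            if c < W - 1:
                d[r, c] = min(d[r, c], d[r, c + 1] + 3)
    return d

# Single contour pixel at the centre of a small image.
contour = np.zeros((9, 9), dtype=bool)
contour[4, 4] = True
dist = chamfer(contour)
```

The resulting smooth distance field is what allows pixel weights for model adaptation and an image-evidence term for geometric model fitting to be read off directly.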