
Research Project


Vision-Based Navigation



OVERVIEW

In the context of autonomous navigation of multiple coordinated robots, vision plays a crucial role because it is a remote, passive, distributed sensor. Ultimately, interaction with unknown, complex, dynamic environments will require some form of visual sensory data processing, and the UCLA Vision Lab is investigating fundamental problems related to processing dynamic visual information for distributed control and interaction tasks.

APPROACH

Multi-view Descriptors:
Determining which points in different images correspond to the same location in space is a classic problem in computer vision. A common approach computes functions of the neighborhoods around points in single images, called descriptors. When video sequences are available, as in robot navigation, correspondence reduces to tracking, which is simplified by the small changes between consecutive frames. We can exploit this small-baseline tracking to develop multi-view descriptors, which incorporate the observed variability in geometry and illumination.
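For reference, a single-view descriptor of the kind described above can be sketched as follows. This is only an illustration, not the lab's descriptor: the function names, the square patch shape, and the choice of normalized cross-correlation as the similarity measure are all assumptions made for the example.

```python
import numpy as np

def patch_descriptor(image, row, col, half=3):
    """Hypothetical single-view descriptor: the (2*half+1)^2 neighborhood
    around (row, col), with its mean removed and scaled to unit norm."""
    patch = image[row - half:row + half + 1,
                  col - half:col + half + 1].astype(float)
    v = patch.ravel() - patch.mean()
    norm = np.linalg.norm(v)
    # A constant patch has zero norm; return the zero vector in that case.
    return v / norm if norm > 0 else v

def similarity(d1, d2):
    """Normalized cross-correlation of two descriptors (1.0 = identical)."""
    return float(d1 @ d2)
```

Because the mean is subtracted and the norm is fixed, this descriptor is unchanged by a gain and offset applied to the image intensities, but it has no built-in tolerance to changes in viewpoint, which is the kind of variability a multi-view descriptor is meant to capture.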

Our research over the past year has focused on developing multi-view feature descriptors and applying them to structure from motion and navigation tasks. The descriptors are derived by tracking points of interest across multiple frames of video, rectifying their image neighborhoods (patches) according to some geometric transformation (translational, affine, or projective), and learning the variability in the resulting image patches via kernel principal component analysis (kernel PCA). Kernel PCA applies classical PCA in a higher-dimensional feature space defined implicitly by a kernel function, which computes inner products in that space. When a new patch is observed, its similarity to previously observed features can be computed by projecting its rectified version onto the basis defined by the learned kernel principal components.
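The learning and projection steps can be sketched with NumPy as below. This is a minimal sketch under stated assumptions, not the project's code: the Gaussian (RBF) kernel, the `gamma` parameter, and the function names are chosen for illustration, and each row of `X` is assumed to be one rectified patch flattened to a vector.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_kernel_pca(X, n_components, gamma):
    """Learn kernel principal components from tracked patches (rows of X)."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    one = np.full((n, n), 1.0 / n)
    # Center the kernel matrix, i.e. subtract the mean in feature space.
    Kc = K - one @ K - K @ one + one @ K @ one
    w, V = np.linalg.eigh(Kc)                    # ascending eigenvalues
    idx = np.argsort(w)[::-1][:n_components]     # keep the largest ones
    w, V = w[idx], V[:, idx]
    # Scale so each component has unit norm in feature space.
    alphas = V / np.sqrt(np.maximum(w, 1e-12))
    return {"X": X, "K": K, "alphas": alphas, "gamma": gamma}

def project(model, Y):
    """Project new patches (rows of Y) onto the learned components."""
    X, K = model["X"], model["K"]
    n, m = X.shape[0], Y.shape[0]
    Ky = rbf_kernel(Y, X, model["gamma"])
    one_n = np.full((n, n), 1.0 / n)
    one_m = np.full((m, n), 1.0 / n)
    # Center the new kernel rows consistently with the training data.
    Kyc = Ky - one_m @ K - Ky @ one_n + one_m @ K @ one_n
    return Kyc @ model["alphas"]
```

A new patch `p` would be compared against a learned feature by calling `project(model, p[None, :])` and scoring the resulting coefficients; the particular scoring rule is left open here because the text does not specify it.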

Tracking of articulated and deforming objects:
The ability to organize and integrate multimodal data from audio and visual sensors is crucial for almost all biological creatures, which must analyze their surroundings to make critical survival decisions. In particular, sensor fusion becomes necessary whenever ambiguous auditory and visual information must be combined to support accurate perception. In the context of sensor networks, we have been working on modeling the dynamic relation between audio and video sensory data, with particular attention to applications in tracking articulated and deforming objects. In this context, we considered the challenging task of modeling facial motion induced by speech. The problem is simple to state: we collect motion-capture data for an individual together with the associated speech waveform, and from these data build a model that can generate novel synthetic facial motions for novel speech segments, for instance for an animated character.

Our approach is to model the face using decoupled shape and radiance elements. The shape element is defined by a number of salient points on the face that are photometrically distinct and can be reliably tracked across image frames. Deformations of the face image can then be described in terms of warping of the regions defined by the shape points.
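One common way to realize a warp of regions defined by shape points is a piecewise-affine map over triangles whose vertices are the tracked points. The sketch below assumes that triangulated form, which the text does not confirm; the function names are hypothetical.

```python
import numpy as np

def barycentric(p, tri):
    """Barycentric coordinates of 2-D point p in triangle tri (3x2 array)."""
    a, b, c = tri
    T = np.column_stack((b - a, c - a))
    u, v = np.linalg.solve(T, p - a)
    return np.array([1.0 - u - v, u, v])

def warp_point(p, tri_src, tri_dst):
    """Map p from the reference triangle to the deformed triangle by
    keeping its barycentric coordinates fixed (an affine map)."""
    return barycentric(p, tri_src) @ tri_dst
```

With this construction the shape points themselves move exactly to their tracked positions, and every pixel inside a triangle follows the affine motion of its three vertices, so the radiance (appearance) can be sampled in a fixed reference frame.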

SYSTEMS / EXPERIMENTS

Multi-view Descriptors:
We have developed experimental code to test our algorithms in a variety of scenarios and compared them to existing techniques. Figure 11 shows the results of a matching experiment involving changes in the geometry and pose of a flexible object. In Figure 12, we compare the multi-view technique to the scale-invariant feature transform (SIFT) method applied to a curved object. Figure 13 shows the results of a robot navigation experiment assisted by the multi-view feature descriptor; the circular curve indicates the location of the camera, and the thin lines its direction.

Figure 11 - Matching

Figure 12 - Application to curved object

Figure 13 - Robot Navigation Experiment

Tracking of articulated and deforming objects:
As preliminary experiments show, there is a strong correlation between the speech signal and the dynamics of a number of salient features. Figure 14 (left) plots the correlation between the speech signal and the feature trajectories (both reduced to scalar signals via PCA). The motion of the points around the mouth is clearly strongly correlated with the speech input.

We have shown how this dynamic relation can be captured by a linear dynamical system made up of two parts: a deterministic component driven by the speech waveform and a stochastic component driven by non-Gaussian noise. The rationale is that facial motion results from word utterances combined with physical characteristics of the face that are peculiar to each individual. Our goal is to decouple these two factors in the model, so that we can drive a model of an individual with arbitrary speech sequences while retaining that individual's distinctive character. In this sense, our work is in the vein of separating "style" and "content", but in a dynamic context. While the dynamics of facial motion can be faithfully modeled by a linear model with speech data as input, we allow for a stochastic input drawn from a non-Gaussian distribution in order to model the subtleties associated with each individual.

Despite its linear structure, this model does not fall into the standard form suitable for off-the-shelf system identification algorithms, because of the decoupled structure of the input-to-state relationship and the non-Gaussian nature of the stochastic input. Under quite general assumptions, we derive an optimal system identification procedure for the model parameters. The efficacy of the model in capturing the complexities of this time-dependent, multi-modal data can be inspected visually in the facial image sequences generated from speech data (Figure 14, right).
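The two-part structure can be illustrated with a small simulation. Everything specific in the sketch is an assumption made for illustration: the state-space form, the matrix names (`A`, `B`, `F`, `G`, `C`, `D`), the Laplace distribution standing in for the non-Gaussian noise, and the scalar input `u` standing in for a PCA-reduced speech signal. It shows the forward model only, not the identification procedure.

```python
import numpy as np

def simulate(u, A, B, F, G, C, D, rng):
    """Simulate feature trajectories from a scalar input sequence u.

    x: deterministic state driven by the input (the speech-driven part)
    z: stochastic state driven by Laplace noise (the individual-specific part)
    y: observed feature positions, a linear mix of both states
    """
    x = np.zeros(A.shape[0])
    z = np.zeros(F.shape[0])
    Y = np.empty((len(u), C.shape[0]))
    for t in range(len(u)):
        Y[t] = C @ x + D @ z
        w = rng.laplace(size=G.shape[1])   # assumed non-Gaussian noise
        x = A @ x + B * u[t]               # input never enters z
        z = F @ z + G @ w                  # noise never enters x
    return Y

def first_pc(Y):
    """Reduce multivariate trajectories to a scalar signal via PCA."""
    Yc = Y - Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Yc @ Vt[0]
```

Given simulated trajectories `Y`, a correlation of the kind plotted in Figure 14 (left) could be computed as `np.corrcoef(u, first_pc(Y))[0, 1]`. Keeping the input and the noise in separate state blocks is what lets the same input be replayed with a different stochastic part.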

Figure 14 - Correlation between Speech signal and feature trajectories

ACCOMPLISHMENTS

FUTURE DIRECTIONS

PEOPLE

FACULTY

Prof. Stefano Soatto

STUDENTS

Alessandro Bissacco
Jason Meltzer