Vision-Based Navigation
In the context of autonomous navigation of multiple coordinated robots, vision plays a crucial role because it is a remote, passive, distributed sensor. Ultimately, interaction with unknown, complex, dynamic environments will require some form of visual sensory data processing, and the UCLA vision lab is investigating fundamental problems related to processing dynamic visual information for distributed control and interaction tasks.
Multi-view Descriptors:
Determining which points in different images correspond to the same location in space is a classic problem in computer vision. A common approach to this task involves computing functions of the neighborhoods around points in single images, called descriptors. In situations where video sequences are available, such as robot navigation, correspondence reduces to tracking, which is simplified by the small changes in the image between consecutive frames. We can exploit this small-baseline tracking to develop multi-view descriptors, which incorporate the observed variability in geometry and illumination.
Our research over the past year has focused on developing multi-view feature descriptors and applying them to structure from motion and navigation tasks. The descriptors are derived by tracking points of interest across multiple frames of video, rectifying their image neighborhoods (patches) according to some geometric transformation (translational, affine, projective), and learning the variability in the resulting image patches via kernel principal component analysis (kernel PCA). Kernel PCA carries out the same computation as classical PCA, but in a higher-dimensional space defined implicitly by a kernel function, which computes inner products in that space. When a new patch is observed, its similarity to previously observed features can be computed by projecting its rectified version onto the basis defined by the learned kernel principal components.
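The kernel PCA step described above can be illustrated with a minimal sketch. It is not the lab's implementation: the Gaussian (RBF) kernel, the parameter gamma, the representation of each rectified patch as a flattened row vector, and the function names fit_kernel_pca and project are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    d2 = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def fit_kernel_pca(X, n_components, gamma):
    """Learn kernel principal components from rectified patches (rows of X)."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    # Center the data in feature space by centering the kernel matrix.
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    eigvals, eigvecs = np.linalg.eigh(Kc)
    # Keep the leading components (eigh returns ascending eigenvalues).
    idx = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[idx], eigvecs[:, idx]
    # Scale coefficients so each component has unit norm in feature space.
    alphas = eigvecs / np.sqrt(eigvals)
    return {"X": X, "K": K, "alphas": alphas, "gamma": gamma}

def project(model, Z):
    """Coordinates of new patches (rows of Z) on the learned components."""
    X, K = model["X"], model["K"]
    Kz = rbf_kernel(Z, X, model["gamma"])
    # Center the new kernel rows consistently with the training data.
    Kz_c = Kz - Kz.mean(axis=1, keepdims=True) - K.mean(axis=0)[None, :] + K.mean()
    return Kz_c @ model["alphas"]
```

In this sketch, the patches tracked for one feature would form the rows of X, and a newly observed patch would be compared with that feature through its coordinates returned by project. How those coordinates are turned into a similarity score is left open here.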
Tracking of articulated and deforming objects:
The ability to organize and integrate multimodal data derived from audio and visual sensors is crucial for almost all biological creatures, which rely on it to analyze their surroundings and make critical survival decisions. In particular, the need for sensor fusion arises in many situations where ambiguous auditory and visual information must be combined to support accurate perception. In the context of sensor networks, we have been working on the problem of modeling the dynamic relation between audio and video sensory data, with particular attention to applications in tracking of articulated and deforming objects. In this context, we considered the challenging task of modeling facial motion induced by speech. The problem is simple to state: we collect motion-capture data for an individual together with the associated speech waveform, and from these data build a model that can be used to generate novel synthetic facial motions associated with novel speech segments, for instance for an animated character.
Our approach is to model the face using decoupled shape and radiance elements. The shape element is defined by a number of salient points on the face that are photometrically distinct and can be reliably tracked across image frames. Deformations of the face image can then be described in terms of warping of the regions defined by the shape points.
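As an illustration of warping a region defined by shape points, the sketch below assumes that the regions are triangles and that each triangle deforms by an affine map. Both are assumptions made for this example, since the text does not specify the region type or the warp model, and the function names are hypothetical.

```python
import numpy as np

def triangle_affine(src, dst):
    """2x3 affine map taking triangle vertices src (3x2) to dst (3x2)."""
    S = np.hstack([src, np.ones((3, 1))])  # homogeneous source vertices
    M = np.linalg.solve(S, dst)            # solves S @ M = dst, M is 3x2
    return M.T

def warp_points(M, pts):
    """Apply the affine map M (2x3) to points given as rows of pts (Nx2)."""
    P = np.hstack([pts, np.ones((pts.shape[0], 1))])
    return P @ M.T
```

Given the tracked positions of three shape points in a reference frame and in the current frame, triangle_affine gives the warp for the region they enclose, and warp_points moves any location inside the reference triangle to its deformed position.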
Multi-view Descriptors:
We have developed experimental code to test our algorithms in a variety of scenarios and compared them with existing techniques. Figure 11 shows the results of a matching experiment involving changes in the geometry and pose of a flexible object. In Figure 12, we compare the multi-view technique with the scale-invariant feature transform (SIFT) method applied to a curved object. Figure 13 shows the results of a robot navigation experiment assisted by the multi-view feature descriptor; the circular curve indicates the location of the camera, and the thin lines indicate its direction.
Tracking of articulated and deforming objects:
As preliminary experiments show, there is a strong correlation between the speech signal and the dynamics of a number of salient features. Figure 14 (left) plots the correlation between the speech signal and the feature trajectories (both reduced to scalar signals via PCA). The motion of the points around the mouth is clearly strongly correlated with the speech input. We have shown how this dynamic relation can be captured by a linear dynamical system made up of two parts: a deterministic component driven by the speech waveform and a stochastic component driven by non-Gaussian noise. The rationale is that facial motion is the result of word utterances combined with physical characteristics of the face that are peculiar to each individual. Our goal is to decouple these two factors in our model, so that we can drive an individual with arbitrary speech sequences while retaining his or her distinctive character. In this sense, our approach is in the vein of separating "style" and "content", but in a dynamic context. While the dynamics of facial motion can be faithfully captured by a linear model with speech data as input, in order to model the subtleties associated with each individual we allow for a stochastic input drawn from a non-Gaussian distribution. Despite its linear structure, this model does not fall into the standard form suitable for off-the-shelf system identification algorithms, because of the decoupled structure of the input-to-state relationship and the non-Gaussian nature of the stochastic input. Under quite general assumptions, we derive an optimal system identification procedure for estimating the model parameters. The efficacy of the model in capturing the complexities of this time-dependent, multimodal data can be visually inspected in the facial image sequences generated from speech data (Figure 14, right).
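The two-part structure of the model can be made concrete with a small simulation. The sketch below is an illustrative reading of the description, not the identified model. It assumes two decoupled state blocks, one driven by speech features and one driven by Laplace-distributed noise (one possible non-Gaussian choice), whose outputs add to give the feature-point positions. All matrix names, the noise distribution, and the function simulate_face_motion are hypothetical.

```python
import numpy as np

def simulate_face_motion(A_d, B_d, C_d, A_s, C_s, speech, rng, noise_scale=0.1):
    """Simulate feature-point trajectories from speech features.

    Deterministic block: x_d[t+1] = A_d x_d[t] + B_d u[t]  (u = speech features)
    Stochastic block:    x_s[t+1] = A_s x_s[t] + w[t]      (w = Laplace noise)
    Output:              y[t]     = C_d x_d[t] + C_s x_s[t]
    """
    T = speech.shape[0]
    x_d = np.zeros(A_d.shape[0])
    x_s = np.zeros(A_s.shape[0])
    Y = np.zeros((T, C_d.shape[0]))
    for t in range(T):
        Y[t] = C_d @ x_d + C_s @ x_s
        # The speech input drives only the deterministic state.
        x_d = A_d @ x_d + B_d @ speech[t]
        # Non-Gaussian noise drives only the stochastic state.
        w = rng.laplace(scale=noise_scale, size=x_s.shape[0])
        x_s = A_s @ x_s + w
    return Y
```

Under these assumptions, running the same speech input with different random generators gives motions that share the speech-driven component but differ in the stochastic component, which is where the sketch places the individual-specific variation. Estimating the matrices from motion-capture data is the identification problem discussed above and is not attempted here.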
FACULTY
Prof. Stefano Soatto
STUDENTS
Alessandro Bissacco
Jason Meltzer