Voice Puppetry

The voice puppet can animate any face using just your voice. It uses expressive information in a voice track to control the entire face, from lips to eyebrows, neck to hairline. The mapping from vocal to facial gestures is learned from visual observation of real facial behavior and automatically incorporates long-term vocal and facial dynamics such as co-articulation. The animated face can be a 2D cartoon, a 3D model, or even a photo.

Voice puppetry is intended to replace the tedious and expensive methods currently used in cartoon animation, film special effects, and video post-production. In addition, it should create new opportunities for realistic facial motion in video games and computer entertainment.

Background & Objective:  Nearly all facial animation systems begin with a stream of phonemes (basic sound tokens), usually obtained by hand, from text, or, less successfully, from speech recognition. Typically, each phoneme is mapped to a viseme (facial pose), and the visemes are interpolated to produce an animation. It is widely understood that phonemes and visemes are inadequate because they discard information about expression and emotion in the whole-face gesture. Even for the limited problem of lip-syncing, phonemes and interpolation produce unnatural facial dynamics.
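To make the conventional pipeline concrete, here is a minimal Python sketch of phoneme-to-viseme lookup followed by linear interpolation. The viseme table, the two pose parameters (jaw opening and lip width), and the function name are illustrative assumptions, not values or names from any real animation system. Note that each pose depends only on the two neighboring phonemes, so no expression or emotion from the voice can reach the face.

```python
# Hypothetical viseme table: phoneme -> (jaw_open, lip_width).
# The numbers are made up for illustration, not measured from real faces.
VISEMES = {
    "sil": (0.0, 0.5),
    "M": (0.0, 0.4),
    "AA": (1.0, 0.6),
    "IY": (0.3, 1.0),
}


def interpolate_visemes(phonemes, frames_per_phoneme):
    """Linearly interpolate between successive viseme key poses."""
    keys = [VISEMES[p] for p in phonemes]
    frames = []
    for start, end in zip(keys, keys[1:]):
        for i in range(frames_per_phoneme):
            t = i / frames_per_phoneme  # 0.0 at the start pose, approaching 1.0
            frames.append(tuple((1 - t) * a + t * b for a, b in zip(start, end)))
    frames.append(keys[-1])  # finish exactly on the last key pose
    return frames


frames = interpolate_visemes(["sil", "M", "AA", "IY"], frames_per_phoneme=4)
```

The fixed frame count per phoneme is a further simplification; real systems would take timing from the phoneme stream.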

Technical Discussion:  We learn the natural dynamics of the whole face by modeling facial motion with an entropically estimated hidden Markov model. Entropic estimation produces a compact, sparse, and minimally ambiguous state machine, essentially discovering key facial states, dynamics, and timing. Given a new vocal track, the system calculates a trajectory through facial configuration space that is maximally compatible with both the learned facial dynamics and the newly observed acoustic features. The resulting trajectory can be used to drive a variety of animations, ranging from 2D cartoons to 3D computer graphics to 2.5D image warps. Our current system animates a texture-mapped 3D model, producing a surprisingly good illusion of realism. We have also used the learned model to animate non-human heads and for extremely low bit-rate facial motion coding (as low as 4 bits/frame!).
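The idea of finding a facial trajectory that agrees with both learned dynamics and observed audio can be illustrated with a toy example. The sketch below is not the system described here: it swaps in standard Viterbi decoding on a hand-written two-state HMM with discrete acoustic symbols, whereas the actual system uses an entropically estimated model and acoustic features. All state names, probabilities, and pose values are assumptions chosen for illustration.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable hidden-state path for an observation sequence."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][o]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state to recover the full path
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def logs(table):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in table.items()}


# Hypothetical toy model: two facial states, two quantized acoustic symbols.
STATES = ["closed", "open"]
LOG_START = logs({"closed": 0.8, "open": 0.2})
LOG_TRANS = {"closed": logs({"closed": 0.7, "open": 0.3}),
             "open": logs({"closed": 0.3, "open": 0.7})}
LOG_EMIT = {"closed": logs({"quiet": 0.9, "loud": 0.1}),
            "open": logs({"quiet": 0.2, "loud": 0.8})}
JAW_POSE = {"closed": 0.0, "open": 1.0}  # assumed jaw opening per facial state

audio = ["quiet", "loud", "loud", "quiet"]
path = viterbi(audio, STATES, LOG_START, LOG_TRANS, LOG_EMIT)
trajectory = [JAW_POSE[s] for s in path]
```

Here the transition table plays the role of the learned facial dynamics and the emission table ties each facial state to the sounds that tend to accompany it; the decoded path is the sequence that best satisfies both. A discrete state path is much coarser than a smooth trajectory through facial configuration space, so treat this only as an intuition aid.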

Contact:  Matthew Brand

Technology Areas:
Graphics
Artificial Intelligence

Modification Date:  September 12, 2007