
Audio-driven Character Animation

Abstract
Generating believable character animation is a fundamentally important problem in computer graphics and computer vision. It also has a diverse set of applications, ranging from entertainment (e.g., films and games) and medicine (e.g., facial therapy and prosthetics) to mixed reality and education (e.g., language/speech training and cyber-assistants). All of these applications are empowered by the ability to model and animate characters, human or non-human, convincingly. Existing key-framing and performance-capture approaches for creating animations, especially facial animations, are either laborious or hard to edit. In particular, automatically producing expressive animations from input speech remains an open challenge. In this thesis, I propose novel deep-learning-based approaches to produce speech-audio-driven character animations, including talking-head animations for character face rigs and portrait images, and reenacted gesture animations for natural human speech videos.

First, I propose a neural network architecture, called VisemeNet, that can automatically animate an input face rig from audio. The network has three stages: one that learns to predict a sequence of phoneme groups from audio; another that learns to predict the geometric locations of important facial landmarks from audio; and a final stage that combines the outcomes of the previous stages to produce animation motion curves for FACS-based (Facial Action Coding System) face rigs.

Second, I propose MakeItTalk, a method that takes as input a portrait image of a face along with audio and produces an expressive, synchronized talking-head animation. The portrait image can range from artistic cartoons to real human faces. In addition, the method generates whole-head motion dynamics matching the stresses and pauses of the audio. The key insight of the method is to disentangle the content and the speaker identity in the input audio signal and drive the animation from both. The content is used for robust synchronization of the lips and nearby facial regions, while the speaker information captures the rest of the facial expressions and the head motion dynamics that are important for generating expressive talking-head animations. I also show that MakeItTalk generalizes to new audio clips and face images not seen during training. Both VisemeNet and MakeItTalk lead to much more expressive talking-head animations with higher overall quality than the state of the art.

Lastly, I propose a method that generates speech-gesture animation by reenacting a given video to match a target speech audio. The key idea is to split and reassemble clips from an existing reference video through a novel video motion graph that encodes valid transitions between clips. To seamlessly connect different clips in the reenactment, I propose a pose-aware video blending network that synthesizes video frames around the stitched frames between two clips. Moreover, the method incorporates an audio-based gesture search algorithm to find the optimal order of the reenacted frames. The method generates reenactments that are consistent with both the rhythm and the content of the speech. The resulting synthesized videos have much higher quality and consistency with the target audio than previous work and baselines.
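To make the MakeItTalk-style pipeline described above more concrete, the following is a minimal PyTorch sketch of the general idea: a content encoder and a speaker-identity encoder both derived from the audio jointly drive per-frame facial-landmark displacements for a portrait. All module names, layer sizes, and the exact wiring here are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch (not the authors' code) of an audio-driven talking-head pipeline:
# a speaker-agnostic content embedding handles lip synchronization, while a
# clip-level speaker embedding modulates the remaining expression and head motion.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Maps per-frame audio features to a speaker-agnostic content embedding."""
    def __init__(self, audio_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, audio):                 # audio: (B, T, audio_dim)
        out, _ = self.rnn(audio)              # (B, T, 2 * hidden)
        return out

class SpeakerEncoder(nn.Module):
    """Summarizes the whole clip into a single speaker-identity embedding."""
    def __init__(self, audio_dim=80, embed=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(audio_dim, embed), nn.ReLU(),
                                 nn.Linear(embed, embed))

    def forward(self, audio):                 # audio: (B, T, audio_dim)
        return self.net(audio.mean(dim=1))    # pool over time -> (B, embed)

class LandmarkDriver(nn.Module):
    """Predicts per-frame displacements for 68 facial landmarks (x, y, z)."""
    def __init__(self, content_dim=512, speaker_dim=128, n_landmarks=68):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(content_dim + speaker_dim, 256), nn.ReLU(),
                                  nn.Linear(256, n_landmarks * 3))

    def forward(self, content, speaker):      # content: (B, T, C), speaker: (B, S)
        T = content.size(1)
        spk = speaker.unsqueeze(1).expand(-1, T, -1)      # broadcast over frames
        disp = self.head(torch.cat([content, spk], dim=-1))
        return disp.view(disp.size(0), T, -1, 3)          # (B, T, 68, 3)

# Usage: predict landmark displacements for 100 frames of mel-style audio features;
# a separate image-warping stage (not shown) would animate the portrait from them.
audio = torch.randn(1, 100, 80)
landmark_disp = LandmarkDriver()(ContentEncoder()(audio), SpeakerEncoder()(audio))
print(landmark_disp.shape)                    # torch.Size([1, 100, 68, 3])
```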
Type
dissertation
Date
2021-09
Publisher
License
http://creativecommons.org/licenses/by/4.0/