Author ORCID Identifier

https://orcid.org/0000-0002-5070-6330

Access Type

Open Access Dissertation

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Degree Program

Computer Science

Year Degree Awarded

2021

Month Degree Awarded

September

First Advisor

Evangelos Kalogerakis

Subject Categories

Artificial Intelligence and Robotics | Graphics and Human Computer Interfaces

Abstract

Generating believable character animations is a fundamentally important problem in computer graphics and computer vision. It also has a diverse set of applications, ranging from entertainment (e.g., films, games) and medicine (e.g., facial therapy and prosthetics) to mixed reality and education (e.g., language/speech training and cyber-assistants). All of these applications are empowered by the ability to model and animate characters, human or non-human, convincingly. Existing key-framing or performance-capture approaches for creating animations, especially facial animations, are either laborious or hard to edit. In particular, automatically producing expressive animations from input speech remains an open challenge. In this thesis, I propose novel deep-learning-based approaches to produce speech-audio-driven character animations, including talking-head animations for character face rigs and portrait images, and reenacted gesture animations for natural human speech videos.

First, I propose a neural network architecture, called VisemeNet, that automatically animates an input face rig from audio. The network has three stages: one that learns to predict a sequence of phoneme groups from audio; another that learns to predict the geometric locations of important facial landmarks from audio; and a final stage that combines the outcomes of the previous stages to produce animation motion curves for FACS-based (Facial Action Coding System) face rigs.

Second, I propose MakeItTalk, a method that takes as input a portrait image of a face along with audio and produces an expressive, synchronized talking-head animation. The portrait image can range from artistic cartoons to real human faces. In addition, the method generates whole-head motion dynamics matching the audio stresses and pauses. The key insight of the method is to disentangle the content and the speaker identity in the input audio signal and drive the animation from both. The content is used for robust synchronization of the lips and nearby facial regions. The speaker information is used to capture the rest of the facial expressions and the head motion dynamics that are important for generating expressive talking-head animations. I also show that MakeItTalk generalizes to new audio clips and face images not seen during training. Both VisemeNet and MakeItTalk lead to much more expressive talking-head animations with higher overall quality compared to the state of the art.

Lastly, I propose a method that generates speech gesture animation by reenacting a given video to match a target speech audio. The key idea is to split and re-assemble clips from an existing reference video through a novel video motion graph that encodes valid transitions between clips. To seamlessly connect different clips in the reenactment, I propose a pose-aware video blending network that synthesizes video frames around the stitched frames between two clips. Moreover, the method incorporates an audio-based gesture-searching algorithm to find the optimal order of the reenacted frames. The method generates reenactments that are consistent with both the audio rhythm and the speech content. The resulting synthesized videos have much higher quality and consistency with the target audio compared to previous work and baselines.
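To make the three-stage structure of the first method concrete, here is a minimal PyTorch-style sketch. It is not the thesis implementation: the class name, layer types, and all dimensions (audio feature size, number of phoneme groups, landmarks, and rig parameters) are illustrative assumptions; only the audio-to-phonemes, audio-to-landmarks, and fusion-to-rig-curves structure follows the description above.

```python
# Minimal sketch (NOT the released VisemeNet code) of a three-stage
# audio-to-rig pipeline; all sizes and names below are assumptions.
import torch
import torch.nn as nn

class ThreeStageAudioToRig(nn.Module):
    def __init__(self, n_audio_feats=65, n_phoneme_groups=20,
                 n_landmarks=76, n_rig_params=29, hidden=256):
        super().__init__()
        # Stage 1: audio features -> per-frame phoneme-group scores.
        self.phoneme_lstm = nn.LSTM(n_audio_feats, hidden, batch_first=True)
        self.phoneme_head = nn.Linear(hidden, n_phoneme_groups)
        # Stage 2: audio features -> per-frame 2D facial landmark positions.
        self.landmark_lstm = nn.LSTM(n_audio_feats, hidden, batch_first=True)
        self.landmark_head = nn.Linear(hidden, n_landmarks * 2)
        # Stage 3: fuse both predictions into FACS-style rig motion curves.
        self.rig_lstm = nn.LSTM(n_phoneme_groups + n_landmarks * 2,
                                hidden, batch_first=True)
        self.rig_head = nn.Linear(hidden, n_rig_params)

    def forward(self, audio_feats):            # (batch, time, n_audio_feats)
        p, _ = self.phoneme_lstm(audio_feats)
        phonemes = self.phoneme_head(p)        # (batch, time, n_phoneme_groups)
        l, _ = self.landmark_lstm(audio_feats)
        landmarks = self.landmark_head(l)      # (batch, time, n_landmarks * 2)
        fused, _ = self.rig_lstm(torch.cat([phonemes, landmarks], dim=-1))
        rig_curves = self.rig_head(fused)      # (batch, time, n_rig_params)
        return phonemes, landmarks, rig_curves
```

The audio-based gesture search over the video motion graph can likewise be pictured as a dynamic-programming (Viterbi-style) search: choose one clip per audio segment so that consecutive clips form valid graph transitions while matching the audio. The sketch below is a generic formulation under assumed inputs (clip identifiers, a set of valid transitions, and placeholder `match_cost` / `trans_cost` functions), not the thesis algorithm.

```python
# Generic Viterbi-style search over a motion graph; the cost functions are
# placeholders (e.g., rhythm mismatch for match_cost, pose discontinuity at
# the stitch for trans_cost). Assumes some finite-cost path exists.
import math

def reenactment_order(clips, transitions, audio_segments, match_cost, trans_cost):
    # best[i][c] = (cheapest cost of ending segment i with clip c, previous clip)
    best = [{c: (match_cost(c, audio_segments[0]), None) for c in clips}]
    for seg in audio_segments[1:]:
        layer = {}
        for c in clips:
            prev_cost, prev_clip = min(
                ((best[-1][p][0] + trans_cost(p, c), p)
                 for p in clips if (p, c) in transitions),
                key=lambda t: t[0], default=(math.inf, None))
            layer[c] = (prev_cost + match_cost(c, seg), prev_clip)
        best.append(layer)
    # Backtrack the cheapest full path.
    clip = min(best[-1], key=lambda c: best[-1][c][0])
    order = [clip]
    for i in range(len(best) - 1, 0, -1):
        clip = best[i][clip][1]
        order.append(clip)
    return list(reversed(order))
```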

DOI

https://doi.org/10.7275/23710166

Creative Commons License

Creative Commons Attribution 4.0 License
This work is licensed under a Creative Commons Attribution 4.0 License.
