Learning Image-Based VR Facial Animation and Face Reenactment

| dc.contributor.advisor | Behnke, Sven | |
| dc.contributor.author | Rochow, Andre | |
| dc.date.accessioned | 2025-12-05T14:43:01Z | |
| dc.date.available | 2025-12-05T14:43:01Z | |
| dc.date.issued | 05.12.2025 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.11811/13722 | |
| dc.description.abstract | The field of facial animation deals with the manipulation of facial representations, such as images or meshes, primarily with the objective of generating a natural and consistent animation sequence. This thesis focuses on images and presents methods for Virtual Reality (VR) facial animation and face reenactment. Facial animation in virtual reality environments (VR facial animation) is essential for applications that require clear visibility of the user's face and the ability to convey emotional signals and expressions. The primary challenge is to reconstruct the complete face of an individual wearing a head-mounted display (HMD). In our case, all information relevant to the animation is obtained from one mouth camera mounted below the HMD and two eye cameras inside the HMD. The principal use case for our methods is to animate the face of an operator who controls our robotic avatar system at the ANA Avatar XPRIZE competition. For the semifinals, we initially propose a real-time capable pipeline with very fast adaptation to specific operators. The method can be trained on talking-head datasets and generalizes to unseen operators, while requiring only a quick enrollment step during which two short videos are captured. The first video is a sequence of source images of the operator without the VR headset, which contains all the important operator-specific appearance information. During inference, we then use the keypoint information extracted from a mouth camera and two eye cameras to estimate the operator's target expression, to which we map the appearance of a source still image. To improve the mouth expression accuracy, we dynamically select an auxiliary expression frame from the captured sequence. This selection is performed by learning to transform the current mouth keypoints into the source image space, where the alignment can be determined accurately. Based on this method, we propose an extension that was used in the ANA Avatar XPRIZE finals. We significantly improve the temporal consistency and animation accuracy. In addition, we are able to represent a much broader range of facial expressions by resolving keypoint ambiguities that occur in the method used in the semifinals. Purely keypoint-driven animation approaches struggle with the complexity of facial movements. We present a hybrid method that uses both keypoints and direct visual guidance from a mouth camera. Instead of using only one source image, multiple source images are selected with the intention of covering different facial expressions. We employ an attention mechanism to determine the importance of each source image. To resolve keypoint ambiguities and animate a broader range of mouth expressions, we propose to inject visual mouth camera information into the latent space. We enable training on large-scale talking-head datasets by simulating the mouth camera input with its perspective differences and facial deformations. We then approach the task of face reenactment, which involves transferring the head motion and facial expressions from a driving video to the appearance of a source image, which may show a different person (cross-reenactment). Most existing methods are CNN-based and estimate optical flow from the source image to the current driving frame. After deforming the source image into the driving frame, it is inpainted and refined to produce the output animation. We propose a transformer-based encoder for computing a set-latent representation of the source image(s). We then predict the output color of a query pixel using a transformer-based decoder, which is conditioned on keypoints and a facial expression vector extracted from the driving frame. Latent representations of the source person are learned in a self-supervised manner, factorizing their appearance, head pose, and facial expressions. They are therefore well suited for cross-reenactment. In contrast to most related work, our method naturally extends to multiple source images and can thus adapt to person-specific facial dynamics. We also propose data augmentation and regularization schemes that are necessary to prevent overfitting and to support generalizability of the learned representations. We evaluate our approach in a randomized user study. The results indicate superior performance compared to previous state-of-the-art methods in terms of motion transfer quality and temporal consistency. Finally, we demonstrate in a separate experiment that the method can be adapted for the VR facial animation task, while simultaneously reducing the preprocessing time significantly in comparison to our previous approaches. (See the illustrative sketches below the record for the auxiliary frame selection, attention-based source fusion, and query-pixel decoding.) | en |
| dc.language.iso | eng | |
| dc.rights | In Copyright | |
| dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | |
| dc.subject | Facial Animation | |
| dc.subject | VR Facial Animation | |
| dc.subject | Face Reenactment | |
| dc.subject | Motion Transfer | |
| dc.subject.ddc | 004 Informatik | |
| dc.title | Learning Image-Based VR Facial Animation and Face Reenactment | |
| dc.type | Dissertation oder Habilitation | |
| dc.publisher.name | Universitäts- und Landesbibliothek Bonn | |
| dc.publisher.location | Bonn | |
| dc.rights.accessRights | openAccess | |
| dc.identifier.urn | https://nbn-resolving.org/urn:nbn:de:hbz:5-86712 | |
| dc.relation.arxiv | 2304.12051 | |
| dc.relation.arxiv | 2312.09750 | |
| dc.relation.arxiv | 2404.09736 | |
| dc.relation.doi | https://doi.org/10.1109/IROS47612.2022.9981892 | |
| dc.relation.doi | https://doi.org/10.1109/IROS55552.2023.10342522 | |
| dc.relation.doi | https://doi.org/10.1109/CVPR52733.2024.00737 | |
| ulbbn.pubtype | Erstveröffentlichung | |
| ulbbnediss.affiliation.name | Rheinische Friedrich-Wilhelms-Universität Bonn | |
| ulbbnediss.affiliation.location | Bonn | |
| ulbbnediss.thesis.level | Dissertation | |
| ulbbnediss.dissID | 8671 | |
| ulbbnediss.date.accepted | 24.11.2025 | |
| ulbbnediss.institute | Mathematisch-Naturwissenschaftliche Fakultät : Fachgruppe Informatik / Institut für Informatik | |
| ulbbnediss.fakultaet | Mathematisch-Naturwissenschaftliche Fakultät | |
| dc.contributor.coReferee | Blanz, Volker | |
| ulbbnediss.contributor.orcid | https://orcid.org/0000-0003-1536-6219 |
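
The abstract describes selecting an auxiliary expression frame by learning to transform the current mouth-camera keypoints into the source image space and finding the best-aligned enrollment frame. The following is a minimal sketch of that selection step, assuming a small learned mapping network and 2D mouth keypoints; all names, shapes, and the plain MLP mapper are illustrative assumptions, not the thesis implementation.

```python
# Sketch of auxiliary-frame selection: map the current mouth-camera keypoints
# into the source image space with a small learned network, then pick the
# enrollment frame whose keypoints align best. Module names, shapes, and the
# MLP mapper are assumptions for illustration only.
import torch
import torch.nn as nn


class KeypointMapper(nn.Module):
    """Hypothetical MLP that maps mouth-camera keypoints (K x 2) into the
    source image keypoint space (K x 2)."""

    def __init__(self, num_keypoints: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_keypoints * 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_keypoints * 2),
        )
        self.num_keypoints = num_keypoints

    def forward(self, kp: torch.Tensor) -> torch.Tensor:
        # kp: (B, K, 2) -> (B, K, 2) in source image space
        b = kp.shape[0]
        out = self.net(kp.reshape(b, -1))
        return out.reshape(b, self.num_keypoints, 2)


def select_auxiliary_frame(mapper: KeypointMapper,
                           camera_kp: torch.Tensor,
                           enrollment_kp: torch.Tensor) -> int:
    """Return the index of the enrollment frame whose (source-space) mouth
    keypoints are closest to the mapped mouth-camera keypoints.

    camera_kp:     (K, 2) keypoints from the HMD mouth camera.
    enrollment_kp: (N, K, 2) keypoints of the N captured enrollment frames.
    """
    with torch.no_grad():
        mapped = mapper(camera_kp.unsqueeze(0))        # (1, K, 2)
        dists = (enrollment_kp - mapped).norm(dim=-1)  # (N, K)
        per_frame_error = dists.mean(dim=-1)           # (N,)
    return int(per_frame_error.argmin())


if __name__ == "__main__":
    K, N = 20, 60                    # 20 mouth keypoints, 60 enrollment frames
    mapper = KeypointMapper(num_keypoints=K)
    camera_kp = torch.rand(K, 2)
    enrollment_kp = torch.rand(N, K, 2)
    print("selected auxiliary frame:",
          select_auxiliary_frame(mapper, camera_kp, enrollment_kp))
```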
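The hybrid method for the finals weights multiple source images with an attention mechanism. The sketch below shows one plausible realization using scaled dot-product attention over per-source features, with a feature of the current driving input as the query; the feature extractor, dimensions, and single-head attention are assumptions and do not reproduce the architecture used in the thesis.

```python
# Sketch of attention-weighted fusion over multiple source images: a feature of
# the current driving input queries per-source-image features, and the resulting
# weights blend the sources. Dimensions and single-head attention are
# illustrative assumptions only.
import torch
import torch.nn as nn


class SourceAttentionFusion(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.to_q = nn.Linear(feat_dim, feat_dim)
        self.to_k = nn.Linear(feat_dim, feat_dim)
        self.scale = feat_dim ** -0.5

    def forward(self, driving_feat: torch.Tensor, source_feats: torch.Tensor):
        """driving_feat: (B, D) pooled feature of the current driving input.
        source_feats:   (B, S, D) one feature per source image.
        Returns the fused source feature (B, D) and the weights (B, S)."""
        q = self.to_q(driving_feat).unsqueeze(1)                             # (B, 1, D)
        k = self.to_k(source_feats)                                          # (B, S, D)
        attn = torch.softmax((q @ k.transpose(1, 2)) * self.scale, dim=-1)   # (B, 1, S)
        fused = (attn @ source_feats).squeeze(1)                             # (B, D)
        return fused, attn.squeeze(1)


if __name__ == "__main__":
    B, S, D = 2, 4, 256                  # batch, number of source images, feature dim
    fusion = SourceAttentionFusion(feat_dim=D)
    driving_feat = torch.randn(B, D)     # e.g. pooled mouth-camera / keypoint features
    source_feats = torch.randn(B, S, D)  # e.g. encoded source images
    fused, weights = fusion(driving_feat, source_feats)
    print(fused.shape, weights.shape)    # torch.Size([2, 256]) torch.Size([2, 4])
```

The attention weights make the contribution of each source image explicit, which matches the abstract's goal of covering different facial expressions with several enrollment images.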
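For face reenactment, the abstract describes a transformer encoder that builds a set-latent representation of the source image(s) and a transformer decoder that predicts the color of each query pixel, conditioned on keypoints and a facial expression vector from the driving frame. The sketch below illustrates this query-pixel decoding pattern with standard PyTorch transformer layers; the patch tokenization, conditioning scheme, and all sizes are assumed for illustration and are not the thesis architecture.

```python
# Sketch of query-pixel decoding against a set-latent source representation:
# source patch tokens go through a transformer encoder; each output pixel is a
# query token (its 2D position plus driving keypoints and an expression vector)
# that cross-attends to the set latent and is decoded to an RGB value.
# Patchification, conditioning, and all dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class QueryPixelReenactor(nn.Module):
    def __init__(self, d_model: int = 256, patch_dim: int = 3 * 8 * 8,
                 num_keypoints: int = 68, expr_dim: int = 64):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=4)
        # Query token: pixel position (2) + flattened driving keypoints + expression vector.
        self.query_embed = nn.Linear(2 + num_keypoints * 2 + expr_dim, d_model)
        self.to_rgb = nn.Linear(d_model, 3)

    def forward(self, source_patches, query_xy, driving_kp, expr):
        """source_patches: (B, P, patch_dim) tokens from the source image(s).
        query_xy:          (B, Q, 2) normalized positions of the output pixels.
        driving_kp:        (B, K, 2) keypoints of the driving frame.
        expr:              (B, E) facial expression vector of the driving frame."""
        set_latent = self.encoder(self.patch_embed(source_patches))      # (B, P, D)
        b, q, _ = query_xy.shape
        cond = torch.cat([driving_kp.flatten(1), expr], dim=-1)          # (B, 2K+E)
        queries = torch.cat([query_xy, cond.unsqueeze(1).expand(b, q, -1)], dim=-1)
        tokens = self.decoder(self.query_embed(queries), set_latent)     # (B, Q, D)
        return torch.sigmoid(self.to_rgb(tokens))                        # (B, Q, 3) RGB


if __name__ == "__main__":
    model = QueryPixelReenactor()
    rgb = model(torch.randn(1, 64, 192),   # 64 source patch tokens (8x8x3 = 192)
                torch.rand(1, 1024, 2),    # 1024 queried pixel positions
                torch.rand(1, 68, 2),      # driving keypoints
                torch.randn(1, 64))        # expression vector
    print(rgb.shape)                       # torch.Size([1, 1024, 3])
```

Because the decoder attends to a set of source tokens rather than a single feature map, adding further source images only enlarges the set latent, which mirrors the abstract's point that the method naturally extends to multiple source images.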