Hyeongwoo Kim¹    Pablo Garrido²    Ayush Tewari¹    Weipeng Xu¹    Justus Thies³
Matthias Nießner³    Patrick Pérez²    Christian Richardt⁴    Michael Zollhöfer⁵    Christian Theobalt¹

¹ MPI Informatik       ² Technicolor       ³ TU Munich       ⁴ University of Bath       ⁵ Stanford University

ACM Transactions on Graphics (Proceedings of SIGGRAPH 2018)


Abstract

We present a novel approach that enables photo-realistic re-animation of portrait videos using only an input video. In contrast to existing approaches that are restricted to manipulations of facial expressions only, we are the first to transfer the full 3D head position, head rotation, facial expression, eye gaze, and eye blinking from a source actor to a portrait video of a target actor. The core of our approach is a generative neural network with a novel space-time architecture. The network takes as input synthetic renderings of a parametric face model, based on which it predicts photo-realistic video frames for a given target actor. The realism in this rendering-to-video transfer is achieved by careful adversarial training, and as a result, we can create modified target videos that mimic the behavior of the synthetically created input. In order to enable source-to-target video re-animation, we render a synthetic target video with the reconstructed head animation parameters from a source video, and feed it into the trained network – thus taking full control of the target. With the ability to freely recombine source and target parameters, we demonstrate a large variety of video rewrite applications without explicitly modeling hair, body or background. For instance, we can reenact the full head using interactive user-controlled editing, and realize high-fidelity visual dubbing. To demonstrate the high quality of our output, we conduct an extensive series of experiments and evaluations, where, for instance, a user study shows that our video edits are hard to detect.
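The core technical idea above, a conditional generative network trained adversarially to translate synthetic face-model renderings into photo-realistic video frames, can be illustrated with a short training-loop sketch. The PyTorch code below is not the authors' implementation: the class names (RenderToVideoGenerator, PatchDiscriminator), the layer sizes, the size of the space-time conditioning window, and the L1 weight are all illustrative assumptions in the spirit of pix2pix-style image-to-image translation.

import torch
import torch.nn as nn

WINDOW = 11  # assumed number of conditioning frames in the space-time input stack

class RenderToVideoGenerator(nn.Module):
    """Maps a stack of synthetic face-model renderings (WINDOW frames, 3 channels
    each) to one photo-realistic RGB output frame. Placeholder encoder-decoder."""
    def __init__(self, in_channels=3 * WINDOW, out_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, out_channels, 4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, renderings):
        return self.net(renderings)

class PatchDiscriminator(nn.Module):
    """Scores whether a frame looks real, conditioned on the input renderings."""
    def __init__(self, in_channels=3 * WINDOW + 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 1, 4, stride=1, padding=1),  # per-patch real/fake logits
        )

    def forward(self, renderings, frame):
        return self.net(torch.cat([renderings, frame], dim=1))

def training_step(G, D, opt_G, opt_D, renderings, real_frame, l1_weight=100.0):
    """One adversarial update plus an L1 reconstruction term (pix2pix-style)."""
    bce, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()

    # Discriminator: push real (renderings, frame) pairs towards 1, fakes towards 0.
    fake_frame = G(renderings).detach()
    d_real = D(renderings, real_frame)
    d_fake = D(renderings, fake_frame)
    loss_D = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator: fool the discriminator while staying close to the ground-truth frame.
    fake_frame = G(renderings)
    d_fake = D(renderings, fake_frame)
    loss_G = bce(d_fake, torch.ones_like(d_fake)) + l1_weight * l1(fake_frame, real_frame)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()

    return loss_D.item(), loss_G.item()

At test time, the same generator would simply be fed conditioning renderings produced from the source actor's reconstructed head, expression and eye parameters, which is what turns the trained rendering-to-video network into a tool for source-to-target re-animation.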

Copyright

© The Authors, 2018. This is the authors’ version of the work. It is posted here for your personal use, not for redistribution. The definitive version will be published in ACM Transactions on Graphics.

Please note: Important information about our work

Goal: Our aim is to demonstrate the capabilities of modern computer vision and graphics technology and to convey them in an approachable and fun way.

Context: We would like to emphasize that computer-generated video has been part of feature films for over 30 years. Virtually every high-end movie production contains a significant percentage of computer-generated imagery (CGI), from Lord of the Rings to Benjamin Button. These results are hard to distinguish from reality, and it often goes unnoticed that this content is not real. The synthetic modification of video clips has thus been possible for a long time, but the process was time-consuming and required domain experts. Producing even a short synthetic video clip costs millions and takes months of work, even for professionally trained artists, since they have to manually create and animate vast amounts of 3D content.

Progress: Over the last few years, approaches have been developed that enable the creation of realistic synthetic content from much less input, e.g., a single video of a person or a collection of photos. With these approaches, much less work is required to synthetically create or modify a video clip, which makes them accessible to a broader, non-expert audience for the first time.

Applications: There are many possible positive use cases for our technology. One use case is post-production in the film industry, for example dubbing. Dubbing is an important post-production step in filmmaking that replaces the voice of the original actor with the voice of a dubbing actor speaking in another language. Production-level dubbing requires well-trained dubbers and extensive manual interaction. Good synchronization between speech and video is mandatory, since viewers are very sensitive to discrepancies between the auditory and visual channels. Approaches such as ours make it possible to adapt the visual channel directly to the new audio track, which can help reduce these discrepancies. We believe that our technique might also pave the way to live dubbing in a teleconferencing scenario.

Misuses: Unfortunately, besides the many positive use cases, such technology can also be misused. For example, combining photo-real synthesis of facial imagery with a voice impersonator or a voice synthesis system would enable the generation of made-up video content that could potentially be used to defame people or to spread so-called ‘fake news’. Currently, the modified videos still exhibit many artifacts, which makes most forgeries easy to spot. It is hard to predict at what point in time such ‘fake’ videos will become indistinguishable from real content to the human eye.

Implications: We believe that the capabilities of modern video modification approaches have to be openly discussed. We hope that the numerous demonstrations of our reenactment systems will make the general public aware of the capabilities of modern technology for video generation and editing. This will enable people to think more critically about the video content they consume every day, especially if there is no proof of origin.

Detection: The recently presented systems demonstrate the need for sophisticated fraud detection and watermarking algorithms. We believe that the field of digital forensics should, and will, receive much more attention in the future, in order to develop approaches that can automatically prove or disprove the authenticity of a video clip. This will lead to new methods that can spot such modifications even when we humans cannot detect them with our own eyes. We believe that more funding for research projects that aim at forgery detection is a good first step towards tackling these challenges.

Bibtex

@article{DeepVideoPortraits,
  author    = {Hyeongwoo Kim and Pablo Garrido and Ayush Tewari and Weipeng Xu and Justus Thies and Matthias Nie{\ss}ner and Patrick P{\'e}rez and Christian Richardt and Michael Zollh{\"o}fer and Christian Theobalt},
  title     = {Deep Video Portraits},
  journal   = {ACM Transactions on Graphics},
  year      = {2018},
  volume    = {37},
  number    = {4},
  pages     = {163:1--14},
  month     = aug,
  issn      = {0730-0301},
  doi       = {10.1145/3197517.3201283},
  url       = {http://richardt.name/publications/deep-video-portraits/},
}