
Follow-Your-Emoji: Tencent's Research on Generating Talking Face Videos

Instead of audio-driven generation, the team achieves facial expression transfer: anyone's facial expressions can be mapped onto a reference photo to produce a video. This makes it possible not only to generate videos of people talking, but also to transfer expressions faithfully even without any audio. The transfer works across real people, cartoons, sculptures, and even animals.



To address these challenges, Follow-Your-Emoji equips a powerful Stable Diffusion model with two carefully designed techniques.

Specifically, we first employ a novel explicit motion signal, namely expression-aware landmarks, to guide the animation process.

We find that these landmarks not only ensure accurate motion alignment between the reference portrait and the target motion during inference, but also improve the depiction of exaggerated expressions (e.g., large pupil movements) while avoiding identity leakage.
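To make the idea of an explicit landmark-based motion signal concrete, here is a minimal sketch of extracting per-frame face landmarks from a driving video and rasterizing them into conditioning maps for a diffusion-based animator. It assumes MediaPipe Face Mesh as the landmark detector; the paper's exact "expression-aware" landmark set and alignment procedure may differ.

```python
# Sketch: extract per-frame landmarks from a driving video and draw them as
# conditioning maps (white points on black). MediaPipe Face Mesh is assumed;
# the paper's landmark definition may differ.
import cv2
import numpy as np
import mediapipe as mp

def landmark_maps(video_path: str, size: int = 512) -> list[np.ndarray]:
    """Return one landmark image per frame of the driving video."""
    maps = []
    cap = cv2.VideoCapture(video_path)
    with mp.solutions.face_mesh.FaceMesh(
        static_image_mode=False,
        refine_landmarks=True,  # adds iris landmarks, useful for pupil motion
    ) as mesh:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            result = mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            canvas = np.zeros((size, size, 3), dtype=np.uint8)
            if result.multi_face_landmarks:
                for lm in result.multi_face_landmarks[0].landmark:
                    x, y = int(lm.x * size), int(lm.y * size)
                    cv2.circle(canvas, (x, y), 1, (255, 255, 255), -1)
            maps.append(canvas)
    cap.release()
    return maps
```

In a full pipeline, these maps would additionally be retargeted to the reference portrait's face shape before conditioning, which is what allows them to carry expression rather than the driving subject's identity.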

Then, we propose a facial fine-grained loss that uses expression and facial masks to improve the model's perception of subtle expressions and its ability to reconstruct the appearance of the reference portrait.
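A hedged sketch of what such a mask-weighted objective could look like follows: reconstruction errors inside the face region and the expression-critical regions (eyes, mouth) are weighted more heavily than the rest of the image. The exact formulation, mask sources, and weights in the paper may differ.

```python
# Sketch of a mask-weighted reconstruction loss in the spirit of the facial
# fine-grained loss; weights w_face / w_expr are illustrative, not the paper's.
import torch
import torch.nn.functional as F

def fine_grained_loss(pred: torch.Tensor, target: torch.Tensor,
                      face_mask: torch.Tensor, expr_mask: torch.Tensor,
                      w_face: float = 1.0, w_expr: float = 2.0) -> torch.Tensor:
    """pred/target: (B, C, H, W); face_mask/expr_mask: (B, 1, H, W) in [0, 1]."""
    base = F.mse_loss(pred, target)                        # global reconstruction
    face = F.mse_loss(pred * face_mask, target * face_mask)   # whole face region
    expr = F.mse_loss(pred * expr_mask, target * expr_mask)   # eyes / mouth detail
    return base + w_face * face + w_expr * expr
```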

Consequently, our method delivers strong performance in controlling the expressions of portraits in a wide range of styles, including real humans, cartoons, sculptures, and even animals.

By leveraging a simple yet effective progressive generation strategy, we extend the model to stable long-term animation, broadening its potential applications.
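One common way to extend a fixed-length video model to long sequences is to generate overlapping windows and reuse the tail of each window as context for the next. The sketch below illustrates that idea; `animate_window` is a hypothetical stand-in for the underlying diffusion animator, and the paper's actual progressive strategy may differ in detail.

```python
# Sketch: windowed long-video generation. Each call produces a fixed-length
# clip; the tail of the previous clip is reused as context so consecutive
# windows stay temporally consistent. `animate_window` is hypothetical.
def animate_long(animate_window, reference_image, landmark_maps,
                 window: int = 16, overlap: int = 4):
    frames = []
    context = None                      # tail frames carried between windows
    step = window - overlap
    for start in range(0, len(landmark_maps), step):
        chunk = landmark_maps[start:start + window]
        if not chunk:
            break
        out = animate_window(reference_image, chunk, context=context)
        # The first `overlap` frames of each later window re-render already
        # generated content, so keep only the new frames.
        frames.extend(out if start == 0 else out[overlap:])
        context = out[-overlap:]
    return frames
```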

To address the lack of benchmarks in this field, we introduce EmojiBench, a comprehensive benchmark that includes a variety of portrait images, driving videos, and landmarks. We conduct extensive evaluations on EmojiBench to validate the superiority of Follow-Your-Emoji.