
Tencent: V-Express - Generating Talking Portrait Videos from Photos

Tencent has open-sourced V-Express, a model that generates talking portrait videos from a portrait photo. It balances control signals of different strengths through a simple series of progressive discarding operations. With this approach, weaker signals gradually take effect, so generation accounts for pose, the input image, and audio at the same time.

In portrait video generation, producing a video from a single image is becoming increasingly common.

A common approach is to pair a generative model with adapters that enable controlled generation. However, the control signals, which include text, audio, reference images, pose, and depth maps, vary in strength.

Weaker conditions are often overshadowed by stronger ones, which makes the conditions hard to balance. In our work on portrait video generation, we found that the audio signal is particularly weak and is often masked by stronger signals such as pose and the original image.

However, training directly with weak signals often leads to convergence difficulties.

To address this issue, we propose V-Express, a simple method that balances different control signals through a series of progressive discarding operations.

Our method gradually brings weak conditions under effective control, so that generation takes pose, the input image, and audio into account simultaneously.
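The idea of progressively discarding stronger signals can be sketched roughly as follows. This is a minimal illustration of condition dropping during training, not the V-Express implementation: the three-stage schedule, the drop probabilities, and all names (DROP_SCHEDULE, drop_conditions, drop_rates) are assumptions made for this example.

```python
import random

# Hypothetical per-stage drop probabilities for each control signal.
# The numbers are illustrative assumptions, not values from V-Express.
DROP_SCHEDULE = [
    {"pose": 0.0, "reference": 0.0, "audio": 0.0},  # stage 0: all signals kept
    {"pose": 0.3, "reference": 0.1, "audio": 0.0},  # stage 1: begin dropping strong signals
    {"pose": 0.5, "reference": 0.2, "audio": 0.0},  # stage 2: drop strong signals more often
]


def drop_conditions(conditions, stage, rng):
    """Return a copy of `conditions` in which each signal is replaced by None
    with the probability listed for it in DROP_SCHEDULE[stage]."""
    probs = DROP_SCHEDULE[stage]
    return {
        name: (None if rng.random() < probs.get(name, 0.0) else value)
        for name, value in conditions.items()
    }


def drop_rates(stage, trials=10_000, seed=0):
    """Empirically estimate how often each signal is dropped in a given stage."""
    rng = random.Random(seed)
    conditions = {"pose": "pose_feat", "reference": "ref_feat", "audio": "audio_feat"}
    counts = {name: 0 for name in conditions}
    for _ in range(trials):
        kept = drop_conditions(conditions, stage, rng)
        for name, value in kept.items():
            counts[name] += value is None
    return {name: count / trials for name, count in counts.items()}
```

In this sketch the audio signal is never dropped, while pose and the reference image are withheld more often in later stages, so a model trained this way would sometimes have to rely on audio alone. How V-Express actually schedules and applies its discarding operations is not specified in this article.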

Experimental results show that our method can effectively generate portrait videos controlled by audio.

Furthermore, our method provides a potential solution for the simultaneous effective use of conditions of varying strengths.