The field of portrait image animation driven by speech audio input has made significant progress in generating realistic and dynamic portraits.
This study addresses the challenge of synchronizing facial movements with the driving audio while producing visually appealing, temporally consistent animations within a diffusion-based framework.
Our approach departs from the traditional reliance on parametric models for intermediate facial representations, adopting an end-to-end diffusion paradigm instead, and introduces a layered audio-driven visual synthesis module that improves alignment between the audio input and the visual output, covering lip, expression, and pose motion.
The proposed network architecture integrates a diffusion-based generative model, a UNet-based denoiser, temporal alignment techniques, and a reference network.
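For concreteness, the following is a minimal PyTorch sketch of how these components might be composed in a single denoising step; the module interfaces, call signatures, and data flow shown here are illustrative assumptions rather than the actual implementation.

```python
import torch.nn as nn

class DenoisingStep(nn.Module):
    """Illustrative composition of the components named above (assumed interfaces)."""

    def __init__(self, unet: nn.Module, reference_net: nn.Module,
                 audio_attn: nn.Module, temporal_attn: nn.Module):
        super().__init__()
        self.unet = unet                    # UNet-based denoiser backbone
        self.reference_net = reference_net  # injects identity/appearance from a reference portrait
        self.audio_attn = audio_attn        # audio-to-visual cross-attention
        self.temporal_attn = temporal_attn  # cross-frame attention for temporal consistency

    def forward(self, noisy_latents, timestep, reference_image, audio_features):
        # Appearance features that condition the denoiser on the subject's identity.
        ref_features = self.reference_net(reference_image)
        # Denoise the latent frames conditioned on the reference features.
        hidden = self.unet(noisy_latents, timestep, context=ref_features)
        # Align the visual features with the driving speech audio.
        hidden = self.audio_attn(hidden, audio_features)
        # Enforce smooth motion across frames before returning the noise prediction.
        return self.temporal_attn(hidden)
```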
The proposed layered audio-driven visual synthesis offers adaptive control over the diversity of expressions and poses, enabling more effective personalization across different identities.
Through a comprehensive evaluation combining qualitative and quantitative analyses, our method demonstrates substantial improvements in image and video quality, lip-sync accuracy, and motion diversity.
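To illustrate the layered audio-driven visual synthesis described above, here is a minimal, self-contained sketch assuming the module can be approximated by separate audio-to-visual cross-attention streams for the lip, expression, and pose levels, blended with adaptive weights; the class name, dimensions, and weighting scheme are hypothetical and not taken from this work.

```python
import torch
import torch.nn as nn

class LayeredAudioVisualAttention(nn.Module):
    """Hypothetical layered cross-attention: one audio-to-visual stream per motion level."""

    def __init__(self, dim: int = 320, num_heads: int = 8):
        super().__init__()
        # Separate cross-attention stream for each motion level.
        self.levels = nn.ModuleDict({
            name: nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for name in ("lip", "expression", "pose")
        })
        # Adaptive per-level weights (assumed learnable here).
        self.level_weights = nn.Parameter(torch.ones(3))

    def forward(self, visual_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N_vis, dim) latent features from the denoiser
        # audio_tokens:  (B, N_aud, dim) features from a speech encoder
        weights = torch.softmax(self.level_weights, dim=0)
        out = visual_tokens
        for w, attn in zip(weights, self.levels.values()):
            attended, _ = attn(visual_tokens, audio_tokens, audio_tokens)
            out = out + w * attended  # residual update, weighted per motion level
        return out

# Example usage with random tensors (shapes are illustrative).
if __name__ == "__main__":
    module = LayeredAudioVisualAttention()
    vis = torch.randn(2, 64, 320)
    aud = torch.randn(2, 32, 320)
    print(module(vis, aud).shape)  # torch.Size([2, 64, 320])
```

In this sketch, the softmax-normalized per-level weights are what would provide the adaptive control over expression and pose diversity; they could equally be exposed as user-set coefficients rather than learned parameters.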