
MimicMotion: Transforming Images into Dancing Videos with Motion Control

Tencent has released MimicMotion, a project that generates dancing videos from a single image. Its results appear noticeably better than Alibaba's comparable work, and it supports both facial expressions and lip-syncing. Beyond dancing videos, it can also be used to create digital humans.


With MimicMotion, Tencent turns a single reference image into a full motion video. The showcased results are significantly better than those of similar projects from Alibaba. Because the model handles facial expressions and lip-syncing at the same time as body motion, the generated videos look more vivid and realistic. Its applications extend well beyond dance clips: the same pipeline can drive digital humans, opening up more possibilities for digital content creation.

Tencent introduces several techniques in this project that improve on Alibaba's approach in a number of respects.

Firstly, they introduced confidence-aware pose guidance. In video generation, temporal continuity of the pose sequence is crucial to output quality. With this mechanism, generation is guided according to the confidence of each detected pose, so that reliable keypoints steer the video strongly while noisy ones are down-weighted, producing results that are more coherent and smooth over time. In a dancing video, for example, each move transitions naturally into the next, without abrupt jumps or stutters.
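The core idea can be sketched in a few lines: weight each pose keypoint's contribution to the guidance signal by its detection confidence, and drop joints the pose estimator is unsure about. This is an illustrative toy, not MimicMotion's actual code; the function name and the single-pixel rasterization are simplifications (the real pipeline renders full DWPose skeletons).

```python
import numpy as np

def confidence_weighted_pose_map(keypoints, confidences, size=(64, 64), threshold=0.3):
    """Rasterize 2D pose keypoints into a guidance map, scaling each
    keypoint's intensity by its detection confidence so unreliable
    joints influence generation less.

    keypoints: (N, 2) array of (x, y) coordinates in [0, 1)
    confidences: (N,) array of detector confidences in [0, 1]
    """
    h, w = size
    pose_map = np.zeros((h, w), dtype=np.float32)
    for (x, y), c in zip(keypoints, confidences):
        if c < threshold:  # skip joints the detector is unsure about
            continue
        px = min(int(x * w), w - 1)
        py = min(int(y * h), h - 1)
        # intensity encodes confidence: strong joints guide more
        pose_map[py, px] = max(pose_map[py, px], c)
    return pose_map
```

Feeding such a confidence-weighted map into the conditioning branch, instead of a binary skeleton, is what lets the generator trust good frames and glide over noisy ones.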

Secondly, they developed confidence-based regional loss amplification. Image distortion and deformation, most visibly in the hands, are common failure modes that severely degrade video quality. By amplifying the training loss in regions the pose estimator marks as high-confidence, the model is penalized more heavily for distorting exactly the areas that matter, so characters and scenes come out clearer and more accurate, noticeably improving the visual quality of the video.
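The mechanism amounts to a weighted per-pixel loss. The sketch below (an illustrative simplification, not the paper's exact formulation) scales an MSE-style diffusion loss inside a confidence mask, e.g. one covering the hand regions:

```python
import numpy as np

def region_amplified_loss(pred, target, confidence_mask, amp=2.0):
    """Per-pixel squared error, scaled up inside high-confidence
    regions so the model is penalized more for distorting them.

    confidence_mask: 1.0 inside the amplified region, 0.0 elsewhere.
    amp: multiplier applied inside the region (1.0 = no amplification).
    """
    per_pixel = (pred - target) ** 2
    # weight is `amp` inside the masked region and 1.0 outside
    weights = 1.0 + (amp - 1.0) * confidence_mask
    return float((weights * per_pixel).mean())
```

With `amp > 1`, gradients from the masked region dominate training, pushing the model to render those areas faithfully without changing the loss elsewhere.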

Lastly, they proposed a progressive fusion strategy. Compute cost is a central constraint in video generation, and for longer videos the key question is how to preserve quality while keeping resource consumption under control. Tencent's progressive fusion strategy addresses this by generating the video in overlapping segments and fusing them, allowing videos of arbitrary length at acceptable computational cost. Whether the target is a clip of a few seconds or a video running several minutes, the strategy completes the task efficiently, making long-form generation practical.
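A minimal sketch of the segment-fusion idea, assuming each segment is a sequence of per-frame latent vectors and consecutive segments share `overlap` frames (names and the linear cross-fade are illustrative, not the paper's exact scheme):

```python
import numpy as np

def progressive_fuse(segments, overlap):
    """Fuse per-segment latent sequences into one long sequence by
    linearly cross-fading the `overlap` frames shared by consecutive
    segments, avoiding visible seams at segment boundaries.

    segments: list of (num_frames, latent_dim) arrays.
    """
    fused = segments[0].astype(np.float32)
    # ramp goes 0 -> 1 across the overlap: start trusts the previous
    # segment, end trusts the next one
    ramp = np.linspace(0.0, 1.0, overlap, dtype=np.float32)[:, None]
    for seg in segments[1:]:
        seg = seg.astype(np.float32)
        tail = fused[-overlap:]   # last frames of what we have so far
        head = seg[:overlap]      # first frames of the new segment
        blended = (1.0 - ramp) * tail + ramp * head
        fused = np.concatenate([fused[:-overlap], blended, seg[overlap:]], axis=0)
    return fused
```

Because each segment is generated at a fixed, bounded length, memory and compute stay constant no matter how long the final video is; only the number of segments grows.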