As video generation models mature, methods for controlling generated videos have become increasingly important.
Research from the Shanghai Artificial Intelligence Laboratory clones the motion of a reference video to control text-to-video generation. The demonstrations show promising results, with no apparent contamination from the style or content of the original video.
During video inversion, temporal attention is employed to represent the motion in the reference video, and a primary temporal-attention guidance is introduced to mitigate the influence of noisy or subtle motion components in the attention weights.
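As a rough illustration only, here is a minimal PyTorch sketch of what such primary temporal-attention guidance could look like: keep only the dominant entries of the reference video's temporal attention map and penalize deviations of the generated video's attention at those positions. The function name, tensor shapes, and top-k selection are assumptions for illustration, not the authors' implementation.

```python
import torch

def primary_temporal_attention_loss(attn_ref, attn_gen, top_k=1):
    """Hypothetical guidance loss.

    attn_ref, attn_gen: temporal attention weights,
    shape (batch, heads, frames, frames).
    """
    # Keep only the top-k strongest positions per query frame, so weak or
    # noisy motion components in the reference do not steer the guidance.
    _, topk_idx = attn_ref.topk(top_k, dim=-1)
    mask = torch.zeros_like(attn_ref).scatter_(-1, topk_idx, 1.0)
    # L2 distance restricted to the primary attention components.
    return ((attn_gen - attn_ref) * mask).pow(2).sum() / mask.sum().clamp(min=1)
```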
A position-aware semantic guidance mechanism is also proposed, which uses the rough foreground location in the reference video together with classifier-free guidance features to steer video generation.
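A hypothetical sketch of how a coarse foreground mask and classifier-free-guidance-style features might be combined into a guidance loss follows; all names, shapes, and the guidance weight `w` are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def position_aware_semantic_loss(feat_cond, feat_uncond, feat_gen, fg_mask, w=7.5):
    """Hypothetical guidance loss.

    feat_cond, feat_uncond, feat_gen: denoiser features,
    shape (batch, channels, frames, h, w).
    fg_mask: coarse foreground mask from the reference video,
    shape (batch, 1, frames, h, w), values in [0, 1].
    """
    # Classifier-free-guidance-style feature: amplify the conditional
    # features relative to the unconditional ones.
    feat_cfg = feat_uncond + w * (feat_cond - feat_uncond)
    # Restrict the semantic alignment to the rough foreground location.
    diff = (feat_gen - feat_cfg) * fg_mask
    return diff.pow(2).sum() / fg_mask.sum().clamp(min=1)
```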