Video generation poses many unique challenges beyond those of image generation. The temporal dimension introduces extensive possible variations across frames, over which consistency and continuity may be violated. In this study, we move beyond evaluating simple actions and argue that generated videos should, like real-world videos, incorporate the emergence of new concepts and transitions in their relations as time progresses.
To assess the temporal compositionality of video generation models, we propose TC-Bench, a benchmark of carefully crafted text prompts, corresponding ground-truth videos, and robust evaluation metrics. The prompts articulate the initial and final states of a scene, which reduces ambiguity in how frames should develop and simplifies the assessment of transition completion. In addition, by collecting real-world videos aligned with the prompts, we extend TC-Bench's applicability from text-conditional models to image-conditional models that can perform generative frame interpolation.
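To make the prompt design concrete, the sketch below shows what one benchmark entry might look like. The field names and the example scene are illustrative assumptions, not the benchmark's actual schema; the point is that each prompt pins down an initial and a final state, and a real reference video accompanies it so image-conditional models can be evaluated as well.

```python
# Hypothetical illustration of a TC-Bench-style entry. Field names and the
# example scene are assumptions for exposition, not the released data format.
example_entry = {
    # Full prompt describing a compositional transition over time.
    "prompt": "A green traffic light turns red as a car approaches the intersection.",
    # Explicit initial and final scene states, which anchor the evaluation
    # of whether the generated video completes the transition.
    "initial_state": "a green traffic light, a car approaching",
    "final_state": "a red traffic light, a car stopped at the intersection",
    # Aligned real-world video (path is illustrative) used as ground truth
    # and as conditioning frames for generative frame interpolation models.
    "reference_video": "videos/traffic_light_0001.mp4",
}
```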
We also develop new metrics to measure the completeness of component transitions in generated videos, which correlate significantly better with human judgments than existing metrics. Our comprehensive experiments show that most video generators accomplish fewer than 20% of the described compositional changes, highlighting enormous room for future improvement. Our analysis indicates that current video generation models struggle to interpret descriptions of compositional changes and to bind different semantics to different temporal steps.
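The paper's metrics are not reproduced here, but a minimal sketch can convey the underlying intuition: early frames should match the prompt's initial state and late frames its final state. The baseline below scores this with off-the-shelf CLIP frame-text similarity (model choice, the number of sampled frames `k`, and the averaging scheme are all assumptions, not the benchmark's actual formulation).

```python
# A minimal sketch (not TC-Bench's exact metric) of a transition-completion
# score: align the first k frames with the initial-state text and the last
# k frames with the final-state text using CLIP. Higher = more complete.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(frame: Image.Image, text: str) -> float:
    """Cosine similarity between one video frame and one state description."""
    inputs = processor(text=[text], images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def transition_completion(frames: list[Image.Image],
                          initial_state: str,
                          final_state: str,
                          k: int = 4) -> float:
    """Average initial-state alignment over the first k frames and
    final-state alignment over the last k frames, then combine them."""
    k = min(k, len(frames))  # guard against very short clips
    head = sum(clip_similarity(f, initial_state) for f in frames[:k]) / k
    tail = sum(clip_similarity(f, final_state) for f in frames[-k:]) / k
    return (head + tail) / 2
```

A frame-level baseline like this is easy to compute but, as the section notes, such existing-metric style scores correlate with human judgment less well than metrics designed specifically around transition completeness.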