Vision-language models (VLMs) have achieved remarkable success across a wide range of multimodal tasks. In practice, however, they are constrained by a limited context window: when handling complex multimodal inputs, the amount of contextual information the model can attend to is narrow, which limits its ability to understand and complete the task.
In addition, high-resolution images and videos carry a large amount of visual information, and processing it demands substantial computational resources. This drives up cost and slows inference, hampering the model's usefulness in practical applications.
Visual compression, which reduces the number of visual tokens, is therefore a natural way to alleviate both problems. Previous studies typically compress visual tokens with an external module and then force large language models (LLMs) to understand the compressed tokens. This approach has a significant drawback: visual information is lost during compression, because the external module may discard key features and details of the original visual input, leaving the LLM with an inaccurate or incomplete view of the image.
More critically, this compression is learned without exploiting the way LLMs themselves understand visual tokens, so existing methods fail to realize the efficiency and performance that such understanding could provide.
To address these issues, we propose VoCo-LLaMA, the first method that uses the LLM itself to compress visual tokens. We introduce visual compression (VoCo) tokens during the visual instruction tuning phase and leverage attention distillation to transfer the LLM's understanding of visual tokens into its processing of the VoCo tokens.
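To make the idea concrete, below is a minimal sketch of one plausible way such distillation can be set up, assuming a causal attention mask over a sequence laid out as [vision tokens | VoCo tokens | text tokens]; the function name and layout are illustrative assumptions, not the paper's exact implementation. The key point is that text tokens are blocked from attending to raw vision tokens, so any visual information they use must flow through the VoCo tokens.

```python
import torch

def voco_attention_mask(n_vis: int, n_voco: int, n_text: int) -> torch.Tensor:
    """Build a boolean attention mask (True = may attend) for a sequence
    laid out as [vision tokens | VoCo tokens | text tokens].

    Illustrative sketch: text tokens cannot attend to the raw vision tokens,
    while VoCo tokens still attend to the full vision sequence, so visual
    understanding is funneled into the VoCo tokens.
    """
    n = n_vis + n_voco + n_text
    # Start from a standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(n, n)).bool()
    # Block text rows (after vision + VoCo) from the vision columns.
    mask[n_vis + n_voco:, :n_vis] = False
    return mask

# Example: 576 vision tokens compressed through a single VoCo token, 32 text tokens.
m = voco_attention_mask(n_vis=576, n_voco=1, n_text=32)
print(m.shape)             # torch.Size([609, 609])
print(m[600, :576].any())  # False: text tokens cannot see raw vision tokens
print(m[576, :576].all())  # True: the VoCo token attends to all vision tokens
```

Under such a mask, the only route from image content to the answer runs through the compressed tokens, which is what encourages the LLM to pack its visual understanding into them.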
VoCo-LLaMA enables effective visual compression and substantially improves computational efficiency at inference time. Our method incurs minimal performance loss at a compression ratio of 576×, while reducing floating-point operations (FLOPs) by 94.8% and accelerating inference by 69.6%. This makes the model markedly faster and more efficient on complex multimodal tasks.
Furthermore, through continual training on time-series sequences of compressed tokens derived from video frames, VoCo-LLaMA demonstrates a strong ability to capture temporal correlations, outperforming previous methods on popular video question-answering benchmarks.
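As a rough illustration of the video setting, the sketch below shows how per-frame compressed tokens might be concatenated in temporal order before being fed to the LLM; the `compress_frame` callable and the averaging stand-in are hypothetical placeholders for a forward pass of the compressing model, not the actual training pipeline.

```python
import torch

def compress_video(frames_vis_tokens: list[torch.Tensor], compress_frame) -> torch.Tensor:
    """Illustrative sketch: compress each frame's vision tokens into a few VoCo
    token embeddings and concatenate them in temporal order, so the LLM only
    consumes the short compressed sequence when reasoning over the video.

    frames_vis_tokens: list of [n_vis, d] tensors, one per frame.
    compress_frame:    callable returning [n_voco, d] embeddings per frame.
    """
    compressed = [compress_frame(f) for f in frames_vis_tokens]  # T x [n_voco, d]
    return torch.cat(compressed, dim=0)                          # [T * n_voco, d]

# Example with random features: 8 frames x 576 vision tokens of width 4096,
# each frame compressed to a single token -> an 8-token temporal sequence.
frames = [torch.randn(576, 4096) for _ in range(8)]
voco_seq = compress_video(frames, compress_frame=lambda f: f.mean(dim=0, keepdim=True))
print(voco_seq.shape)  # torch.Size([8, 4096])
```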
Our method offers a promising path to unlocking the full potential of VLM context windows, enabling more scalable multimodal applications. The project page and code are available at this https URL.