New York University has released Cambrian-1, a highly competitive family of vision-centric multimodal LLMs. With parameter sizes ranging from 8 to 34 billion, the models deliver exceptional performance, matching or outperforming almost all closed-source competitors and showcasing their formidable strength.
During the research, the team adopted a distinctive approach: they used LLMs and visual instruction tuning as an interface to evaluate a wide range of visual representations, yielding new insights into different models and architectures. To make the results more comprehensive and reliable, they ran extensive experiments with more than 20 vision encoders, covering self-supervised, strongly supervised, and hybrid training paradigms.
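To make the "instruction tuning as an interface" idea concrete, the sketch below plugs two candidate vision encoders from different supervision regimes into the same connector that maps patch features into the LLM embedding space. The specific checkpoints, the MLP connector, and the 4096-dimensional LLM width are illustrative assumptions, not the exact Cambrian-1 recipe.

```python
# Hedged sketch: swap candidate vision encoders behind one shared connector so the
# LLM + instruction tuning pipeline can compare them on equal footing.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, Dinov2Model

LLM_HIDDEN = 4096  # assumed token-embedding width of the language model

# Two candidates spanning supervision regimes (language-supervised vs. self-supervised).
CANDIDATES = {
    "clip-vit-l (language-supervised)": (CLIPVisionModel, "openai/clip-vit-large-patch14"),
    "dinov2-l (self-supervised)":       (Dinov2Model, "facebook/dinov2-large"),
}

class Connector(nn.Module):
    """Small MLP that projects patch features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int = LLM_HIDDEN):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)  # (batch, num_patches, llm_dim)

def visual_tokens(model_cls, checkpoint: str, pixel_values: torch.Tensor) -> torch.Tensor:
    """Encode images with one candidate backbone and map them to LLM-ready tokens."""
    encoder = model_cls.from_pretrained(checkpoint)
    with torch.no_grad():
        feats = encoder(pixel_values=pixel_values).last_hidden_state
    return Connector(vision_dim=feats.shape[-1])(feats)

# During instruction tuning, these tokens are prepended to the text embeddings and the
# connector (and optionally the LLM) is trained; downstream benchmark scores then serve
# as the evaluation signal for each candidate encoder.
for name, (cls_, ckpt) in CANDIDATES.items():
    tokens = visual_tokens(cls_, ckpt, torch.randn(1, 3, 224, 224))
    print(name, tuple(tokens.shape))
```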
At the same time, the research team critically reviewed existing MLLM benchmarks. Recognizing how difficult it is to consolidate and interpret results across disparate tasks, they proposed solutions to these issues and, on that basis, introduced a new vision-centric benchmark, CV-Bench. This benchmark is a significant step toward stronger visual grounding, providing a more scientific and reasonable evaluation standard for vision-related research.
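As an illustration, a vision-centric multiple-choice benchmark like CV-Bench is typically scored per task and then overall. The sketch below shows such a scoring loop; the record fields ("task", "answer") and the predict() callable are assumptions for illustration, not the official evaluation harness.

```python
# Hedged sketch of per-task accuracy scoring for a multiple-choice, vision-centric benchmark.
from collections import defaultdict
from typing import Callable, Iterable, Mapping

def score_by_task(records: Iterable[Mapping], predict: Callable[[Mapping], str]) -> dict:
    """Return per-task and overall accuracy for multiple-choice records."""
    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        task = rec["task"]                 # e.g. an object-counting or depth-order task
        total[task] += 1
        if predict(rec) == rec["answer"]:  # answer is assumed to be a choice letter like "A"
            correct[task] += 1
    scores = {t: correct[t] / total[t] for t in total}
    scores["overall"] = sum(correct.values()) / max(sum(total.values()), 1)
    return scores

# Example with dummy records and a trivial predictor:
dummy = [{"task": "count", "answer": "B"}, {"task": "depth order", "answer": "A"}]
print(score_by_task(dummy, predict=lambda rec: "B"))
```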
Furthermore, to integrate visual features with LLMs more effectively, the research team proposed the Spatial Vision Aggregator (SVA), an innovative design: a dynamic, spatially aware connector that fuses high-resolution visual features with the LLM while substantially reducing the number of visual tokens. This integration not only improves the efficiency with which the model processes visual information but also lifts overall performance, offering new ideas and methods for the development of multimodal language models.
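The sketch below captures the spirit of such a spatially aware connector: a small grid of learnable queries, each cross-attending only to its local window of a high-resolution patch-feature map, so many patch tokens are compressed into a few spatially aligned visual tokens. The dimensions, the single-encoder setup, and the module structure are simplifying assumptions, not the exact Cambrian-1 SVA.

```python
# Hedged sketch of a spatially aware aggregator: grid queries attend to local windows,
# reducing N high-resolution patch tokens to grid*grid LLM-ready tokens.
import torch
import torch.nn as nn

class SpatialAggregator(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int, grid: int = 8, num_heads: int = 8):
        super().__init__()
        self.grid = grid
        self.queries = nn.Parameter(torch.randn(grid * grid, llm_dim) * 0.02)
        self.kv_proj = nn.Linear(vision_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: (B, H, W, vision_dim) with H == W divisible by self.grid."""
        b, h, w, _ = feats.shape
        g, k = self.grid, h // self.grid
        kv = self.kv_proj(feats)                                   # (B, H, W, D)
        # Partition the feature map into a g x g grid of k x k local windows.
        kv = kv.view(b, g, k, g, k, -1).permute(0, 1, 3, 2, 4, 5)
        kv = kv.reshape(b * g * g, k * k, -1)                      # one window per query
        q = self.queries.unsqueeze(0).expand(b, -1, -1).reshape(b * g * g, 1, -1)
        out, _ = self.attn(q, kv, kv)                              # (B*g*g, 1, D)
        return out.view(b, g * g, -1)                              # (B, grid*grid, llm_dim)

# Example: reduce a 32x32 patch grid (1024 tokens) to 64 spatially aligned tokens.
agg = SpatialAggregator(vision_dim=1024, llm_dim=4096, grid=8)
tokens = agg(torch.randn(2, 32, 32, 1024))
print(tokens.shape)  # torch.Size([2, 64, 4096])
```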