
Cambrian-1: Vision Multimodal LLM

New York University has released Cambrian-1, a vision-centric multimodal LLM that, at parameter sizes ranging from 8B to 34B, outperforms almost all other closed-source competitors. The research uses LLMs and visual instruction tuning as an interface to evaluate a variety of visual representations, drawing new insights for different models and architectures from experiments on more than 20 vision encoders spanning self-supervised, strongly supervised, and hybrid training. The team also critically reviews existing MLLM benchmarks, addresses the difficulty of consolidating and interpreting results across tasks, and introduces a new vision-centric benchmark, CV-Bench, aimed at strengthening the visual foundation of these models. Finally, they propose the Spatial Vision Aggregator (SVA), a dynamic, spatially aware connector that integrates high-resolution visual features with LLMs while reducing the number of visual tokens.


New York University has released Cambrian-1, a highly competitive vision multimodal LLM. The model outperforms almost all other closed-source competitors at parameter sizes ranging from 8 to 34 billion.

The research team took a distinctive approach: they used LLMs and visual instruction tuning as an interface to evaluate various visual representations, which yielded new insights into different models and architectures. To make the comparison comprehensive, they ran extensive experiments on more than 20 vision encoders, covering self-supervised, strongly supervised, and hybrid training regimes.
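As a rough illustration of this evaluation protocol, the sketch below plugs an arbitrary vision encoder into a fixed encoder-projector-LLM pipeline, so that different representations can be compared under the same instruction-tuning setup. The class name, projector design, and encoder interface here are illustrative assumptions, not the Cambrian-1 code.

```python
# Minimal sketch (assumption): comparing visual representations by plugging
# different vision encoders into one fixed encoder -> projector -> LLM pipeline.
import torch
import torch.nn as nn

class VisionLanguageProbe(nn.Module):
    """Wraps an arbitrary vision encoder so its tokens can feed an LLM.

    The encoder stays frozen and only the projector is tuned, so differences
    in downstream scores can be attributed to the visual representation itself.
    """

    def __init__(self, vision_encoder: nn.Module, vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder           # e.g. CLIP, DINOv2, MAE, ...
        self.projector = nn.Sequential(                # maps vision tokens into LLM space
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                          # keep the encoder frozen for the probe
            vision_tokens = self.vision_encoder(images)   # assumed shape (B, N, vision_dim)
        return self.projector(vision_tokens)          # (B, N, llm_dim) tokens for the LLM
```

Training one such probe per encoder on the same instruction-tuning data and then comparing benchmark scores is one simple way to isolate the effect of the visual representation.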

At the same time, the team conducted a critical review of existing MLLM benchmarks, identifying the difficulty of consolidating and interpreting results across heterogeneous tasks. On that basis, they introduced a new vision-centric benchmark, CV-Bench, which is intended to strengthen the visual foundation of multimodal models and to provide a more principled evaluation standard for vision-centric capabilities.
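For context, a vision-centric benchmark of this kind is typically consumed as multiple-choice visual question answering: each sample pairs an image with a question and candidate answers, and accuracy is the headline metric. The loop below is a hypothetical harness; the dataset identifier, field names, and the `answer_question` callable are assumptions for illustration, not the official CV-Bench tooling.

```python
# Hypothetical CV-Bench-style evaluation loop (dataset id and field names are assumptions).
from datasets import load_dataset

def evaluate(answer_question, dataset_id="nyu-visionx/CV-Bench", split="test"):
    """answer_question(image, prompt) -> str, e.g. a thin wrapper around an MLLM."""
    dataset = load_dataset(dataset_id, split=split)
    correct = 0
    for sample in dataset:
        # Build a multiple-choice prompt from the question and candidate answers.
        choices = " ".join(f"({letter}) {text}"
                           for letter, text in zip("ABCD", sample["choices"]))
        prompt = f"{sample['question']}\nChoices: {choices}\nAnswer with the letter only."
        prediction = answer_question(sample["image"], prompt).strip()
        if prediction.startswith(sample["answer"]):    # assumes ground truth stored as a letter
            correct += 1
    return correct / len(dataset)
```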

Furthermore, to better integrate visual features with LLMs, the team proposed the Spatial Vision Aggregator (SVA), a dynamic connector with spatial awareness. It fuses high-resolution visual features with the LLM while reducing the number of visual tokens, improving the efficiency of visual processing without sacrificing overall model performance, and it offers a new direction for connector design in multimodal language models.
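To make the token-reduction idea concrete, here is a minimal, hypothetical sketch of a spatially aware connector: a grid of learnable queries, each of which cross-attends only to its local window of a high-resolution feature map, so an H x W map collapses to grid x grid visual tokens. This illustrates the general mechanism only and is not the actual Cambrian-1 SVA implementation.

```python
# Minimal sketch of a spatially aware aggregator in the spirit of SVA
# (assumption: learnable per-window queries, not the Cambrian-1 code).
import torch
import torch.nn as nn

class SpatialAggregator(nn.Module):
    def __init__(self, dim: int, grid: int = 8):
        super().__init__()
        self.grid = grid                                              # output is grid x grid tokens
        self.queries = nn.Parameter(torch.randn(grid * grid, dim))    # one learnable query per cell
        self.to_kv = nn.Linear(dim, dim * 2)
        self.scale = dim ** -0.5

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: (B, H, W, D) high-resolution vision features -> (B, grid*grid, D)."""
        B, H, W, D = feats.shape
        g = self.grid
        assert H % g == 0 and W % g == 0, "feature map must be divisible by the query grid"
        # Partition the feature map into g x g non-overlapping windows.
        wins = feats.view(B, g, H // g, g, W // g, D)
        wins = wins.permute(0, 1, 3, 2, 4, 5).reshape(B, g * g, (H // g) * (W // g), D)
        k, v = self.to_kv(wins).chunk(2, dim=-1)          # keys/values for each window
        q = self.queries.unsqueeze(0).unsqueeze(2)        # (1, g*g, 1, D), broadcast over batch
        attn = torch.softmax((q * self.scale) @ k.transpose(-2, -1), dim=-1)
        out = (attn @ v).squeeze(2)                       # each query summarizes its own window
        return out                                        # (B, g*g, D) compact visual tokens
```

For example, with `grid=8` a 24 x 24 feature map (576 tokens) is reduced to 64 tokens before it reaches the LLM, which is the kind of compression that keeps high-resolution inputs affordable.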