Shanghai AI Lab has open-sourced InternLM-XComposer-2.5, a powerful multimodal LLM. It supports ultra-high-resolution image understanding, fine-grained video understanding, and multi-turn multi-image dialogue, and it has been specifically optimized for webpage creation and for composing interleaved text-image articles.
Long-context processing: IXC-2.5 natively supports a context of up to 24K interleaved image-text tokens, expandable to 96K, so it can handle extremely long text and image inputs.
Diversified visual capabilities: it supports ultra-high-resolution image understanding, fine-grained video understanding, and multi-turn multi-image dialogue (a minimal usage sketch is included after this list).
Other features: it can generate webpages and high-quality interleaved text-image articles.
Model architecture: it combines a lightweight vision encoder, a large language model, and Partial LoRA for aligning the visual tokens with the LLM (a conceptual sketch follows the list).
Benchmark results: across 28 benchmarks, it surpassed existing open-source models on 16 of them, and matched or came close to GPT-4V and Gemini Pro on 16 key tasks.
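For readers who want to try the multi-image dialogue capability, the sketch below shows how a query with several images might be run through Hugging Face Transformers. It is a minimal sketch, assuming the published model ID internlm/internlm-xcomposer2d5-7b, a CUDA GPU, and a chat()-style method exposed by the repository's remote code; the exact argument names may differ, so consult the official model card. The image paths are hypothetical.

```python
# Minimal inference sketch for InternLM-XComposer-2.5.
# Assumptions: model ID 'internlm/internlm-xcomposer2d5-7b', a CUDA GPU,
# and the chat() interface provided by the repo's remote code -- check the
# official model card for the exact signature.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "internlm/internlm-xcomposer2d5-7b"

torch.set_grad_enabled(False)
model = AutoModel.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model.tokenizer = tokenizer  # the remote code expects the tokenizer on the model

# Multi-image dialogue: pass one or more local image paths with the query.
query = "Compare the two images and describe the differences in detail."
images = ["./photo_a.png", "./photo_b.png"]  # hypothetical paths
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    response, history = model.chat(
        tokenizer, query, images, do_sample=False, num_beams=3
    )
print(response)

# Webpage and text-image article generation are exposed through dedicated
# entry points in the same repository; see its README for those helpers.
```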
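The Partial LoRA alignment mentioned above can be pictured as a low-rank update that is applied only to the positions occupied by visual tokens, while text tokens continue to pass through the frozen LLM weights. The snippet below is a conceptual sketch of that idea, not the official implementation; the class name, rank, and mask convention are illustrative.

```python
# Conceptual sketch of Partial LoRA: a low-rank update added to a frozen
# linear layer only at image-token positions; text tokens keep the original
# pretrained projection.
import torch
import torch.nn as nn

class PartialLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base                    # frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_features); image_mask: (batch, seq_len) bool,
        # True where the token comes from the vision encoder.
        out = self.base(x)
        delta = self.lora_b(self.lora_a(x)) * self.scale
        return out + delta * image_mask.unsqueeze(-1).to(delta.dtype)

# Example: 4 text tokens followed by 4 image tokens in an 8-token sequence.
layer = PartialLoRALinear(nn.Linear(32, 32))
hidden = torch.randn(1, 8, 32)
mask = torch.tensor([[False] * 4 + [True] * 4])
print(layer(hidden, mask).shape)  # torch.Size([1, 8, 32])
```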