
Meta: Running Large Language Models (LLMs) with Less Than One Billion Parameters on Mobile Devices

Meta's new paper examines how to run Large Language Models (LLMs) with less than one billion parameters on mobile devices. The authors propose a series of methods that significantly improve model quality while keeping the model small enough for on-device use. Key points of the paper include: for small models, depth matters more than width, favoring a "deep and narrow" architectural design; techniques such as embedding sharing and grouped query attention improve parameter utilization; and a weight-sharing scheme between adjacent blocks further boosts performance without increasing model size.


Meta has released a new paper focusing on how to efficiently run Large Language Models (LLMs) with less than one billion parameters on mobile devices. As mobile devices become ever more central to daily life, running capable language models directly on them has become a pressing problem. In this paper, Meta's research team proposes a series of methods that significantly improve model performance while keeping the model small enough for on-device deployment, opening new possibilities for language applications on mobile devices.

The core points of the paper cover several important aspects:

Firstly, in terms of model architectural design, the research found that for small models depth is more important than width, so the team adopted a "deep and narrow" design. This departs from the traditional emphasis on widening the model, i.e., increasing the hidden dimension or the number of attention heads. Meta's experiments showed that for LLMs with less than one billion parameters, spending the parameter budget on additional layers uses the limited capacity more effectively: a "deep and narrow" architecture lets the model better capture the hierarchical structure and semantic relationships in language, improving performance across natural language tasks. A rough parameter-budget comparison is sketched below.
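As an illustration of this trade-off, the sketch below compares approximate parameter counts of a wide-and-shallow configuration against a deep-and-narrow one at a roughly similar budget. The layer counts, dimensions, and vocabulary size here are illustrative assumptions, not configurations taken from the paper.

```python
# Rough parameter-count comparison of a "wide and shallow" versus a "deep and narrow"
# decoder-only transformer. All numbers are illustrative assumptions.

def transformer_params(n_layers: int, d_model: int, d_ff: int, vocab_size: int) -> int:
    """Approximate parameter count, ignoring biases and normalization layers."""
    attn = 4 * d_model * d_model      # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff          # feed-forward up- and down-projections
    embedding = vocab_size * d_model  # vocabulary embedding (stored once if tied)
    return n_layers * (attn + ffn) + embedding

wide_shallow = transformer_params(n_layers=12, d_model=768, d_ff=3072, vocab_size=32000)
deep_narrow = transformer_params(n_layers=32, d_model=512, d_ff=1536, vocab_size=32000)

print(f"wide & shallow (12 layers, d=768): {wide_shallow / 1e6:.1f}M parameters")
print(f"deep & narrow  (32 layers, d=512): {deep_narrow / 1e6:.1f}M parameters")
```

At a comparable sub-billion budget, the deep-and-narrow variant trades hidden width for many more layers, which is the kind of reallocation the paper argues small models benefit from.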

Secondly, the paper uses techniques such as embedding sharing and grouped query attention to improve parameter utilization. Embedding sharing ties the input embedding matrix to the output projection (weight tying), so the large vocabulary matrix is stored only once and the saved parameters can be reallocated to the rest of the model. Grouped query attention is an optimization of standard multi-head attention: several query heads share the same key and value heads, which shrinks the key/value projection weights and the memory needed for the KV cache while keeping accuracy close to full multi-head attention. This matters especially on mobile devices, where memory and compute are scarce, because it lets a small model spend its limited parameters on the information that matters most. A minimal sketch of both ideas follows.
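The PyTorch sketch below shows the two ideas in isolation: an attention module whose query heads share a reduced set of key/value heads, and a tiny language model whose output projection reuses the input embedding matrix. The module names, dimensions, and head counts are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of grouped query attention (GQA) and input/output embedding sharing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        # Fewer key/value heads than query heads -> fewer parameters, smaller KV cache.
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of query heads attends with the same shared key/value head.
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int = 32000, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.attn = GroupedQueryAttention(d_model, n_heads=8, n_kv_heads=2)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # embedding sharing (weight tying)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.lm_head(self.attn(self.embed(tokens)))

tokens = torch.randint(0, 32000, (1, 16))
logits = TinyLM()(tokens)  # shape: (1, 16, 32000)
```

With eight query heads sharing two key/value heads, the key/value projections are a quarter of their usual size, and tying the embedding means the 32,000-row vocabulary matrix is stored only once.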

Lastly, the research team also proposed a method for weight sharing between adjacent blocks, a key innovation for further improving performance without increasing model size. In a conventional transformer, every block has its own independent weights, so each added block increases the parameter count. With adjacent-block weight sharing, neighboring blocks reuse the same weights, meaning the same block is effectively applied twice in succession: the model computes at a greater depth and can learn richer language patterns while the stored parameter count stays the same. Because the reused weights are already resident in fast memory, the repeated computation adds little extra weight movement on mobile hardware, which helps the model improve its understanding and processing of language on device. A minimal sketch of this idea follows.
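The sketch below shows one way such sharing can be expressed: only the stored blocks carry distinct weights, and each block is applied twice back to back, so effective depth doubles while parameters stay fixed. The placeholder Block module and the specific layer counts are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of weight sharing between adjacent blocks: each stored block is
# executed twice in a row, doubling effective depth without adding parameters.
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in transformer block (attention + feed-forward would go here)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ff(self.norm(x))

class SharedDepthStack(nn.Module):
    def __init__(self, d_model: int, n_unique_blocks: int, repeats: int = 2):
        super().__init__()
        # Only n_unique_blocks sets of weights are stored...
        self.blocks = nn.ModuleList(Block(d_model) for _ in range(n_unique_blocks))
        self.repeats = repeats

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ...but each block is applied `repeats` times in succession, so the
        # effective depth is n_unique_blocks * repeats.
        for block in self.blocks:
            for _ in range(self.repeats):
                x = block(x)
        return x

stack = SharedDepthStack(d_model=512, n_unique_blocks=15, repeats=2)  # 30 effective layers
n_params = sum(p.numel() for p in stack.parameters())
print(f"{n_params / 1e6:.1f}M parameters for 30 effective layers")
```

The design choice is that the extra depth comes from recomputation rather than extra weights, which is why model size on disk and in memory does not grow.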