
Meta Delivers on Promise with the Release of Llama 3.1 405B Model

Meta officially released the Llama 3.1 family last week, as promised, comprising 8B, 70B, and 405B models, consistent with the earlier leaks. It offers improved reasoning capabilities, a larger 128K-token context window, and enhanced support for eight languages, among other improvements. The 405B model competes with leading closed-source models across a variety of tasks. The license has also been updated to allow developers to use the output of Llama models, including the 405B, to improve other models. The 405B matters enough to Meta that Mark Zuckerberg published a statement laying out the advantages of Meta's open-source approach. He argues that open-source AI (like Llama 3.1) is the right path for the future: it promotes broader application and innovation of AI technology, while also helping Meta maintain its technological lead and the sustainability of its business model.

Until now, open-source large language models have mostly lagged behind closed models in capability and performance. Now, we are ushering in a new era led by open source. We have publicly released Meta Llama 3.1 405B, which we believe is the largest and most capable openly available foundation model in the world. To date, cumulative downloads across all Llama versions have exceeded 300 million, and this is just the beginning.

Introducing Llama 3.1

Llama 3.1 405B is the first openly available model that rivals top AI models in advanced capabilities such as general knowledge, steerability, mathematics, tool use, and multilingual translation. With the release of the 405B model, we are poised to supercharge innovation and provide unprecedented opportunities for growth and exploration. We believe the latest generation of Llama will inspire new applications and modeling paradigms, including synthetic data generation to improve and train smaller models, as well as model distillation, a capability never before realized at this scale in open source.

As part of this release, we have published upgraded versions of the 8B and 70B models. These models support multiple languages, have a significantly increased context length of up to 128K, offer state-of-the-art tool use, and have stronger overall reasoning capabilities. This allows our latest models to support advanced use cases such as long-form text summarization, multilingual conversational agents, and coding assistants. We have also changed the license to allow developers to use the output of Llama models (including the 405B) to improve other models. In fulfillment of our commitment to open source, starting today, we are making these models available to the community for download on llama.meta.com and Hugging Face, ready for immediate development on our broad ecosystem of partner platforms.
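A 128K-token context window still has to be budgeted for in long-form use cases such as document summarization. As a rough sketch, assuming a ~4-characters-per-token heuristic (an approximation, not Llama's actual tokenizer), input can be chunked so each request leaves room for the model's reply:

```python
# Rough sketch: splitting long text to fit a 128K-token context window.
# CHARS_PER_TOKEN = 4 is a common heuristic, not Llama's real tokenizer;
# production code should measure length with the actual tokenizer.
CONTEXT_TOKENS = 128_000
CHARS_PER_TOKEN = 4  # heuristic assumption

def chunk_for_context(text: str, reserve_tokens: int = 4_000) -> list[str]:
    """Split `text` into chunks that leave `reserve_tokens` for the reply."""
    max_chars = (CONTEXT_TOKENS - reserve_tokens) * CHARS_PER_TOKEN
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

chunks = chunk_for_context("a" * 1_000_000)  # ~1M chars -> 3 chunks
```

Each chunk can then be summarized independently, with the partial summaries combined in a final pass.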

Model Evaluation

For this release, we evaluated performance on more than 150 benchmark datasets spanning multiple languages. In addition, we conducted extensive human evaluations comparing Llama 3.1 with competing models in real-world scenarios. Our experimental evaluations indicate that our flagship model is on par with leading foundation models, including GPT-4, GPT-4o, and Claude 3.5 Sonnet, across a range of tasks. Moreover, our smaller models are competitive with closed and open models of similar parameter counts.
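At its core, a benchmark evaluation of this kind is a loop over labeled examples with a scoring rule. The sketch below shows the shape of an exact-match evaluation harness; the model here is a stub callable (an assumption for illustration), where a real harness would wrap a served Llama 3.1 endpoint:

```python
# Minimal sketch of a benchmark-style evaluation loop (exact match).
def evaluate(model, dataset):
    """Return exact-match accuracy of `model` over (prompt, answer) pairs."""
    correct = sum(1 for prompt, answer in dataset
                  if model(prompt).strip() == answer.strip())
    return correct / len(dataset)

# Toy stub standing in for a real model call (illustrative only).
stub = lambda prompt: "4" if "2+2" in prompt else "?"
dataset = [("What is 2+2?", "4"), ("Capital of France?", "Paris")]
accuracy = evaluate(stub, dataset)
```

Real benchmark suites replace exact match with task-specific scorers (multiple-choice extraction, code execution, LLM-as-judge), but the harness structure is the same.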

Model Architecture

As our largest model to date, training Llama 3.1 405B on over 15 trillion tokens was a significant challenge. To train at this scale and achieve results in a reasonable time, we significantly optimized the entire training stack and scaled training to over 16,000 H100 GPUs, making the 405B the first Llama model trained at this scale.

To address this, we made design choices that focused on maintaining the scalability and simplicity of the model development process.

We opted for a standard decoder-only transformer architecture with minor modifications, rather than a mixture-of-experts model, to maximize training stability.

We employed an iterative post-training procedure, with each round using supervised fine-tuning and direct preference optimization. This allowed us to create the highest-quality synthetic data for each round and improve performance across each capability.
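Direct preference optimization (DPO) trains on preference pairs: for each prompt, a chosen and a rejected response, scored under both the policy being trained and a frozen reference model. The per-example loss below is the standard DPO objective; the log-probabilities and beta value are toy inputs for illustration, not values from Meta's pipeline:

```python
import math

def dpo_loss(logp_chosen_pol, logp_rejected_pol,
             logp_chosen_ref, logp_rejected_ref, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    The loss shrinks as the policy prefers the chosen response over the
    rejected one by more than the reference model does.
    """
    margin = ((logp_chosen_pol - logp_rejected_pol)
              - (logp_chosen_ref - logp_rejected_ref))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)          # no preference learned yet
improved = dpo_loss(-1.0, -2.0, -1.5, -1.5)     # policy now prefers chosen
```

When the policy's margin beats the reference's, the loss drops below the neutral value of log 2.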

Compared to previous Llama versions, we increased both the quantity and quality of the data used for pre-training and post-training. These improvements include more careful pre-processing and curation pipelines for pre-training data, and stricter quality assurance and filtering methods for post-training data.

As expected from scaling laws for language models, our new flagship model outperforms smaller models trained with the same procedure. We also used the 405B-parameter model to improve the post-training quality of the smaller models.
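One standard way a large model improves a smaller one is knowledge distillation: the student is trained to match the teacher's output distribution, typically by minimizing the KL divergence between the two at a softened temperature. The sketch below shows that objective on toy logits; the numbers are illustrative, and Meta's actual recipe is not specified here:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature softens the distribution."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far the student distribution q is from the teacher p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token logits (assumptions for illustration).
teacher = softmax([3.0, 1.0, 0.2], temperature=2.0)  # e.g. from the large model
student = softmax([2.0, 1.5, 0.5], temperature=2.0)
distill_loss = kl_divergence(teacher, student)
```

Training drives this loss toward zero, at which point the student reproduces the teacher's token distribution.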

To support large-scale production inference for a model of the 405B's size, we quantized the weights from 16-bit (BF16) to 8-bit (FP8) numbers, effectively halving the memory footprint and allowing the model to run within a single server node.
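The arithmetic behind "a single server node" is worth making explicit. Assuming a typical node of 8 H100 GPUs at 80 GB each (an assumption about the deployment hardware, not stated in the text), the weights alone do not fit at BF16 but do at FP8:

```python
# Back-of-envelope check for why FP8 weights fit on one server node.
PARAMS = 405e9          # 405B parameters
NODE_MEMORY_GB = 8 * 80  # assumed node: 8 GPUs x 80 GB = 640 GB

bf16_gb = PARAMS * 2 / 1e9  # 2 bytes per parameter -> 810 GB
fp8_gb = PARAMS * 1 / 1e9   # 1 byte per parameter  -> 405 GB

assert bf16_gb > NODE_MEMORY_GB  # BF16 weights alone exceed one node
assert fp8_gb < NODE_MEMORY_GB   # FP8 weights fit, leaving room for KV cache
```

The ~235 GB of headroom under FP8 is what makes room for activations and the KV cache during serving.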

Instructions and Chat Fine-tuning

With Llama 3.1 405B, we strove to improve the model's helpfulness, response quality, and detailed instruction-following, while maintaining a high level of safety. Our biggest challenges were supporting more capabilities, the 128K context window, and the larger model size.

In post-training, we produced the final chat model through several rounds of alignment on top of the pre-trained model. Each round involved supervised fine-tuning (SFT), rejection sampling (RS), and direct preference optimization (DPO). We used synthetic data generation to create the vast majority of SFT examples, iterating multiple times to produce increasingly high-quality synthetic data covering all capabilities. Additionally, we invested in a variety of data-processing techniques to filter this synthetic data down to the highest quality, which allowed us to scale the amount of fine-tuning data across capabilities.
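The rejection-sampling step can be sketched as: draw several candidate responses for a prompt and keep only the one a scorer ranks highest. Both the generator and the scorer below are deterministic stubs (assumptions for illustration); the real pipeline samples from the model itself and scores with a learned reward model:

```python
def rejection_sample(prompt, generate, score, n=4):
    """Generate n candidate responses and keep the highest-scoring one."""
    candidates = [generate(prompt, i) for i in range(n)]
    return max(candidates, key=score)

# Stub generator and scorer (illustrative stand-ins, not the real pipeline).
gen = lambda prompt, i: f"{prompt}: draft {'!' * i}"
score = len  # stand-in for a learned reward model
best = rejection_sample("Summarize", gen, score)
```

The surviving candidates become training data for the next SFT round, which is what makes the procedure iterative.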

We carefully balanced the data to produce a high-quality model on all features. For example, even with the expansion to 128K context, our model maintains quality on short-context benchmarks. Similarly, even as we added safety mitigations, our model continues to provide the most helpful answers.

Llama System

The Llama models have always been designed as part of an overall system that can coordinate multiple components, including calling external tools. Our vision goes beyond the foundational model, allowing developers to access a broader system that allows them to flexibly design and create custom products that align with their vision. This idea began last year when we first introduced components beyond the core LLM.

To continue our commitment to developing AI responsibly beyond the model layer, and to help others do the same, we have released a complete reference system that includes several example applications and new components such as Llama Guard 3 (a multilingual safety model) and Prompt Guard (a prompt-injection filter). These example applications are open source, and the community can build upon them.
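The pattern these components enable is a guarded model call: classify the input before it reaches the model and the output before it reaches the user. The classifiers below are keyword stubs (assumptions for illustration); Llama Guard 3 and Prompt Guard are themselves models, not keyword lists:

```python
# Sketch of layering input/output safety checks around a model call,
# in the spirit of Llama Guard / Prompt Guard. Keyword matching is a toy
# stand-in for the real classifier models.
BLOCKLIST = ("ignore previous instructions",)  # toy prompt-injection signal

def is_unsafe(text: str) -> bool:
    return any(phrase in text.lower() for phrase in BLOCKLIST)

def guarded_chat(model, user_input: str) -> str:
    if is_unsafe(user_input):
        return "[blocked: input flagged by safety filter]"
    reply = model(user_input)
    if is_unsafe(reply):
        return "[blocked: output flagged by safety filter]"
    return reply

echo = lambda text: f"echo: {text}"  # stub model for demonstration
```

Checking both directions matters: an injection can arrive in the prompt, and unsafe content can also be elicited in the response.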

Component implementations of the Llama System vision are still fragmented. This is why we have started collaborating with industry, startups, and the broader community to help better define the interfaces of these components. To support this, we have published a request for comment on GitHub for what we call the "Llama Stack": a set of standardized, opinionated interfaces for building canonical toolchain components (fine-tuning, synthetic data generation) and agentic applications. We hope these interfaces will be adopted throughout the ecosystem, making interoperability easier to achieve.
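In code terms, a standardized interface means any backend can be swapped in behind a common surface. The sketch below is hypothetical: the actual proposals live in the Llama Stack RFC on GitHub, and the class and method names here are illustrative assumptions, not the RFC's API:

```python
# Hypothetical sketch of a standardized toolchain interface; the names are
# illustrative assumptions, not the actual Llama Stack RFC definitions.
from abc import ABC, abstractmethod

class InferenceProvider(ABC):
    """A common surface that any inference backend could implement."""

    @abstractmethod
    def complete(self, prompt: str, max_tokens: int = 256) -> str:
        """Return a completion for `prompt`."""

class LocalStub(InferenceProvider):
    """Toy backend; real implementations would wrap a served model."""

    def complete(self, prompt: str, max_tokens: int = 256) -> str:
        return f"[stub completion for: {prompt[:20]}]"

provider: InferenceProvider = LocalStub()
```

Applications written against `InferenceProvider` would not need to change when the backend moves from a local server to a cloud platform, which is the interoperability the text describes.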

We welcome feedback and proposals for improvement methods. We are excited to grow the ecosystem around Llama and lower the barriers for developers and platform providers.

Open-Source Drives Innovation

Unlike closed models, the weights of the Llama models are available for download. Developers can fully customize the models for their needs and applications, train them on new datasets, and perform additional fine-tuning. This enables the broader developer community and the world at large to realize the full power of generative AI. Developers can fully customize their applications and run them in any environment: on-premises, in the cloud, or even locally on a laptop, all without sharing data with Meta.

Although many assume that closed models are more cost-effective, testing by Artificial Analysis found the per-token cost of the Llama models to be the lowest in the industry. As Mark Zuckerberg said, open source will ensure that more people around the world can enjoy the benefits and opportunities of AI, that power will not be concentrated in the hands of a few, and that the technology can be deployed more evenly and safely throughout society. This is why we continue taking steps to make open-source AI the industry standard.

We have seen the community build amazing things with past Llama models, including an AI study companion built with Llama and deployed in WhatsApp and Messenger, LLMs tailored to the medical field to help guide clinical decision-making, and a healthcare nonprofit startup in Brazil that helps healthcare systems organize and communicate patients' hospital information in a data-secure way. With the power of open source, we can't wait to see what the community builds with our latest model.

Building with Llama 3.1 405B

For most developers, using a model at the 405B scale is a challenge. While it is a very powerful model, we recognize that working with it demands substantial computational resources and expertise. We have listened to the community, and we understand that generative AI development is about more than just prompting models. We want to enable everyone to get the most out of the 405B, including:

Real-time and batch inference
Supervised fine-tuning
Evaluating your model for your specific application
Continual pre-training
Retrieval