
Meta Releases Four Open Source Models

Today, Meta FAIR is publicly releasing several new research artifacts. We hope the research community will use them to innovate, explore, and discover new ways to apply AI at scale. This work is grounded in our core principles of openness, collaboration, excellence, and scale. We believe that access to state-of-the-art artificial intelligence creates opportunities for everyone, which is why we are committed to the continued growth of an open AI ecosystem.


For over a decade, Meta's Fundamental Artificial Intelligence Research (FAIR) team has been dedicated to advancing artificial intelligence through open research. As innovation in this field continues to move quickly, we believe that collaboration with the global AI community is more important than ever. Maintaining an open scientific approach and sharing our work with the community helps us stay true to our goal of building AI systems that work well for everyone and bring the world closer together.

Today, we are excited to share some of the latest FAIR research models with the global community. We are publicly releasing six research artifacts, centered on the core themes of our work: innovation, creativity, efficiency, and responsibility. These releases include image-to-text and text-to-music generation models, a multi-token prediction model, and a technique for detecting AI-generated speech. By sharing our early research work publicly, we hope to inspire iteration and, ultimately, help advance AI responsibly. We can't wait to see what the community builds with these latest releases and to continue the important conversations we are having with the open source community.

Meta Chameleon

As we shared in our research paper last month, Meta Chameleon is a family of models that can take any combination of text and images as input and produce any combination of text and images as output, using a single unified architecture for both encoding and decoding. Whereas most current late-fusion models rely on diffusion-based learning, Meta Chameleon tokenizes both text and images. This enables a more unified approach and makes the models easier to design, maintain, and scale. The possibilities are wide open: imagine generating creative captions for images, or using a mix of text prompts and images to create entirely new scenes.
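For intuition, here is a minimal sketch of the early-fusion idea described above. It is our illustration only, not the released Chameleon code, and the special token names are hypothetical.

```python
# Rough sketch of early fusion via tokenization (illustrative, not Meta's code):
# both modalities are mapped to discrete tokens and interleaved into a single
# sequence that one autoregressive transformer models end to end.
import torch

def build_mixed_modal_sequence(text_ids, image_ids, boi_id, eoi_id):
    """text_ids, image_ids: 1-D LongTensors of discrete token ids.
    boi_id / eoi_id: hypothetical special tokens marking the image span."""
    return torch.cat([
        text_ids,                        # ordinary text tokens
        torch.tensor([boi_id]),          # begin-of-image marker
        image_ids,                       # tokens from a learned image tokenizer
        torch.tensor([eoi_id]),          # end-of-image marker
    ])
```

The combined sequence can then be trained with an ordinary next-token objective, which is what makes a single unified architecture easier to design and scale than stitching together separate text and image pipelines.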

Today, we are publicly releasing key components of the Chameleon 7B and 34B models under a research-only license. The models we are releasing today have been safety tuned and support mixed-modal input with text-only output, for research use. While we have taken steps to develop these models responsibly, we recognize that risks remain, so we are not releasing the Chameleon image generation model at this time. With the models we are sharing today, we hope to encourage the research community to design new detection and mitigation strategies that help generative modeling research scale responsibly.

Multi-Token Prediction

Most modern language models are trained with a simple objective: predict the next word. While this approach is straightforward and scalable, it is also highly inefficient: it requires several orders of magnitude more text than children need to reach the same level of language fluency.

In April, we proposed a new approach for building better and faster LLMs through multi-token prediction. With this approach, we train language models to predict several future words at once, rather than one word at a time as before. This improves model capability and training efficiency while also increasing speed. In the spirit of responsible open science, we are releasing pre-trained models for code completion under a non-commercial, research-only license, so the research community can independently study our method and the behavior of the trained models.
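To make the idea concrete, here is a minimal sketch of a multi-token prediction loss. It is an illustration under our own assumptions (module names, number of heads), not the released architecture: a shared trunk produces hidden states, and several independent heads each predict the token one, two, or more positions ahead.

```python
# Illustrative sketch of multi-token prediction (not Meta's released code).
# A shared trunk produces a hidden state per position; n_future independent
# heads each predict the token 1, 2, ..., n_future steps ahead.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHead(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, n_future: int = 4):
        super().__init__()
        self.n_future = n_future
        # One output head per future offset (hypothetical layout).
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size, bias=False) for _ in range(n_future)
        )

    def forward(self, hidden, targets):
        # hidden:  (batch, seq_len, d_model) from any causal trunk
        # targets: (batch, seq_len) token ids
        loss = 0.0
        for k, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-k, :])   # predict the token at offset +k
            labels = targets[:, k:]            # labels shifted by k positions
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
            )
        return loss / self.n_future
```

At inference time, the extra heads can simply be dropped to recover standard next-token decoding, or used to propose several tokens at once and speed up generation.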

Meta Joint Audio and Symbolic Conditioning for Time-Controlled Text-to-Music Generation

Generative artificial intelligence gives people new ways to explore their creativity, such as turning a text prompt into a music clip. While existing text-to-music models (such as MusicGen) rely mainly on text input to generate music, our new model, "Joint Audio and Symbolic Conditioning for Time-Controlled Text-to-Music Generation" (JASCO), can also accept various conditioning inputs, such as specific chords or beats, to improve control over the generated music. Specifically, we combine information bottleneck layers with temporal blurring to extract only the information relevant to each control. This allows symbolic and audio-based conditions to be incorporated into the same text-to-music generation model.
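As a rough illustration of the conditioning mechanism described above (our sketch, not JASCO's implementation; the layer sizes and names are assumptions), a bottleneck projection followed by temporal blurring might look like this:

```python
# Illustrative sketch (not JASCO's code): compress a conditioning signal through
# a low-dimensional "information bottleneck" projection, then blur it over time
# so only coarse temporal structure (e.g., chord or beat placement) is retained.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurredBottleneckCondition(nn.Module):
    def __init__(self, in_dim: int, bottleneck_dim: int = 8, blur_kernel: int = 15):
        super().__init__()
        self.proj = nn.Linear(in_dim, bottleneck_dim)  # information bottleneck
        self.blur_kernel = blur_kernel

    def forward(self, cond):
        # cond: (batch, time, in_dim), e.g. per-frame chord or melody features
        z = self.proj(cond)                              # (batch, time, bottleneck_dim)
        z = z.transpose(1, 2)                            # (batch, dim, time) for pooling
        z = F.avg_pool1d(z, self.blur_kernel, stride=1,
                         padding=self.blur_kernel // 2)  # temporal blurring
        return z.transpose(1, 2)                         # back to (batch, time, dim)
```

The narrow projection limits how much information each control can carry, and the blurring removes fine temporal detail, so the condition can steer coarse musical structure without dictating the exact waveform.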

Results show that JASCO is comparable to the evaluated baselines in generation quality while allowing better and more flexible control over the generated music. Today, we are releasing the research paper and a sample page. Later this month, we will release the inference code as part of the AudioCraft repository under the MIT license, along with the pre-trained model under CC-BY-NC. We look forward to sharing the code and model with the community.

AudioSeal

Generative AI tools are inspiring people to share their creations on social media with friends, family, and followers. As with all AI innovations, we must do our part to help ensure the responsible use of these tools. Today, we are releasing AudioSeal, which we believe is the first audio watermarking technique specifically designed for the local detection of AI-generated speech, capable of accurately pinpointing AI-generated segments within longer audio clips. AudioSeal improves on traditional audio watermarking by focusing on the detection of AI-generated content rather than steganography. Unlike traditional methods that rely on complex decoding algorithms, AudioSeal's local detection approach enables faster and more efficient detection. This design increases detection speed by 485 times compared to previous methods, making it highly suitable for large-scale and real-time applications. Our approach achieves state-of-the-art performance in robustness and imperceptibility of audio watermarking.
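To illustrate what "local detection" means in practice, here is a small, purely illustrative sketch (not the AudioSeal API): assume a detector produces a per-frame probability that the audio carries a watermark; thresholding those scores then localizes the AI-generated spans inside a longer clip.

```python
# Conceptual sketch of frame-level ("local") watermark detection, in the spirit
# of AudioSeal but not its actual API: a detector scores every audio frame, and
# thresholded scores localize the AI-generated segments inside a longer clip.
import numpy as np

def localize_watermarked_segments(frame_scores: np.ndarray,
                                  frame_rate: float,
                                  threshold: float = 0.5,
                                  min_duration: float = 0.2):
    """frame_scores: per-frame probabilities in [0, 1] from a (hypothetical) detector."""
    flagged = frame_scores > threshold
    segments, start = [], None
    for i, hit in enumerate(flagged):
        if hit and start is None:
            start = i                                   # segment opens
        elif not hit and start is not None:
            if (i - start) / frame_rate >= min_duration:
                segments.append((start / frame_rate, i / frame_rate))
            start = None                                # segment closes
    if start is not None and (len(flagged) - start) / frame_rate >= min_duration:
        segments.append((start / frame_rate, len(flagged) / frame_rate))
    return segments  # list of (start_sec, end_sec) spans likely AI-generated
```

Because flagging a segment only requires reading these frame-level scores, rather than decoding a hidden message from the entire clip, this style of detection lends itself to the large-scale, real-time use cases mentioned above.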

AudioSeal is released under a commercial license. This is just one of several responsible research avenues we are sharing to prevent the misuse of generative AI tools. We have embedded similar watermarks in SeamlessM4T v2 (our foundational text and speech translation model) and Audiobox-generated speech samples. We have further detailed watermarking methods for image, speech, and text models in our recent releases.

Partnering to Support the Release of the PRISM Dataset

Feedback from a diverse range of people is crucial for improving LLMs, yet open questions remain in the research community about the methods, domains, and populations involved in the feedback process. To help address these questions, we worked with external partners to support the release of the PRISM dataset, which maps the sociodemographics and stated preferences of 1,500 diverse participants from 75 countries and regions. The dataset maps each person's preferences and fine-grained feedback onto 8,011 live conversations with 21 different LLMs.

Meta advised our external partners in compiling the PRISM dataset, which centers conversations on subjective and multicultural topics where interpersonal and cross-cultural disagreement is likely. Our paper demonstrates the usefulness of PRISM through three case studies on dialogue diversity, preference diversity, and welfare outcomes, showing that it matters which humans set alignment norms. We hope this will serve as a community resource, inspire broader participation in AI development, and encourage a more inclusive approach to technology design.

Get the dataset from our external partners
Read the technical report

Measuring and Improving Geographical Bias in Text-to-Image Generation Systems

It is important that text-to-image models work well for everyone and reflect the world's geographical and cultural diversity. Improving these models requires new tools that help researchers better understand their potential shortcomings. Toward that goal, here is a detailed look at our recent research and progress:

We developed an automatic metric called "DIG In" to evaluate potential geographical disparities in text-to-image models. In addition, to understand how views of geographic representation differ across regions, we conducted a large-scale annotation study, collecting over 65,000 annotations and over 20 survey responses per example, covering attractiveness, similarity, consistency, and shared suggestions, to improve both automatic and human evaluation of text-to-image models.

Through this research, we learned that when judging geographic representation, people focus on specific components of an image rather than viewing it holistically. As part of Meta FAIR's collaborative approach, we guided a team of graduate students at the University of Massachusetts Amherst in conducting follow-up evaluations that decompose the automatic metrics introduced above into foreground concepts and background representation.

Guided by the DIG In measurement work, we also explored ways to increase the diversity of text-to-image model outputs. In this direction, we introduced contextualized Vendi Score guidance, which extends our earlier feedback-guidance work and uses inference-time interventions to steer state-of-the-art latent diffusion text-to-image models toward greater representational diversity in generated samples, while maintaining or improving image quality and prompt consistency.
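For readers unfamiliar with the Vendi Score, the diversity measure the guidance above builds on, here is a small illustrative computation over a batch of image embeddings (a generic sketch, not Meta's guidance code; the source of the embeddings is left as an assumption):

```python
# Illustrative computation of the Vendi Score for a batch of image embeddings:
# the exponential of the Shannon entropy of the eigenvalues of a normalized
# similarity matrix. Not Meta's guidance code.
import numpy as np

def vendi_score(embeddings: np.ndarray) -> float:
    """embeddings: (n, d) array, one row per generated image."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    k = x @ x.T                               # cosine similarity kernel, diagonal == 1
    eigvals = np.linalg.eigvalsh(k / len(x))  # eigenvalues sum to 1
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))             # 1 = identical samples, n = fully diverse
```

Conceptually, the guidance intervenes at inference time to push batches of generated samples toward a higher score of this kind, conditioned on the prompt's context, without retraining the underlying model.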