Exclusive Interview with Luma AI's Chief Scientist: We Believe More in the Scaling Laws of Multimodality

Luma AI has garnered widespread attention with its video generation model, Dream Machine, which can generate videos with a large range of motion and demonstrates an excellent understanding of the physical world. In this interview, Chief Scientist Jiaming Song explains why Luma AI shifted from the 3D field to video generation, and how video generation helps the company better understand and generate 3D content. He emphasizes the importance of multimodal data in enhancing the model's understanding and discusses Luma AI's vision for future AI products and business models. He also describes the balance between research and product development at Luma AI, and how user feedback guides the company's direction. Finally, he explores future trends in AI technology, including the potential of multimodal models, improvements in generation efficiency, the possibility of cost reduction, and the exploration of new business models.

I. Dream Machine Model

Model Introduction

Jiaming Song works on the full stack of model training in the Dream Machine project. The model adopts the DiT (diffusion transformer) architecture, and its output is characterized by a larger range of motion. This brings controllability issues, but it is considered important for user experience.

The large range of motion is mainly driven by model and data scale; earlier, smaller-scale models struggled to achieve the desired effect.

Comparison with Other Models

Dream Machine differs from Pika's approach. It is similar to Sora and Runway Gen-3, with a stronger connection to Sora, all of which are based on the diffusion transformer architecture.

Currently, it is mainly a consumer-facing (to C) product, and there is also demand for an API. The future product form depends on the model's capabilities and market feedback.

II. Video is a Better Route to 3D

Reason for Shifting from 3D to Video Generation

Luma chose video generation in order to create better 4D. One route to 4D is to generate 3D from images and then extend it to 4D. Another is to build video models directly and then turn their output into 4D, which is considered more reliable.

3D data is limited, whereas progress depends on larger models driven by more data.

The Driving Effect of Video on 3D

Video generation models show strong 3D capability. For example, a video model can learn depth, and even for abstract pictures it can simulate the relevant information.

The model can understand the reflection and refraction of light and can simulate their effects on different materials, which is an advantage over traditional NeRF.

It can also simulate dynamic scenes, fabrics, and so on, but there are imperfect cases, such as output that violates physical principles or subjects generated with multiple heads.

Discussion on Video Model's Understanding of the World

The video generation model's understanding of the physical rules of the world may emerge as the model scales up, similar to the development trend of language models.

Many problems that are difficult to solve now may be solved with paradigm shifts, and the development of video models is still in its infancy.

Luma's Advantages and Differences

In terms of technology, Luma focuses on generation speed and efficiency optimization, which affect both user experience and the business model.

Luma also has expectations for controllability. Its approach differs from that of players such as Kuaishou: it focuses more on developing new products based on where models are heading.

Cost Issues

It is uncertain why Sora has not been opened to the public, but Song believes that costs will decrease and new forms of application will emerge. Adding GPUs is not enough; algorithmic innovation is also needed.

III. How Luma Defines Itself

Company Nature

The company should combine the innovation ability of a research lab with the agility of a product company. A dozen people worked on the Dream Machine model, and engineering capability is important. People with a 3D background tend to be strong engineers, and the team has not encountered major capability problems. The main challenge is to unify internal goals.

Company Positioning

Luma does not define itself as a company tied specifically to 3D or to video; rather, it believes in the scaling laws of vision or multimodality. Both research and products are important, and the company's business is defined according to user feedback and technology trends. It will not suddenly pivot to hardware.

Business Model

Paying users and ARR are in good shape, but achieving positive cash flow is not necessarily a goal at this stage. Improvements in model capabilities may change the business model. Whether the business will be to C or to B is not yet clear, and the focus is on making the models and products better.

Development of Multimodal Models

Song believes that multimodal generation can be built as an end-to-end model in the future, but efficiency has to be considered for different scenarios. This work is relevant to Vision Pro and spatial computing; when the time is right, Luma may build related apps, which would be more closely tied to 4D. Luma is also keeping an eye on Li Feifei's startup, and believes that there are fewer engineering and product personnel there.

Research Directions

Jiaming Song is interested in directions such as solving the sequence length problem of transformers, understanding existing models, and the scaling of diffusion. DiT applies the transformer architecture, familiar from autoregressive models, to diffusion training, so ideas for improving the architecture may carry over across different models, although diffusion and autoregressive models still differ.
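
To make the sequence length problem concrete, here is a minimal sketch of how the token count of a transformer-based video model grows with clip length, and how the cost of full self-attention grows with it. The function names, patch sizes, and latent dimensions are hypothetical values chosen for illustration; they are not Dream Machine's actual configuration.

```python
def video_token_count(frames, height, width, patch=2, temporal_patch=1):
    """Tokens produced when a latent video is cut into space-time patches.

    All sizes are illustrative assumptions, not values from any real model.
    """
    return (frames // temporal_patch) * (height // patch) * (width // patch)


def attention_pairs(tokens):
    """Full self-attention compares every token with every other token."""
    return tokens * tokens


# Two hypothetical latent clips: same resolution, one twice as long.
short_clip = video_token_count(frames=16, height=32, width=32)
long_clip = video_token_count(frames=32, height=32, width=32)

print(short_clip, long_clip)
print(attention_pairs(long_clip) // attention_pairs(short_clip))
```

Under these assumptions, doubling the number of frames doubles the token count but quadruples the number of attention comparisons, which is one reason longer videos are expensive for this family of models.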