I. Introduction
At Sakana AI, we develop cutting-edge foundation models using nature-inspired ideas such as evolutionary optimization. Progress in deep learning still depends on researchers' trial and error and theoretical insight, and preference optimization algorithms are a crucial ingredient for aligning Large Language Models (LLMs) with human preferences. As LLMs have become better at generating hypotheses and writing code, a natural question arises: can AI be used to automate the process of AI research and discovery itself? This year we began using evolutionary algorithms to develop better LLM training methods, and conversely using LLMs as better evolutionary algorithms. We asked whether LLMs could propose better LLM training algorithms, a process we call LLM². We have published a report on synthesizing new preference optimization algorithms with LLMs, which discovered high-performing objectives such as DiscoPOP, and we have open-sourced the resulting models, objective functions, and code. This work was carried out in collaboration with the University of Oxford and the University of Cambridge.
II. Method Introduction
Evolutionary Perspective on Objective Function Discovery
We employ an LLM-driven discovery method that loops over three steps: first, we provide the LLM with an initial task and problem description, which may include examples as well as previously evaluated candidates and their performance records; next, the LLM outputs a hypothesis, a method name, and a code implementation, which is used for inner-loop training, and the resulting performance is stored; finally, we update the LLM's context with the new evaluation results. The method is general and can also be used to design model architecture components, optimization algorithms, and more. In our experiments, we found that the LLM-driven discovery process is not a random search: it alternates between exploration, fine-tuning, and knowledge-combination steps, recombining existing concepts in complementary ways.
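To make the outer loop concrete, the minimal sketch below wires the three steps together. All names here (build_prompt, discovery_loop, train_and_evaluate, the prompt wording, and the JSON proposal format) are hypothetical placeholders for illustration, not the actual implementation.

```python
# Hypothetical sketch of the LLM-driven discovery loop described above.
# The function names, prompt format, and JSON schema are illustrative
# assumptions, not the actual Sakana AI implementation.

import json

def build_prompt(task_description, history):
    """Step 1: give the LLM the task plus prior candidates and their scores."""
    lines = [task_description, "Previously evaluated objective functions:"]
    for record in history:
        lines.append(f"- {record['name']}: score={record['score']:.3f}")
    lines.append("Propose a new objective function as Python code, with a name "
                 "and a short hypothesis for why it should work.")
    return "\n".join(lines)

def discovery_loop(llm, task_description, train_and_evaluate, generations=100):
    """Run the propose -> train -> feed-back loop for a number of generations."""
    history = []
    for _ in range(generations):
        # Step 2: the LLM returns a hypothesis, a method name, and code.
        proposal = json.loads(llm(build_prompt(task_description, history)))
        score = train_and_evaluate(proposal["code"])   # inner-loop training run
        # Step 3: store the result so the next prompt reflects what was learned.
        history.append({"name": proposal["name"], "score": score})
    return max(history, key=lambda r: r["score"])
```

The callables `llm` and `train_and_evaluate` stand in for the LLM API call and the inner-loop training/evaluation pipeline, respectively.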
Discovering Cutting-Edge Preference Optimization Methods
Offline preference optimization is used to align language models with human feedback. Over the past year many such methods have been proposed, but their objective functions differ only slightly. We applied the discovery method described above to search for new preference optimization algorithms, running it for roughly 100 generations and recording the best performers. Many of the objective functions named and written by the LLM outperformed human-designed ones. After evaluation on held-out tasks such as AlpacaEval 2.0, the "LRML" loss performed particularly well, and we named it Discovered Preference Optimization (DiscoPOP). It has interesting characteristics, such as non-convexity, and it transfers well to other tasks and hyperparameter settings. Compared with existing methods such as DPO, it achieves higher scores while deviating less from the base model.
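For concreteness, the sketch below contrasts a standard DPO-style loss with a log-ratio-modulated loss of the kind DiscoPOP describes, in which a logistic term and an exponential term are blended by a sigmoid gate on the log-ratio difference. The inputs are per-example summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model. The exact form and constants (for example the gating temperature used here) are assumptions for illustration; the released DiscoPOP code should be treated as authoritative.

```python
# Illustrative sketch only: a DPO-style loss and a DiscoPOP/LRML-like
# log-ratio-modulated loss. Constants such as tau=0.05 are assumptions;
# consult the released DiscoPOP code for the exact definition.

import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO: logistic loss on the policy-vs-reference log-ratio difference."""
    rho = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    return -F.logsigmoid(beta * rho)

def discopop_like_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
                       beta=0.1, tau=0.05):
    """Blend a logistic (DPO-style) term and an exponential term, gated by a
    sigmoid of the log-ratio difference; this gating makes the loss non-convex."""
    rho = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    gate = torch.sigmoid(rho / tau)            # mixing coefficient in [0, 1]
    logistic = -F.logsigmoid(beta * rho)       # DPO-style component
    exponential = torch.exp(-beta * rho)       # exponential component
    return (1 - gate) * logistic + gate * exponential
```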
III. Future Outlook
There are many paths to further advance this automated discovery process, such as using more of the information collected in each generation, meta-learning, or modulating the prompt structure to sample better candidate solutions. Our work highlights the potential of modern AI to drive self-improvement processes. Looking ahead, we envision this method operating in an open-ended manner, with LLMs repeatedly proposing modifications to parts of themselves or of other systems, ultimately feeding the improvements back into themselves. We studied the code-proposal capabilities of various LLMs to assess feasibility and ultimately used GPT-4; we are also running further experiments with open-source LLMs, with promising results. In the future, we want to run this discovery process in a closed loop with open models to generate self-improving AI research.