We propose Adam-mini, an optimizer whose performance matches, and in some cases exceeds, that of the widely used AdamW, while using 45% to 50% less memory.
Adam-mini saves memory by cutting down the learning-rate resources in Adam, namely the per-coordinate 1/√v terms. Our analysis shows that a large fraction (≥90%) of the second-moment estimates v in Adam, which act as per-coordinate learning rates, can be removed without harming performance, provided two conditions are met (a back-of-the-envelope estimate of the resulting memory saving follows below).
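To see why removing most of v translates into a 45% to 50% memory saving, the following back-of-the-envelope sketch counts the optimizer states only. It is an illustration under stated assumptions rather than a measurement from our experiments: fp32 optimizer states, a 7B-parameter model as the example size, and whatever remains of v counted simply as a fixed fraction.

```python
# Rough optimizer-state accounting (illustrative assumptions: fp32 states,
# and the remaining part of v counted as a fixed fraction of its original size).
def optimizer_state_bytes(num_params, v_fraction_kept, bytes_per_state=4):
    m_bytes = num_params * bytes_per_state                    # first moment m, one value per coordinate
    v_bytes = num_params * v_fraction_kept * bytes_per_state  # second moment v, mostly removed
    return m_bytes + v_bytes

n = 7_000_000_000                                        # e.g. a 7B-parameter model
adamw = optimizer_state_bytes(n, v_fraction_kept=1.0)    # AdamW keeps all of v
mini = optimizer_state_bytes(n, v_fraction_kept=0.1)     # keep at most 10% of v
print(f"optimizer-state reduction: {1 - mini / adamw:.0%}")  # prints 45%
```

Since Adam's state is dominated by m and v of equal size, dropping nearly all of v removes close to half of the optimizer state, which is where the 45% to 50% figure comes from.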
The two conditions are as follows. First, the parameters must be partitioned into blocks following the principle suggested by the Hessian structure; this partition is not arbitrary but reflects the model's structure, so that the parameters within each block are closely related. Second, each block is assigned a single learning rate, carefully chosen to suit that block. A minimal sketch of this blockwise update appears below.
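The sketch below illustrates the blockwise idea as a standalone PyTorch-style optimizer. It is an illustration of the mechanism described above, not our exact implementation: for simplicity each parameter tensor is treated as one block (our partition follows the Hessian structure and is generally finer), weight decay is omitted, and the class name and hyperparameter defaults are assumptions made here.

```python
import torch

class BlockwiseAdamSketch:
    """Illustrative only: one shared second-moment scalar v per block,
    with each parameter tensor treated as a single block."""
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.params = [p for p in params]
        self.lr, (self.beta1, self.beta2), self.eps = lr, betas, eps
        self.state = [
            {"m": torch.zeros_like(p),
             "v": torch.zeros((), device=p.device, dtype=p.dtype),
             "t": 0}
            for p in self.params
        ]

    @torch.no_grad()
    def step(self):
        for p, s in zip(self.params, self.state):
            if p.grad is None:
                continue
            g = p.grad
            s["t"] += 1
            # per-coordinate first moment, exactly as in Adam(W)
            s["m"].mul_(self.beta1).add_(g, alpha=1 - self.beta1)
            # a single scalar second moment per block: an EMA of mean(g^2)
            s["v"].mul_(self.beta2).add_(g.pow(2).mean(), alpha=1 - self.beta2)
            m_hat = s["m"] / (1 - self.beta1 ** s["t"])
            v_hat = s["v"] / (1 - self.beta2 ** s["t"])
            # every coordinate in the block shares the same 1/sqrt(v) step size
            p.add_(m_hat / (v_hat.sqrt() + self.eps), alpha=-self.lr)
```

Compared with Adam(W), the per-coordinate tensor v is replaced by one scalar per block, which is where the memory saving comes from.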
We further find that, for each parameter block, there exists a single high-quality learning rate that can outperform Adam on that block, provided one has sufficient resources to search for it.
Building on these findings, we provide a cost-effective way to find good learning rates for the blocks, which yields the Adam-mini optimizer. We verify Adam-mini empirically on a variety of language models across pre-training, supervised fine-tuning, and reinforcement learning from human feedback (RLHF). In all of these settings, Adam-mini performs on par with AdamW, and in many cases better.
The reduced memory footprint brings a further benefit: it lowers the communication overhead between GPUs and CPUs, which directly increases system throughput. For example, when pre-training Llama2-7B, Adam-mini achieves 49.6% higher throughput than AdamW; on 2× A800-80GB GPUs, this saves 33% of the wall-clock time of pre-training (a quick arithmetic check follows below).
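The two reported numbers are consistent with each other, assuming both refer to the same fixed pre-training workload; the short check below is arithmetic only, not an additional measurement.

```python
# Relating the reported 49.6% throughput gain to the reported wall-clock saving:
# the same workload at 1.496x throughput takes 1/1.496 of the time.
throughput_gain = 0.496
time_saved = 1 - 1 / (1 + throughput_gain)
print(f"wall-clock time saved: {time_saved:.1%}")  # ~33.2%, matching the ~33% above
```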