Large-scale Drug-like Molecule Quantum Chemistry Dataset and Neural Network Model Benchmarking

This paper introduces ∇2DFT, a dataset containing quantum chemical properties of approximately 2 million drug-like molecules. The authors benchmark several state-of-the-art neural network models on it, measuring how well they predict molecular energies, interatomic forces, and Hamiltonian matrices, among other tasks. They also test how well the models optimize molecular conformations. Together, the dataset and benchmark provide a resource for developing better machine learning models for quantum chemistry.

Computational quantum chemistry methods are central to the chemical sciences. They provide accurate approximations of molecular properties, which are essential for computer-aided drug discovery and many other areas. Their main drawback is high computational complexity, which limits their scalability and hinders their use in large-scale applications.

Neural network potentials (NNPs) have emerged as a promising alternative that could overcome these limitations. NNPs have requirements of their own, however: they need large and diverse datasets for training in order to perform well.

In response, this work presents ∇2DFT, a dataset and benchmark built on nablaDFT. It contains twice as many molecular structures and three times as many conformations as nablaDFT, adds new data types and tasks, and comes with state-of-the-art models. The data include energies, forces, up to 17 molecular properties, Hamiltonian and overlap matrices, and a wavefunction object. All calculations were performed at the DFT level (ωB97X-D/def2-SVP) for each conformation. ∇2DFT is also the first dataset to include relaxation trajectories for a large number of drug-like molecules, which makes it especially valuable for drug research and related areas.
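To illustrate why Hamiltonian and overlap matrices are naturally stored together, the sketch below solves the generalized eigenvalue problem H C = S C ε that relates them to orbital energies in a non-orthogonal basis. It is a minimal example under stated assumptions: the function name `orbital_energies` is hypothetical, the 2x2 matrices are made-up toy numbers rather than dataset entries, and nothing here reflects the dataset's actual storage format or API.

```python
import numpy as np

def orbital_energies(hamiltonian, overlap):
    """Solve H C = S C diag(eps) for symmetric H and positive-definite S."""
    # Loewdin symmetric orthogonalization: X = S^(-1/2)
    s_vals, s_vecs = np.linalg.eigh(overlap)
    x = s_vecs @ np.diag(s_vals ** -0.5) @ s_vecs.T
    # Transform H to the orthonormal basis and diagonalize it there
    h_ortho = x @ hamiltonian @ x
    eps, c_ortho = np.linalg.eigh(h_ortho)
    # Back-transform the coefficients to the original non-orthogonal basis
    coeffs = x @ c_ortho
    return eps, coeffs

# Toy matrices (illustrative values only, not taken from the dataset)
H = np.array([[-1.0, -0.5],
              [-0.5, -0.8]])
S = np.array([[1.0, 0.3],
              [0.3, 1.0]])

eps, C = orbital_energies(H, S)
print("orbital energies:", eps)
```

A model that predicts the Hamiltonian matrix can be checked in the same way: given the overlap matrix for the same basis, the predicted orbital energies follow from this eigenproblem and can be compared with the reference values.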

In addition to the dataset, we introduce a new benchmark for evaluating NNPs on three kinds of tasks: molecular property prediction, Hamiltonian prediction, and conformation optimization. The benchmark gives a clearer picture of how NNPs perform on each task and a basis for improving them.
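The conformation optimization task can be pictured as repeatedly moving atoms along the predicted forces until the forces vanish. The sketch below is only an illustration of that idea: `toy_energy_and_forces` is a hypothetical stand-in for an NNP (a single harmonic bond between two atoms), and plain steepest descent is assumed here for simplicity; the source does not say which optimizer the benchmark uses.

```python
import numpy as np

def toy_energy_and_forces(positions, r0=1.0, k=1.0):
    """Hypothetical stand-in for an NNP: one harmonic bond between two atoms."""
    diff = positions[1] - positions[0]
    r = np.linalg.norm(diff)
    energy = 0.5 * k * (r - r0) ** 2
    # Forces are the negative gradient of the energy w.r.t. atomic positions
    force_on_atom0 = k * (r - r0) * diff / r
    forces = np.array([force_on_atom0, -force_on_atom0])
    return energy, forces

def relax(positions, step=0.1, fmax=1e-4, max_steps=1000):
    """Steepest-descent relaxation; returns final positions and the trajectory."""
    positions = np.array(positions, dtype=float)
    trajectory = [positions.copy()]
    for _ in range(max_steps):
        _, forces = toy_energy_and_forces(positions)
        # Stop once the largest force component is below the threshold
        if np.abs(forces).max() < fmax:
            break
        positions = positions + step * forces
        trajectory.append(positions.copy())
    return positions, trajectory

# Start from a stretched bond (1.5 instead of the equilibrium length 1.0)
start = np.array([[0.0, 0.0, 0.0],
                  [1.5, 0.0, 0.0]])
final, traj = relax(start)
bond = np.linalg.norm(final[1] - final[0])
print("relaxed bond length:", bond, "after", len(traj) - 1, "steps")
```

The list of intermediate geometries returned by `relax` is a toy analogue of a relaxation trajectory: a sequence of conformations with decreasing energy that ends near a local minimum.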

Finally, we propose a scalable NNP training framework and implement 10 models in it. The framework's scalability supports the continued development and broader application of NNPs across different scenarios, making them a more practical alternative to quantum chemistry methods.