Computational Chemistry and Machine Learning in the Vogiatzis Group
Research in the Vogiatzis Group centers on the development of computational methods based on electronic structure theory and machine learning algorithms for describing chemical systems relevant to clean, green technologies.
“We are particularly interested in new methods for non-covalent interactions and bond-breaking reactions of small molecules with transition metals,” Vogiatzis said. “Our overall objectives are to elucidate the fundamental physical principles underlying the reactivity and properties of molecules and materials, as well as to assist in the interpretation of experimental data.”
In June 2020, the group published “Representation of molecular structures with persistent homology for machine learning applications in chemistry” in Nature Communications. The work was a collaboration between graduate student Jacob Townsend, undergraduate student John Hymel, and Assistant Professor Konstantinos Vogiatzis of the Department of Chemistry, together with Cassie Micucci and Vasileios Maroulas of the Department of Mathematics. The paper presents a novel molecular representation based on persistent homology, a branch of applied topology, which encodes the atomistic structure of molecules.
They began by using density functional theory (DFT) to compute the CO2 interaction energies of 100 organic molecules. “Since the initial, limited 100 data points were not capturing the diversity of the GDB-9 database, we applied a technique called active learning in order to incrementally obtain data which helped us efficiently screen the 133,885 molecules,” Vogiatzis said. “We found out that the combination of PIs [persistence images] with active learning performed well with data (interaction energies) from only 220 molecules in order to identify new molecules with stronger CO2 binding.”
Their data-driven methodology identified previously unknown molecular patterns that increase the CO2 affinity of organic molecules.
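The active learning idea described above can be illustrated with a minimal sketch. It is not the group's implementation: the toy energy function, the one-dimensional "descriptor," the nearest-neighbor predictor, and the distance-based selection rule are all assumptions chosen to keep the example short, and the pool and budget sizes are far smaller than the 133,885 molecules and 220 calculations mentioned in the article.

```python
import random


def expensive_energy(x):
    # Hypothetical stand-in for a DFT CO2 interaction energy (toy function,
    # arbitrary units). More negative means stronger binding.
    return (x - 0.7) ** 2 - 5.0


def nearest_labeled(x, labeled):
    # Return (distance, energy) of the labeled point closest to x.
    return min((abs(x - xl), e) for xl, e in labeled.items())


def predict(x, labeled):
    # Toy surrogate model: copy the energy of the nearest labeled point.
    return nearest_labeled(x, labeled)[1]


def active_learning(pool, n_initial, n_total, batch_size, seed=0):
    rng = random.Random(seed)
    # Start from a small random set of "computed" molecules.
    labeled = {x: expensive_energy(x) for x in rng.sample(pool, n_initial)}
    while len(labeled) < n_total:
        unlabeled = [x for x in pool if x not in labeled]
        # Assumed selection rule: candidates farthest from any labeled point
        # are treated as the least well described and are computed next.
        unlabeled.sort(key=lambda x: nearest_labeled(x, labeled)[0], reverse=True)
        take = min(batch_size, n_total - len(labeled))
        for x in unlabeled[:take]:
            labeled[x] = expensive_energy(x)
    return labeled


pool = [i / 999 for i in range(1000)]  # toy candidate "molecules"
labeled = active_learning(pool, n_initial=10, n_total=30, batch_size=5)

# Screen the whole pool with the cheap surrogate and pick the strongest binder.
best = min(pool, key=lambda x: predict(x, labeled))
print(len(labeled), "expensive calculations used to screen", len(pool), "candidates")
```

The point of the loop is that the expensive calculation is run only on the candidates the current model describes worst, so the labeled set grows where it adds the most information.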
The Vogiatzis Group broke a record with their work “Transferable MP2-Based Machine Learning for Accurate Coupled-Cluster Energies.”
Machine learning methods have enabled the low-cost evaluation of molecular properties such as energy at an unprecedented scale. While many such applications have focused on molecular input based on geometry, few studies consider representations based on the underlying electronic structure.
Directing attention to the electronic structure poses a unique challenge, but it allows for a more detailed representation of the underlying physics and how it affects molecular properties. The goal of this work is to efficiently encode a lower-cost correlated wave function derived from MP2 in order to predict a higher-cost coupled-cluster singles-and-doubles (CCSD) wave function, based on correlation-pair energies and the contributing electron promotions (excitations) and integrals.
The new molecular representation explores the short-range behavior of electron correlation and uses distinct models that differentiate between two-electron promotions from the same molecular orbital and from two different orbitals. The group presents a re-engineered set of input features that provide an intuitive description of the orbital properties involved in electron correlation. The overall models are found to be highly transferable and size extensive, requiring very few training instances to approach chemical accuracy for a broad spectrum of organic molecules.
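The idea of using separate models for same-orbital and different-orbital electron pairs can be sketched as follows. This is an illustration only, not the published method: the pair energies, the feature dictionary, and the two linear "models" with their scaling factors are hypothetical placeholders. The one established fact the sketch relies on is that the correlation energy can be written as a sum of pair contributions, which is also why such a scheme scales naturally with molecule size.

```python
def predict_ccsd_correlation(mp2_pairs, diag_model, offdiag_model):
    """Sum predicted pair energies, routing each pair to its own model."""
    total = 0.0
    for (i, j), features in mp2_pairs.items():
        # Same occupied orbital (i == j) versus two different orbitals.
        model = diag_model if i == j else offdiag_model
        total += model(features)
    return total


# Hypothetical MP2 pair data for a molecule with two occupied orbitals
# (energies in hartree; the values are made up for illustration).
mp2_pairs = {
    (0, 0): {"e_mp2": -0.020},
    (0, 1): {"e_mp2": -0.010},
    (1, 1): {"e_mp2": -0.030},
}


# Placeholder models: a trained regressor would take the place of each one.
def diag_model(features):
    return 1.1 * features["e_mp2"]  # assumed scaling, illustration only


def offdiag_model(features):
    return 0.9 * features["e_mp2"]  # assumed scaling, illustration only


e_corr = predict_ccsd_correlation(mp2_pairs, diag_model, offdiag_model)
print(f"Predicted correlation energy: {e_corr:.4f} hartree")
```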
“Coupled-cluster theory is the level of theory that provides the most accurate quantum chemical results in a reasonable computational time. Typically, we need ~10 minutes for computing the energy of a small molecule with coupled-cluster and for a database with ~133,000 small molecules, we will need ~1,330,000 minutes or ~2.5 years of computations,” Vogiatzis said. “In this work, we demonstrated that we can use the results from only 100 coupled-cluster calculations for training a machine learning model that can predict, without loss of accuracy, the energy of the full 133,000-molecule database in a few hours.”
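The cost estimate in the quote can be checked with a few lines of arithmetic. The figures come from the quote itself; the only assumptions are that the calculations run one after another and that a year has 365 days.

```python
MINUTES_PER_CALCULATION = 10   # approximate cost of one coupled-cluster energy
N_MOLECULES = 133_000          # approximate size of the database
N_TRAINING = 100               # coupled-cluster calculations used for training

# Brute-force cost: every molecule computed with coupled-cluster, serially.
total_minutes = MINUTES_PER_CALCULATION * N_MOLECULES
total_years = total_minutes / (60 * 24 * 365)

# Cost of generating the training data only.
training_hours = N_TRAINING * MINUTES_PER_CALCULATION / 60

print(f"Full database: {total_minutes:,} minutes (about {total_years:.1f} years)")
print(f"Training set:  about {training_hours:.0f} hours of coupled-cluster time")
```

The training-set figure covers only the 100 reference calculations; the cost of the MP2 calculations and of the model predictions is not included here.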