*Editors:* Kristof T. Schütt, Stefan Chmiela, Anatole von Lilienfeld, Alexandre Tkatchenko, Koji Tsuda, Klaus-Robert Müller

The upcoming book covers the topics of the IPAM long program "Understanding Many-Particle Systems with Machine Learning" and our recently organized NIPS workshop "Machine Learning for Molecules and Materials". It will be composed of technical parts covering representations for molecules and materials, ML across chemical compound space and ML for potential energy surfaces.

Introduction | |

[contribution of user mueller will appear here: Introduction to kernel learning] | |

[contribution of user hermann will appear here: tbd] | |

Grégoire Montavon: Introduction to Neural Networks. Neural networks are powerful and scalable learning machines that are capable of building highly complex and structured nonlinear functions. While this makes them a priori good candidates for unraveling the complexity of physical systems, the real-valued nature of the data, the high signal-to-noise ratio, and the need to extrapolate from limited data can make it difficult to take full advantage of neural networks in practice. This chapter first reviews key concepts underlying neural networks, in particular backpropagation and the analysis of the error function. Then, we provide a list of practical steps to be taken for training a model successfully, and explain how basic physics knowledge can be efficiently incorporated into the model. Finally, we introduce a technique to extract human-interpretable insight from the neural network's predictions. | |

Representations | |

Gábor Csányi, Michael Willatt, Michele Ceriotti: Machine-learning of atomic-scale properties based on physical principles. We briefly summarize the kernel regression approach, as used recently in materials modelling, to fitting functions, particularly potential energy surfaces, and highlight how the linear algebra framework can be used to both predict and train from linear functionals of the potential energy, such as the total energy and atomic forces. We then give a detailed account of the Smooth Overlap of Atomic Positions (SOAP) representation and kernel, showing how it arises from an abstract representation of smooth atomic densities, and how it is related to several popular density-based representations of atomic structure. We also discuss recent generalisations that allow fine control of correlations between different atomic species, prediction and fitting of tensorial properties, and also how to construct structural kernels (applicable to comparing entire molecules or periodic systems) that go beyond an additive combination of local environments. | |

Matti Hellström and Jörg Behler: High-Dimensional Neural Network Potentials for Atomistic Simulations. High-dimensional neural network potentials, proposed by Behler and Parrinello in 2007, have become an established method to calculate potential energy surfaces with first-principles accuracy at a fraction of the computational costs. The method is general and can describe all types of chemical interactions (e.g. covalent, metallic, hydrogen bonding, and dispersion) for the entire periodic table, including chemical reactions, in which bonds break or form. Typically, many-body atom-centered symmetry functions, which incorporate the translational, rotational and permutational invariances of the potential-energy surface exactly, are used as descriptors for the atomic environments. This chapter describes how such symmetry functions and high-dimensional neural network potentials are constructed and validated. | |
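The invariance properties named in the abstract above can be checked on a small example. The following is a minimal sketch of a radial atom-centered symmetry function with a cosine cutoff, not the chapter's implementation; the function names, the parameter values (`eta`, `r_s`, `r_c`) and the three-atom geometry are assumptions chosen only for illustration.

```python
import math


def cutoff(r, r_c):
    """Cosine cutoff: decays smoothly from 1 at r = 0 to 0 at r = r_c."""
    if r >= r_c:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0)


def radial_symmetry_function(positions, i, eta=1.0, r_s=0.0, r_c=6.0):
    """Radial descriptor of atom i: sum over all other atoms j of
    exp(-eta * (r_ij - r_s)**2) * cutoff(r_ij, r_c)."""
    total = 0.0
    for j, pos_j in enumerate(positions):
        if j == i:
            continue
        r_ij = math.dist(positions[i], pos_j)
        total += math.exp(-eta * (r_ij - r_s) ** 2) * cutoff(r_ij, r_c)
    return total


def rotate_z(p, angle):
    """Rotate a 3D point about the z axis."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


# Hypothetical three-atom geometry (illustrative coordinates only).
atoms = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
g_ref = radial_symmetry_function(atoms, 0)

# Rotate every atom, then translate every atom by the same vector.
moved = [tuple(c + 1.5 for c in rotate_z(p, 0.7)) for p in atoms]
g_moved = radial_symmetry_function(moved, 0)

# Swap the order of the two neighbours of atom 0.
swapped = [atoms[0], atoms[2], atoms[1]]
g_swapped = radial_symmetry_function(swapped, 0)

print(g_ref, g_moved, g_swapped)
```

Because the descriptor depends only on interatomic distances and sums over neighbours, the three printed values agree up to floating-point rounding.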

Felix Andreas Faber, Anders S. Christensen and O. Anatole von Lilienfeld: Quantum Machine Learning with Response Operators in Chemical Compound Space. The choice of how to represent a chemical compound has a considerable effect on the performance of quantum machine learning (QML) models based on kernel ridge regression (KRR). A carefully constructed representation can lower the prediction error for out-of-sample data by several orders of magnitude with the same training data. This is a particularly desirable effect in data-scarce scenarios, which are common in first-principles-based chemical compound space explorations. Unfortunately, representations which result in KRR models with low and steep learning curves for extensive properties, such as energies or polarizabilities, do not necessarily lead to well-performing models for response properties. In this chapter we review the recently introduced FCHL18 representation [1], in combination with kernel-based QML models to account for response properties by including the corresponding operators in the regression [2]. FCHL18 was designed to describe an atom in its chemical environment, allowing one to measure distances between elements in the periodic table, and consequently providing a metric for both structural and chemical similarities between compounds. QML models using FCHL18 display low and steep learning curves for energies of molecules, clusters, and crystals. By contrast, the same QML models exhibit less favorable learning for other properties, such as forces, electronic eigenvalues, or dipole moments. We discuss the use of the electric field differential operator within a kernel-based operator QML (OQML) approach, resulting in the same predictive accuracy for molecular dipole moments as conventional QML with up to 20x fewer training instances. | |

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl: Message Passing Neural Networks. Supervised learning on molecules has incredible potential to be useful in chemistry, drug discovery, and materials science. Luckily, several promising and closely related neural network models invariant to molecular symmetries have already been described in the literature. These models learn a message passing algorithm and aggregation procedure to compute a function of their entire input graph. In this chapter, we describe a general common framework for learning representations on graph data called Message Passing Neural Networks (MPNNs) and show how several prior neural network models for graph data fit into this framework. This chapter overlaps substantially with [10], and has been modified to highlight more recent extensions to the MPNN framework. | |
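The message passing and aggregation procedure described in the abstract above can be sketched on a toy graph. This is a minimal illustration under stated assumptions, not the chapter's model: the message, update and readout functions here are fixed (an edge-weighted neighbour sum, a `tanh` update, and a sum over nodes) rather than learned, and the name `mpnn_readout` and the example graph are hypothetical.

```python
import math


def mpnn_readout(node_states, edges, steps=2):
    """Run `steps` rounds of message passing on an undirected graph.

    node_states: dict mapping node id -> list of floats (initial hidden state)
    edges:       dict mapping (v, w) -> float edge feature
    Returns a graph-level vector obtained by summing the final node states.
    """
    neighbours = {v: [] for v in node_states}
    for (v, w), e_vw in edges.items():
        neighbours[v].append((w, e_vw))
        neighbours[w].append((v, e_vw))

    h = {v: list(state) for v, state in node_states.items()}
    dim = len(next(iter(h.values())))

    for _ in range(steps):
        new_h = {}
        for v in h:
            # Message (assumed form): edge-weighted sum of neighbour states.
            m_v = [sum(e_vw * h[w][k] for w, e_vw in neighbours[v])
                   for k in range(dim)]
            # Update (assumed form): squash old state plus incoming message.
            new_h[v] = [math.tanh(h[v][k] + m_v[k]) for k in range(dim)]
        h = new_h

    # Readout (assumed form): sum over nodes, independent of node ordering.
    return [sum(h[v][k] for v in h) for k in range(dim)]


# Toy graph: one central node bonded to two identical outer nodes.
states = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 1.0]}
bonds = {(0, 1): 1.0, (0, 2): 1.0}
y = mpnn_readout(states, bonds)

# Relabel the nodes; the graph itself is unchanged.
relabel = {0: 2, 1: 0, 2: 1}
states_perm = {relabel[v]: s for v, s in states.items()}
bonds_perm = {(relabel[v], relabel[w]): e for (v, w), e in bonds.items()}
y_perm = mpnn_readout(states_perm, bonds_perm)

print(y, y_perm)
```

Since both the neighbour aggregation and the readout are sums, relabeling the nodes leaves the output unchanged up to floating-point rounding, which is the kind of symmetry invariance the abstract refers to.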

[contribution of user chmiela will appear here: Representations, domain knowledge, invariances] | |

Kristof T. Schütt, Alexandre Tkatchenko, Klaus-Robert Müller: Learning representations of molecules and materials with atomistic neural networks. Deep Learning has been shown to learn efficient representations for structured data such as image, text or audio. In this chapter, we present neural network architectures that are able to learn efficient representations of molecules and materials. In particular, the continuous-filter convolutional network SchNet accurately predicts chemical properties across compositional and configurational space on a variety of datasets. Beyond that, we analyze the obtained representations to find evidence that their spatial and chemical properties agree with chemical intuition. | |

Atomistic Simulations | |

Frank Noe: Machine Learning for Molecular Dynamics on Long Timescales. Molecular Dynamics (MD) simulation is widely used to analyze the properties of molecules and materials. Most practical applications, such as comparison with experimental measurements, designing drug molecules, or optimizing materials, rely on statistical quantities, which may be prohibitively expensive to compute from direct long-time MD simulations. Classical Machine Learning (ML) techniques have already had a profound impact on the field, especially for learning low-dimensional models of the long-time dynamics and for devising more efficient sampling schemes for computing long-time statistics. Novel ML methods have the potential to revolutionize long-timescale MD and to obtain interpretable models. ML concepts such as statistical estimator theory, end-to-end learning, representation learning and active learning are highly interesting for the MD researcher and will help to develop new solutions to hard MD problems. With the aim of better connecting the MD and ML research areas and spawning new research on this interface, we define the learning problems in long-timescale MD, present successful approaches and outline some of the unsolved ML problems in this application field. | |

Aldo Glielmo, Claudio Zeni, Ádám Fekete, Alessandro De Vita: Building nonparametric $n$-body force fields using Gaussian process regression. Constructing a classical potential suited to simulate a given atomic system is a remarkably difficult task. This chapter presents a framework under which this problem can be tackled, based on the Bayesian construction of nonparametric force fields of a given order using Gaussian process (GP) priors. The formalism of GP regression is first reviewed, particularly in relation to its application in learning local atomic energies and forces. For accurate regression it is fundamental to incorporate prior knowledge into the GP kernel function. To this end, this chapter details how smoothness, invariance and interaction order of a force field can be encoded into the corresponding kernel. A range of kernels is then proposed, possessing all the required properties and an adjustable parameter $n$ governing the interaction order modelled. The order $n$ best suited to describe a given system can be found within the Bayesian framework by maximisation of the marginal likelihood. The procedure is first tested on a toy model of known interaction and later applied to two real materials described at the density functional theory level of accuracy. The results show that lower order (simpler) models should be preferred not only for "simpler" systems, but also for more complex ones when small or moderate training set sizes are used. Low $n$ GPs can be further sped up by orders of magnitude by constructing the corresponding tabulated force field, here named "MFF". | |

Rodrigo Vargas-Hernandez: Gaussian Process Regression for Extrapolation of Properties of Complex Quantum Systems across Quantum Phase Transitions. For applications in chemistry and physics, machine learning is generally used to solve one of three problems: interpolation, classification or clustering. These problems use information about physical systems in a certain range of parameters or variables in order to make predictions at unknown values of these variables within the same range. The present work considers the application of machine learning to extrapolation of physical properties beyond the range of the training parameters. We show that Gaussian processes can be used to build machine learning models capable of extrapolating the quantum properties of complex systems across quantum phase transitions. The approach is based on training Gaussian process models of variable complexity by the evolution of the physical functions. We show that, as the complexity of the models increases, they become capable of predicting new transitions. We also show that, where the evolution of the physical functions is analytic and relatively simple (the function considered here is $a + b/x + c/x^3$), Gaussian process models with simple kernels (such as a simple Gaussian) yield accurate extrapolation results. We thus argue that Gaussian processes can be used as a meaningful extrapolation tool for a wide variety of problems in physics and chemistry. We discuss strategies to prevent overfitting and obtain meaningful extrapolation results without validation. | |

[contribution of user sauceda will appear here: Construction and application of machine learned force fields with coupled cluster accuracy] | |

Michael Gastegger, Philipp Marquetand: Molecular dynamics with neural-network potentials. Molecular dynamics simulations are an important tool for describing the evolution of a chemical system with time. However, these simulations are inherently held back either by the prohibitive cost of accurate electronic structure theory computations or the limited accuracy of classical empirical force fields. Machine learning techniques can help to overcome these limitations by providing access to potential energies, forces and other molecular properties modeled directly after an accurate electronic structure reference at only a fraction of the original computational cost. The present text discusses several practical aspects of conducting machine learning driven molecular dynamics simulations. First, we study the efficient selection of reference data points on the basis of an active learning inspired adaptive sampling scheme. This is followed by the analysis of a machine-learning based model for simulating molecular dipole moments in the framework of predicting infrared spectra via molecular dynamics simulations. Finally, we show that machine learning models can offer valuable aid in understanding chemical systems beyond a simple prediction of quantities. | |

Discovery | |

Alexander Shapeev, Konstantin Gubaev, Evgenii Tsymbalov, Evgeny Podryabinkin: Active learning and Uncertainty Estimation. Active learning refers to a collection of algorithms for systematically constructing the training dataset. It is closely related to uncertainty estimation: we generally do not need to train our model on samples for which our prediction already has low uncertainty. This chapter reviews active learning algorithms in the context of molecular modeling and illustrates their applications on practical problems. | |

Anand Chandrasekaran, Chiho Kim and Rampi Ramprasad: Polymer Genome: A polymer informatics platform to accelerate polymer discovery. The Materials Genome Initiative has brought about a paradigm shift in the design and discovery of novel materials. In a growing number of applications, the materials innovation cycle has been greatly accelerated as a result of insights provided by data-driven materials informatics platforms. High-throughput computational methodologies, data descriptors and machine learning are playing an increasingly valuable role in research development portfolios across both academia and industry. Polymers, especially, have long suffered from a lack of data on electronic, mechanical, and dielectric properties across large chemical spaces, causing a stagnation in the set of suitable candidates for various applications. The nascent field of polymer informatics seeks to provide tools and pathways for accelerated polymer property prediction (and materials design) via surrogate machine learning models built on reliable past data. With this goal in mind, we have carefully accumulated a dataset of organic polymers whose properties were obtained either computationally (bandgap, dielectric constant, refractive index and atomization energy) or experimentally (glass transition temperature, solubility parameter, and density). A fingerprinting scheme that captures atomistic to morphological structural features was developed to numerically represent the polymers. Machine learning models were then trained by mapping the polymer fingerprints (or features) to their respective properties. Once developed, these models can rapidly predict properties of new polymers (within the same chemical class as the parent dataset) and can also provide uncertainties underlying the predictions. Since different properties depend on different length-scale features, the prediction models were built on an optimized set of features for each individual property. Furthermore, these models are incorporated in a user-friendly online platform named Polymer Genome (www.polymergenome.org). Systematic and progressive expansion of both chemical and property spaces is planned to extend the applicability of Polymer Genome to a wide range of technological domains. | |

Daniel Schwalbe Koda, Rafael Gómez-Bombarelli: Generative Models for Automatic Chemical Design. Materials discovery is decisive for tackling urgent challenges related to energy, the environment or health care. In chemistry, conventional methodologies for innovation usually rely on expensive and incremental optimization strategies. Building a reliable mapping between structures and properties enables efficient navigation of molecular space and much more rapid design and optimization of novel useful compounds. In this chapter, we review how current deep learning and generative models address this inverse chemical design paradigm. We begin by introducing generative models in deep learning and categorizing them according to their architecture and molecular representation. The evolution and performance of popular chemical generative models in the literature are then reviewed. Finally, we highlight the prospects and challenges of automatic chemical design as a cutting-edge tool in materials development and technological progress. | |

Rickard Armiento: Database-driven High-Throughput Calculations and Machine Learning Models for Materials Design. This chapter reviews past and ongoing efforts in using high-throughput ab initio calculations in combination with machine learning models for materials design. The primary focus is on bulk materials, i.e., materials with fixed, ordered, crystal structures, although the methods naturally extend into more complicated configurations. Efficient and robust computational methods, computational power, and reliable methods for automated database-driven high-throughput computation are combined to produce high-quality data sets. This data can be used to train machine learning models for predicting the stability of bulk materials and their properties. The underlying computational methods and the tools for automated calculations are discussed in some detail. Various machine learning models and, in particular, descriptors for general use in materials design are also covered. | |

Zhufeng Hou, Koji Tsuda: Bayesian Optimization in Materials Science. Bayesian optimization (BO) is a global optimization approach that has recently gained growing attention in the materials science field for the search and design of new functional materials. Herein, we briefly give an overview of recent applications of the BO algorithm in the determination of the physical parameters of a physics model, the design of experimental synthesis conditions, the discovery of functional materials with targeted properties, and the global optimization of atomic structures. The basic methodologies of BO in these applications are also addressed. | |

Atsuto Seko and Hiroyuki Hayashi: Recommender systems for the materials discovery. Chemically relevant compositions (CRCs) and atomic arrangements of inorganic compounds have been collected as inorganic crystal structure databases. Machine learning is a unique approach to search for currently unknown CRCs from vast candidates. Firstly, we show matrix- and tensor-based recommender system approaches to predict currently unknown CRCs from database entries of CRCs. The performance of the recommender system approaches to discover currently unknown CRCs is examined. Secondly, we demonstrate classification approaches using compositional similarity defined by descriptors obtained from a set of well-known elemental representations. They indicate that the recommender system has great potential to accelerate the discovery of new compounds. | |

[contribution of user heidarzadeh will appear here: *TBA*] |

The following page provides information and the LaTeX style files needed to prepare the chapters for LNP:

https://www.springer.com/de/authors-editors/book-authors-editors/manuscript-preparation/5636