Week 02 Lecture: Machine Learning at the Intersection with Molecular Simulations

Week 02 Lecture: Machine Learning at the Intersection with Molecular Simulations¶

A Brief Survey¶

In the next four weeks, we will cover the following topics.

Basic concepts
Neural network architectures
Generators
State-of-the-art applications: AlphaFold2

What is the benefit of ML for simulators?¶

ML facilitates analysis of complex simulation data and guides sampling.
ML provides ‘optimal’ force fields.
ML now provides accurate, experiment-like structures of proteins.

“ML complements physics-based approaches and accelerates the understanding of biology.”

What is the benefit of MD for machine learners?¶

MD provides data for training, e.g. to learn how to generate dynamics.

“Why do we need to ‘understand’? ML will learn and predict everything!”

What is the idea of Machine Learning?¶

learning

A model is built (automatically) based on training data:

\[ \mathrm{action}(\mathrm{dog}_i) = f(\mathrm{interactions}, \mathrm{treats}) \]

trained

and then used to predict outcome based on input.

Typical tasks of ML¶

Regression¶

Supervised learning

input: featurized data
output: continuous values

Example: Prediction of quantum mechanical energies from molecular configurations

\( \langle\Phi_0|e^{-T}He^{T}|\Phi_0\rangle = E \)

Data classification¶

Supervised learning

input: featurized data
output: predicted categories

Example: Prediction of secondary structure from sequence

Dimensionality reduction and feature extraction¶

Unsupervised learning

input: high-dimensional data (e.g. MD simulations)
output: projection of data onto low-dimensional collective variable space

Example: Conformational clustering and identification of states sampled via MD

Generation of high-dimensional data¶

Supervised learning via advanced deep learning architectures

input: desired features of generated objects
output: ensembles of objects based on desired features and consistent with expected properties

Example: High-accuracy protein structure generation based on sequence

Models used in ML¶

In the most general sense, ML models map input data to output data.

\( \color{red}{Y}\color{black}{ = f(}\color{blue}{X}; \color{green}{W}; \color{purple}{P}; \color{grey}{N}\color{black}{)} \)

Model Input:

\( \begin{matrix} \color{blue}{\mbox{X}} & \mbox{input data} \\ \color{green}{\mbox{W}} & \mbox{weights to be optimized} \\ \color{purple}{\mbox{P}} & \mbox{model parameters} \\ \color{grey}{\mbox{N}} & \mbox{random noise (optional)} \\ \end{matrix} \)

The objective of ML is to find an optimal mapping based on training data and other prior knowledge.

Functional regression¶

Linear regression

\( y = ax + b \)

Uses: Prediction of continuous-valued output
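
As a minimal sketch of how the line \(y = ax + b\) can be obtained from data, the closed-form least-squares solution is shown below in plain Python. The function name `fit_line` and the toy data are assumptions made for illustration; they are not part of the lecture.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance(x, y) divided by variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov_xy / var_x
    # Intercept: the fitted line passes through the point of means
    b = mean_y - a * mean_x
    return a, b


# Hypothetical toy data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
```

For noisy data the same function returns the line that minimizes the mean-squared error over the training points.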

Logistic regression¶

Probability of observing certain outcome as a function of input \(x\):

\( \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x \)

Uses: Classification
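
The log-odds relation above can be inverted to give the probability \(p\) directly, which is how a fitted logistic model is typically used for classification. The sketch below assumes hypothetical coefficients `beta0` and `beta1` and a decision threshold of 0.5; neither comes from the lecture.

```python
import math


def predict_probability(x, beta0, beta1):
    """Invert log(p / (1 - p)) = beta0 + beta1 * x to obtain p."""
    log_odds = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-log_odds))


def classify(x, beta0, beta1, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if predict_probability(x, beta0, beta1) >= threshold else 0


beta0, beta1 = -2.0, 1.0  # hypothetical coefficients
# At x = 2 the log-odds are zero, so both outcomes are equally likely
p_at_boundary = predict_probability(2.0, beta0, beta1)
```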

Support vector machines¶

The goal is to find a hyperplane that maximally separates data into two (or more) classes:

\( \textbf{w}^T \textbf{x} - b = 0 \)

with

\( \textbf{w}^T \textbf{x} - b \geq 1 \) for \( \color{blue}{\mbox{class 1}} \)
\( \textbf{w}^T \textbf{x} - b \leq -1 \) for \( \color{green}{\mbox{class 2}} \)

Uses: Classification
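
Once a hyperplane is known, classification only requires the sign of \( \textbf{w}^T \textbf{x} - b \). The sketch below assumes the common convention in which the two margin planes lie at \( \textbf{w}^T \textbf{x} - b = +1 \) and \( -1 \), so that their separation is \( 2/\|\textbf{w}\| \). The hyperplane values are hypothetical, and finding the optimal \(\textbf{w}\) and \(b\) (the actual training step) is not shown.

```python
def decision_value(w, x, b):
    """Compute w^T x - b for a weight vector w, a point x, and an offset b."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b


def classify(w, x, b):
    """Assign class 1 on the non-negative side of the hyperplane, class 2 otherwise."""
    return 1 if decision_value(w, x, b) >= 0 else 2


def margin_width(w):
    """Distance between the margin planes w^T x - b = +1 and w^T x - b = -1."""
    norm = sum(wi * wi for wi in w) ** 0.5
    return 2.0 / norm


w, b = [1.0, 1.0], 3.0  # hypothetical hyperplane x1 + x2 = 3
```

Maximizing the margin corresponds to minimizing the norm of `w` subject to all training points lying on or outside their margin plane.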

Decision Trees and Random Forests¶

Decision trees generate discrete outcomes from input data based on a series of ‘decisions’.

Random forest methods combine output from multiple decision trees based on random subsets of input data.

Uses: Classification
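
To illustrate the idea, the sketch below uses the simplest possible tree, a single threshold decision on one feature (a "stump"), and builds a small forest by training each stump on a random resampling of the data and taking a majority vote. All names and the toy data are assumptions for illustration; real random forests use deeper trees and also randomize the features considered at each decision.

```python
import random
from collections import Counter


def train_stump(points, labels):
    """Pick the threshold on a 1D feature that misclassifies the fewest points.

    The stump predicts label 1 for x > threshold and label 0 otherwise.
    """
    best_threshold, best_errors = None, None
    for threshold in sorted(set(points)):
        errors = sum((x > threshold) != bool(y) for x, y in zip(points, labels))
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def train_forest(points, labels, n_trees=25, seed=0):
    """Train each stump on a random subset drawn with replacement."""
    rng = random.Random(seed)
    n = len(points)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(train_stump([points[i] for i in idx],
                                  [labels[i] for i in idx]))
    return forest


def predict(forest, x):
    """Majority vote over the predictions of all stumps."""
    votes = Counter(int(x > threshold) for threshold in forest)
    return votes.most_common(1)[0][0]


# Hypothetical 1D data: small values belong to class 0, large values to class 1
points = [0.1, 0.4, 0.5, 0.9, 1.6, 1.8, 2.2, 2.5]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(points, labels)
```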

Hidden Markov Models¶

A hidden Markov model describes transition probabilities between (hidden) states \(X_i\) and encodes the probabilities of the possible observed outcomes \(y_i\) for each state.

Uses: Analysis, data encoding, classification
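
A typical calculation with such a model is the probability of an observed sequence of outcomes, summed over all possible hidden-state paths. The sketch below uses the standard forward recursion for this; the two-state model and its probabilities are hypothetical values chosen for illustration.

```python
def forward_likelihood(initial, transition, emission, observations):
    """Probability of an observation sequence under a hidden Markov model.

    initial[i]       : probability of starting in hidden state i
    transition[i][j] : probability of moving from state i to state j
    emission[i][o]   : probability of observing outcome o in state i
    """
    n_states = len(initial)
    # alpha[i] = P(observations so far, current hidden state = i)
    alpha = [initial[i] * emission[i][observations[0]] for i in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emission[j][obs] * sum(alpha[i] * transition[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state model with two possible observed outcomes (0 and 1)
initial = [0.5, 0.5]
transition = [[0.9, 0.1], [0.2, 0.8]]
emission = [[0.8, 0.2], [0.3, 0.7]]
likelihood = forward_likelihood(initial, transition, emission, [0, 0, 1])
```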

Neural Networks¶

Connect input to output via matrix operations and activation functions. Multiple network layers can be combined.

Uses: Regression, classification, analysis, encoding, generation
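
The following sketch shows what "matrix operations and activation functions" means for a very small network with two inputs, two hidden units, and one output. The weights are hypothetical fixed numbers; in practice they are learned during training.

```python
def dense_layer(inputs, weights, biases, activation):
    """One layer: activation(W x + b), with W given as a list of rows."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(weights, biases)
    ]


def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def forward(x):
    """Forward pass through a 2 -> 2 -> 1 network with hypothetical weights."""
    hidden = dense_layer(x, [[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu)
    return dense_layer(hidden, [[2.0, 1.0]], [0.5], identity)


output = forward([3.0, 1.0])
```

Combining layers simply means feeding the output list of one `dense_layer` call into the next.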

Neural Network Architectures¶

Neural networks often involve many layers (deep learning).

Activation functions¶

Activation functions are essential for modeling non-linear relationships and for gaining the full benefit of neural networks.

Glossary of activation functions
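
The sketch below defines three commonly used activation functions and illustrates why they matter: without an activation, two stacked layers reduce to a single linear function, whereas with a non-linear activation they do not. The scalar weights are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    return math.tanh(z)


def relu(z):
    return max(0.0, z)


def two_layers(x, activation):
    """Two scalar layers: w2 * activation(w1 * x + b1) + b2 (hypothetical weights)."""
    w1, b1, w2, b2 = 2.0, -1.0, 3.0, 0.5
    return w2 * activation(w1 * x + b1) + b2


def collapsed_linear(x):
    """Without an activation, the two layers above are just one line:
    (w2 * w1) * x + (w2 * b1 + b2) = 6x - 2.5."""
    return 6.0 * x - 2.5
```

Calling `two_layers(x, lambda z: z)` reproduces `collapsed_linear(x)` for any `x`, while `two_layers(x, relu)` deviates from it wherever the hidden value is negative.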

ML in practice¶

Define problem suitable for ML¶

What will be the input?
What will be the output?
What are the performance expectations?

Input data features¶

A good choice of input data features is essential for the success of machine learning.

general molecular properties (composition, mass, charge)
atomic coordinates (subject to rotational variance)
internal coordinates (distances, angles)
dynamic and ensemble information
sequences and multiple sequence alignments
classification based on previous knowledge
experimental data (e.g. density maps)

Target output data¶

The target output data should reflect the problem at hand and the availability of high-quality training data.

continuous values (e.g. energies, forces, molecular properties)
classification (e.g. secondary structure, topology, function)
latent space projection
embedding (encoding of data for further use)
distributions (e.g. distances, angles)
interactions (e.g. intra-/intermolecular contacts, ligands, ions)
structures (or parts of them)
dynamic ensembles (or aspects of dynamics)
quality assessment (e.g. model accuracy or experimental uncertainties)

Model design¶

The main consideration is the balance between model complexity, accuracy, and ease of training.

Neural networks are flexible and powerful but may be difficult to train
Deep networks may be more accurate and transferable but require more data
Deep networks are more difficult to train than shallow models
Optimal model architecture should match input/output data shape (→ next week)
Computer hardware may limit model choices

Start from established models! (→ next week)

Model choices and hyper-parameters should be optimized as part of model training.

Model training¶

The goal of training a specific model is to find optimal weights.

This is the key step of machine learning that requires the most effort and computer resources.

How do we know which weights are optimal?¶

We define a loss function based on training data, e.g. MSE (mean-squared error) that should be minimized:

\[ J_{MSE} = \frac{1}{N} \sum_{k=1}^{N}(y_k-\hat{y}_k)^2 \]

\(y_k(\textbf{x}_{k},\textbf{w},\textbf{p})\) is the model-predicted output given weights \(\textbf{w}\) and model parameters \(\textbf{p}\) for training data item \(k\).

\(\hat{y}_k\) is the expected output from the training data.
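
A direct translation of \(J_{MSE}\) into plain Python could look as follows. The sketch follows the notation used here, with `predicted` holding the model outputs \(y_k\) and `expected` holding the training targets \(\hat{y}_k\); the function name is an assumption.

```python
def mse_loss(predicted, expected):
    """Mean-squared error between model outputs y_k and training targets."""
    n = len(predicted)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(predicted, expected)) / n


# Hypothetical values: two predictions are exact, one is off by 2
loss = mse_loss([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```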

How do we find optimal weights?¶

We typically use a gradient descent minimizer (SGD: Stochastic Gradient Descent; Adam):

The minimizer updates weights iteratively according to: \(w_{new} = w_{old} - \lambda \frac{\partial J(w)}{\partial w} \)

\(\lambda\) is the learning rate and a key parameter that may be varied during optimization.
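
As a minimal sketch of the update rule, the code below minimizes the MSE loss of a linear model \(y = ax + b\) by plain (full-batch) gradient descent with a fixed learning rate. Stochastic variants such as SGD or Adam use the same kind of update but estimate the gradient from random batches; that is not shown here. The toy data, learning rate, and step count are hypothetical.

```python
def mse_gradient(a, b, xs, ys):
    """Gradient of J = (1/N) * sum((a*x + b - y)^2) with respect to a and b."""
    n = len(xs)
    grad_a = sum(2.0 * (a * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2.0 * (a * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_a, grad_b


def gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    """Iterate w_new = w_old - learning_rate * dJ/dw for both weights."""
    a, b = 0.0, 0.0
    for _ in range(n_steps):
        grad_a, grad_b = mse_gradient(a, b, xs, ys)
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Hypothetical toy data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = gradient_descent(xs, ys)
```

With a learning rate that is too large the iteration diverges, and with one that is too small it converges slowly, which is why the learning rate is such an important parameter.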

How do we obtain gradients?¶

Gradients are usually obtained via back-propagation, i.e. application of the chain rule.
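
To make the chain rule concrete, the sketch below back-propagates through a tiny model \(y = w_2 \tanh(w_1 x)\) with a squared-error loss and compares the result with a finite-difference estimate. The model and all numerical values are assumptions chosen for illustration; deep learning frameworks automate exactly this bookkeeping.

```python
import math


def loss(w1, w2, x, target):
    """Squared error of the model y = w2 * tanh(w1 * x)."""
    y = w2 * math.tanh(w1 * x)
    return (y - target) ** 2


def backprop(w1, w2, x, target):
    """Return (dJ/dw1, dJ/dw2) using the chain rule."""
    # Forward pass, keeping intermediate values
    z = w1 * x
    hidden = math.tanh(z)
    y = w2 * hidden
    # Backward pass: propagate derivatives from the loss back to each weight
    dJ_dy = 2.0 * (y - target)
    dJ_dw2 = dJ_dy * hidden
    dJ_dhidden = dJ_dy * w2
    dJ_dz = dJ_dhidden * (1.0 - hidden ** 2)  # d tanh(z)/dz = 1 - tanh(z)^2
    dJ_dw1 = dJ_dz * x
    return dJ_dw1, dJ_dw2


def numerical_gradient(w1, w2, x, target, eps=1e-6):
    """Central finite differences, used only as a check on backprop."""
    d_w1 = (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps)
    d_w2 = (loss(w1, w2 + eps, x, target) - loss(w1, w2 - eps, x, target)) / (2 * eps)
    return d_w1, d_w2
```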

When do we stop with training?¶

When the loss function stops decreasing.

It is better to monitor model performance on a validation set and to stop when the loss on the validation set starts to increase, which indicates overfitting.
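
One common way to implement this criterion is early stopping with a "patience" setting: training stops once the validation loss has not improved for a given number of consecutive epochs, and the weights from the best epoch are kept. The sketch below applies this logic to a list of hypothetical validation losses; the function name and the patience value are assumptions.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the best epoch seen before stopping.

    Stops once the validation loss has failed to improve for `patience`
    consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch


# Hypothetical validation losses recorded after each epoch
history = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.30]
stop_at = early_stopping_epoch(history, patience=2)
```

A larger patience tolerates temporary increases in the validation loss at the cost of longer training.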

How do we optimize training performance?¶

Use GPUs or TPUs with lots of memory.
Batch processing to manage large training data sets.
Adaptive learning rate (e.g. larger steps early on and smaller steps as the optimization converges)
Use pre-trained models and apply transfer learning
Explore different model architectures and different activation functions
Normalize input data (training is usually easiest with values in the 0-1 range)
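
Two of the points above, rescaling input values to the 0-1 range and processing the training data in batches, can be sketched in a few lines. Min-max scaling is only one possible choice of rescaling, and the function names and values are hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def mini_batches(items, batch_size):
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical input feature, e.g. interatomic distances
distances = [2.0, 4.0, 6.0, 10.0]
scaled = min_max_scale(distances)
batches = list(mini_batches(scaled, 3))
```

The scaling parameters (`lo` and `hi`) should be determined from the training data and then reused unchanged for validation and test data.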

Data¶

Data is critical for machine learning. The data should provide reliable mappings between input features and target output data from which a model can be learned.

To develop rigorous models, data should be divided into three sets:

training data (70-90%): these data are used directly for training the model
validation data (10-30%): these data are used for testing model performance during training to determine when to stop and to optimize hyper-parameters
test data: these (separate) data are used for evaluating the accuracy of the final models and may consist of benchmarks that allow comparison with other methods or reference data from physical theory or experiments

The initial data should be split randomly into training and validation sets, and the split should be repeated to check that results do not depend on one particular division.
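
A random split into training and validation sets can be sketched as follows. The function name, the 20% validation fraction, and the seeds are assumptions; repeating the call with different seeds gives different random splits.

```python
import random


def split_data(items, validation_fraction=0.2, seed=None):
    """Shuffle a copy of the data and return (training, validation) lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_validation = int(round(validation_fraction * len(shuffled)))
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical data set of ten items
samples = list(range(10))
train, validation = split_data(samples, validation_fraction=0.2, seed=1)

# Repeating the split with different seeds, e.g. to average validation results
repeated_splits = [split_data(samples, 0.2, seed=s) for s in range(5)]
```

The separate test data should be set aside before any of these splits are made.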

Data biases and incorporation of additional knowledge¶

ML models perform best on domains covered extensively by the training data and transfer poorly to other scenarios for which little or no training data was available.

To prevent bias, training data should uniformly cover the broadest possible range of scenarios.

Data augmentation may provide additional data for areas not well-covered initially, e.g. for trivial cases.

Training data may be supplemented by constraints based on other knowledge that can be applied during training, e.g. physical constraints that disallow atom overlap or negative output values for properties that can only assume positive values.
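
One simple way to apply such a constraint during training is to add a penalty term to the loss function. The sketch below penalizes negative predictions for a property that can only be positive; the penalty form and the weight of 10 are assumptions for illustration, not a prescription from the lecture.

```python
def constrained_loss(predicted, expected, penalty_weight=10.0):
    """MSE plus a penalty that grows with the size of any negative prediction."""
    n = len(predicted)
    mse = sum((y - y_hat) ** 2 for y, y_hat in zip(predicted, expected)) / n
    # min(0, y) is zero for allowed (non-negative) values, so only
    # predictions that violate the constraint contribute to the penalty
    penalty = sum(min(0.0, y) ** 2 for y in predicted) / n
    return mse + penalty_weight * penalty


# Hypothetical example: the first prediction is negative and is penalized
loss_with_violation = constrained_loss([-1.0, 2.0], [0.0, 2.0])
loss_without_violation = constrained_loss([1.0, 2.0], [0.0, 2.0])
```

Both examples have the same squared error, so the difference between the two losses is entirely due to the constraint.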

Using ML models¶

Once we have a trained model, the model is easy to apply. All we need is the model architecture and the trained weights.

Forward evaluation is usually very fast, but can take time (and require GPU resources) for very large models. More significant time may be needed for generating the required input features (e.g. multiple sequence alignments as input to AlphaFold2).

The usefulness of a given ML model greatly depends on the training data. If the training data is limited, the model's transferability (and broader use) is probably also limited.

Validation of a given ML model is critical for understanding its expected accuracy.

Interpreting ML models¶

ML models are designed to make highly accurate predictions. That may be sufficient.

What if we want to understand things?¶

Direct inspection of optimized weights is usually not productive. Most models are too complex.

More insights can be gained from ablation studies that remove input features one by one and retrain the model to analyze how its performance changes.

It may also be possible to gain insights into what information is contained in the training data by challenging the trained model with unexpected input data.
