Week 09 Lab: Analysis of MD trajectories with Markov State Models

Software used: PyEMMA and mdshare

pip install pyemma

pip install mdshare

conda list  # the list should include pyemma, mdshare, matplotlib, numpy, etc.

  • PyEMMA is a set of packages for analyzing MD data with Markov state models; it can also be used on its own just to generate features or to reduce the dimensionality of the data space.

  • The documentation of PyEMMA can be found here (keep it handy for the class): http://www.emma-project.org/latest/api/index.html

  • A few packages are mentioned there (like coordinates, msm, etc.), each with its own API. We will be using the coordinates, msm, and plots packages extensively.

  • This lab follows the PyEMMA examples found here: https://github.com/markovmodel/pyemma_tutorials.

# importing modules required for the class

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import mdshare
import pyemma
from pyemma.util.contexts import settings
import nglview
import mdtraj
from threading import Timer
from nglview.player import TrajectoryPlayer
  • Download the pdb (topology) and example simulation trajectories from the PyEMMA server of Prof Frank Noe’s group.

  • The system is a pentapeptide model.

  • There are 25 separate short simulations (500 ns each), with frames saved every 0.1 ns, so each trajectory has 500/0.1 = 5000 frames.

  • The downloaded data is stored in a folder (data) under your current working directory.

# Download the pdb (topology) and files

pdb = mdshare.fetch('pentapeptide-impl-solv.pdb', working_directory='data')
files = mdshare.fetch('pentapeptide-*-500ns-impl-solv.xtc', working_directory='data')
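A quick sanity check on the downloaded data (a minimal sketch; it just confirms the numbers quoted above):

# 'files' is the list of trajectory paths returned by mdshare.fetch
print(len(files))                               # expect 25 trajectories
print(mdtraj.load(files[0], top=pdb).n_frames)  # expect 5000 frames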
  • View the system in nglview.

  • The stop_spin() function ensures that the viewer stops spinning and the window closes after 60 seconds.

# View the system in nglview as you did in the previous class

widget = nglview.show_mdtraj(mdtraj.load(pdb))
p = TrajectoryPlayer(widget)
widget.add_ball_and_stick()
widget.add_line('water')
p.spin = True
def stop_spin():
    p.spin = False
    widget.close()
Timer(60, stop_spin).start()
widget
  • Selecting features from the data using the ‘coordinates’ package of PyEMMA.

  • This is where you will find all the APIs of pyemma.coordinates: http://www.emma-project.org/v2.4/api/index_coor.html

  • We will extract features using the pyemma.coordinates.featurizer function in the following code block.

  • Notice on the featurizer documentation page that you have a lot of options to select from under the methods section. Some common choices are add_backbone_torsions, add_distances_ca, etc. As we discussed in the last class, selecting features requires knowledge of, and intuition about, the data and the process that you are targeting to observe.

  • For example, let's see how we create a feature of C\(\alpha\) distances.

# first we pass the topology to the featurizer function

distances_feat_ca = pyemma.coordinates.featurizer(pdb)
# Now we tell the featurizer which method to use to extract the features

distances_feat_ca.add_distances_ca(periodic=False, excluded_neighbors=1)
# Now we load all the data files with 'pyemma.coordinates.load' utility
# and store the features using the definition in the previous step.

ca_dist_data = pyemma.coordinates.load(files, features=distances_feat_ca)
#labels = ['C_alpha']
# TO DO
# What is the dimension of ca_dist_data?
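If you want to check your answer, one way to inspect the dimensions (a minimal sketch):

# pyemma.coordinates.load returns a list of (n_frames, n_features) arrays,
# one per trajectory
print(len(ca_dist_data))      # number of trajectories
print(ca_dist_data[0].shape)  # (frames per trajectory, number of C-alpha distance features)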

Let's do the same with backbone torsions and with the positions of backbone atoms.

Backbone torsions are known to be better features for slow processes.

You will initialize the featurizers and create the feature list in the code block below.

Follow the previous example of C\(\alpha\) distances in both cases; a possible completion is sketched after the code block.

torsions_feat = #Fill up this space
torsions_feat.add_backbone_torsions(cossin=True, periodic=False)
torsions_data = #Fill up this space
labels = ['backbone\ntorsions']

positions_feat = #Fill up this space
positions_feat.add_selection(positions_feat.select_Backbone())
positions_data = #Fill up this space
labels += ['backbone atom\npositions']
##Fill up this space: 
# What is the dimension of torsions_data? 
# How many features are there for each frame?
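If you get stuck, here is one possible completion that mirrors the C\(\alpha\)-distance example above (a sketch, not the only valid answer):

# initialize each featurizer from the topology, then load the data with it
torsions_feat = pyemma.coordinates.featurizer(pdb)
torsions_feat.add_backbone_torsions(cossin=True, periodic=False)
torsions_data = pyemma.coordinates.load(files, features=torsions_feat)

positions_feat = pyemma.coordinates.featurizer(pdb)
positions_feat.add_selection(positions_feat.select_Backbone())
positions_data = pyemma.coordinates.load(files, features=positions_feat)

# torsions_data is again a list of 25 arrays; the second axis gives the
# number of features per frame
print(torsions_data[0].shape)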

Next, we compare VAMP2 scores. VAMP stands for the variational approach for Markov processes, and the VAMP2 score is a metric for assessing the effectiveness of a set of features by measuring the timescales of the associated processes.

The following code block is taken from a PyEMMA example file.

The score_cv function computes a cross-validated VAMP2 score. We randomly split the list of independent trajectories into a training and a validation set, compute the VAMP2 score, and repeat this process several times.

Parameters are:

data : list of numpy.ndarrays (the input data).

dim : int (Number of slow/important processes to score)

lag : int (lag time for the VAMP2 scoring)

number_of_splits : int, optional, default=10. How often we repeat the splitting and score calculation.

validation_fraction : float, optional, default=0.5. Fraction of trajectories that go into the validation set during a split.

def score_cv(data, dim, lag, number_of_splits=10, validation_fraction=0.5):

    # we temporarily suppress very short-lived progress bars
    with pyemma.util.contexts.settings(show_progress_bars=False):
        nval = int(len(data) * validation_fraction)
        scores = np.zeros(number_of_splits)
        for n in range(number_of_splits):
            ival = np.random.choice(len(data), size=nval, replace=False)
            vamp = pyemma.coordinates.vamp(
                [d for i, d in enumerate(data) if i not in ival], lag=lag, dim=dim)
            scores[n] = vamp.score([d for i, d in enumerate(data) if i in ival])
    return scores


dim = 10

fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)

for ax, lag in zip(axes.flat, [5, 10, 20]):
    
    torsions_scores = score_cv(torsions_data, lag=lag, dim=dim)
    scores = [torsions_scores.mean()]
    errors = [torsions_scores.std()]
    positions_scores = score_cv(positions_data, lag=lag, dim=dim)
    scores += [positions_scores.mean()]
    errors += [positions_scores.std()]
    
    ax.bar(labels, scores, yerr=errors, color=['C0', 'C1'])
    
    ax.set_title(r'lag time $\tau$={:.1f}ns'.format(lag * 0.1))
    
    if lag == 5:
        # save for later
        vamp_bars_plot = dict(
            labels=labels, scores=scores, errors=errors, dim=dim, lag=lag)
axes[0].set_ylabel('VAMP2 score')
fig.tight_layout()

Why do you think it is better to work with backbone torsions?

Hint: think about the dimensionality of each feature space as well.
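To back up that hint, you can compare the feature-space dimensionality directly (assuming torsions_data and positions_data from the fill-in step above):

# backbone torsions live in a much lower-dimensional space than backbone
# atom positions
print('torsion features per frame: ', torsions_data[0].shape[1])
print('position features per frame:', positions_data[0].shape[1])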

In the next step, we vary only the lag time and the number of dimensions (slow processes) to see how the VAMP2 score changes.

lags = [1, 2, 5, 10, 20]
dims = [i + 1 for i in range(10)]

fig, ax = plt.subplots()
for i, lag in enumerate(lags):
    scores_ = np.array([score_cv(torsions_data, dim, lag)
                        for dim in dims])
    scores = np.mean(scores_, axis=1)
    errors = np.std(scores_, axis=1, ddof=1)
    color = 'C{}'.format(i)
    ax.fill_between(dims, scores - errors, scores + errors, alpha=0.4, facecolor=color)
    ax.plot(dims, scores, '--o', color=color, label='lag={:.1f}ns'.format(lag * 0.1))
ax.legend()
ax.set_xlabel('number of dimensions')
ax.set_ylabel('VAMP2 score')
fig.tight_layout()

The figure above tells us that there are no more than four slow processes in the data set.

  • Now we will reduce the dimensionality of the feature space by projecting it onto a few ‘important’ dimensions.

  • Remember the basis functions we were talking about in the final few slides of the lecture.

  • This is an attempt to reach as close to the basis functions as possible using a mathematical tool called tICA (Time-Lagged Independent Component Analysis).

tica = pyemma.coordinates.tica(torsions_data, lag=5)
tica_output = tica.get_output()
tica_concatenated = np.concatenate(tica_output)
## Fill up this:

# How many dimensions are there in the projected data?
# Change the inputs to positions_data. 
# What are the changes in the number of dimensions?
# Revert to torsions_data, as we need it for MSM building.
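One way to answer the questions above (a minimal sketch):

# by default TICA keeps as many ICs as are needed to explain 95% of the
# kinetic variance; dimension() reports that number
print(tica.dimension())
print(tica_concatenated.shape)  # (total number of frames, number of ICs)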

The number of ICs retained indicates roughly how many metastable basins are present in the dynamics.

  • Plot histograms of the IC values, along with the density in the IC 1/IC 2 plane.

  • Each IC shows high density over a certain range of values.

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
pyemma.plots.plot_feature_histograms(
    tica_concatenated,
    ax=axes[0],
    feature_labels=['IC1', 'IC2', 'IC3', 'IC4'],
    ylog=True)
pyemma.plots.plot_density(*tica_concatenated[:, :2].T, ax=axes[1], logscale=True)
axes[1].set_xlabel('IC 1')
axes[1].set_ylabel('IC 2')
fig.tight_layout()

Inspect one of the trajectories and see how the tICA components change with time.

Any slow transition in the dynamics should show up as a sharp jump in tICA space.

Do you see any? Comment on that.

fig, axes = plt.subplots(4, 1, figsize=(12, 5), sharex=True)
x = 0.1 * np.arange(tica_output[0].shape[0])
for i, (ax, tic) in enumerate(zip(axes.flat, tica_output[0].T)):
    ax.plot(x, tic)
    ax.set_ylabel('IC {}'.format(i + 1))
axes[-1].set_xlabel('time / ns')
fig.tight_layout()

Can you figure out how to use the second trajectory instead of the first for the above plot? Try that and see if things change.

Clustering

Let’s do clustering with the tICA outputs. We have reduced the 16 torsion features to 4 tICA components (ICs). Now we can define the states that we talked about in the lecture.

We use the k-means clustering object from PyEMMA (internally the function calls scikit-learn's k-means).

cluster = pyemma.coordinates.cluster_kmeans(
    tica_output, k=25, max_iter=5000, stride=1)
dtrajs_concatenated = np.concatenate(cluster.dtrajs)
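The discrete trajectories in cluster.dtrajs are what the MSM will be estimated from: for every frame, the index of its nearest cluster center. A quick look (a minimal sketch):

# one integer array per trajectory; each entry is the microstate (cluster)
# index assigned to that frame
print(len(cluster.dtrajs), cluster.dtrajs[0].shape)
print(dtrajs_concatenated[:10])  # first ten microstate assignments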

Visualize the clusters in tICA space.

You can see that the small black dots (the k-means cluster centers) span all the density of the ICs (even the low-density regions fall into clusters).

fig, ax = plt.subplots(figsize=(4, 4))
pyemma.plots.plot_density(
    *tica_concatenated[:, :2].T, ax=ax, cbar=True, alpha=0.3)
ax.scatter(*cluster.clustercenters[:, :2].T, s=5, c='Black')
ax.set_xlabel('IC 1')
ax.set_ylabel('IC 2')
fig.tight_layout()

Build an MSM with the clusters

The lag time is again 5 steps, which means 5 \(\times\) 0.1 ns = 0.5 ns.

msm = pyemma.msm.bayesian_markov_model(cluster.dtrajs, lag=5, dt_traj='0.1 ns')
print('fraction of states used = {:.2f}'.format(msm.active_state_fraction))
print('fraction of counts used = {:.2f}'.format(msm.active_count_fraction))

How do the eigenvalues look? Try different lag times in the code above and see how that affects them.

msm.eigenvalues()
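The eigenvalues \(\lambda_i\) at lag time \(\tau\) translate into implied timescales \(t_i = -\tau/\ln|\lambda_i|\); PyEMMA's msm.timescales() computes these for you (a minimal sketch):

# implied timescales t_i = -tau / ln|lambda_i| of the slowest processes
print(msm.timescales()[:5])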

The eigenvectors corresponding to the slowest processes contain information about what configurational changes are happening on which timescales.

We analyze the slowest processes by inspecting the values of the first four eigenfunctions projected onto the first two TICA coordinates.

Please keep lag=5 in the msm model building step to do this.

eigvec = msm.eigenvectors_right()
print('The first eigenvector is one: {} (min={}, max={})'.format(
    np.allclose(eigvec[:, 0], 1, atol=1e-15), eigvec[:, 0].min(), eigvec[:, 0].max()))

fig, axes = plt.subplots(1, 4, figsize=(15, 3), sharex=True, sharey=True)
for i, ax in enumerate(axes.flat):
    pyemma.plots.plot_contour(
        *tica_concatenated[:, :2].T,
        eigvec[dtrajs_concatenated, i + 1],
        ax=ax,
        cmap='PiYG',
        cbar_label='{}. right eigenvector'.format(i + 2),
        mask=True)
    ax.set_xlabel('IC 1')
axes[0].set_ylabel('IC 2')
fig.tight_layout()

We analyze the stationary distribution and the free energy computed over the first two TICA coordinates.

The stationary distribution is stored in msm.pi.

We compute the free energy landscape by re-weighting the trajectory frames with stationary probabilities from the MSM (returned by msm.trajectory_weights()).

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
pyemma.plots.plot_contour(
    *tica_concatenated[:, :2].T,
    msm.pi[dtrajs_concatenated],
    ax=axes[0],
    mask=True,
    cbar_label='stationary distribution')
pyemma.plots.plot_free_energy(
    *tica_concatenated[:, :2].T,
    weights=np.concatenate(msm.trajectory_weights()),
    ax=axes[1],
    legacy=False)
for ax in axes.flat:
    ax.set_xlabel('IC 1')
axes[0].set_ylabel('IC 2')
axes[0].set_title('Stationary distribution', fontweight='bold')
axes[1].set_title('Reweighted free energy surface', fontweight='bold')
fig.tight_layout()

To do:

Here is a code block for clustering with a variable number of cluster centers.

Work on n_clustercenters (a list) and tell us what you think is the optimal number of clusters for these tICA outputs of the torsion features.

You can also modify max_iter and stride to see how they affect the figure.

Once you find an optimal number of clusters, rerun the MSM building step. A possible completion of the blanks is sketched after the code block.

n_clustercenters = #Fix this

scores = np.zeros((len(n_clustercenters), 5))


for n, k in enumerate(n_clustercenters):
    for m in range(5):
        with pyemma.util.contexts.settings(show_progress_bars=False):
            _cl = pyemma.coordinates.cluster_kmeans(
                tica_output, k=k, max_iter=500, stride=10)
            _msm = pyemma.msm.estimate_markov_model(_cl.dtrajs, 5)
            scores[n, m] = _msm.score_cv(#what will be the input here?, 
                n=1, score_method='VAMP2', score_k=min(10, k))

fig, ax = plt.subplots()
lower, upper = pyemma.util.statistics.confidence_interval(scores.T.tolist(), conf=0.9)
ax.fill_between(n_clustercenters, lower, upper, alpha=0.3)
ax.plot(n_clustercenters, np.mean(scores, axis=1), '-o')
#ax.semilogx()
ax.set_xlabel('number of cluster centers')
ax.set_ylabel('VAMP-2 score')
fig.tight_layout()
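If you get stuck on the two blanks above, here is one possible completion (a sketch; the scan range is just an example):

# example scan over the number of cluster centers
n_clustercenters = [5, 10, 30, 75]

scores = np.zeros((len(n_clustercenters), 5))
for n, k in enumerate(n_clustercenters):
    for m in range(5):
        with pyemma.util.contexts.settings(show_progress_bars=False):
            _cl = pyemma.coordinates.cluster_kmeans(
                tica_output, k=k, max_iter=500, stride=10)
            _msm = pyemma.msm.estimate_markov_model(_cl.dtrajs, 5)
            # score the discretized trajectories produced by this clustering
            scores[n, m] = _msm.score_cv(
                _cl.dtrajs, n=1, score_method='VAMP2', score_k=min(10, k))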