User’s Guide¶
Running Examples & Tutorials¶
Organization & Conventions¶
Examples are organized in the subdirectories in this folder. To allow for effective testing of the examples they follow a specific structure. If any example does not follow the rules in this document it can (and should if you are willing) be filed as a bug against this repository.
Each one has a README.org
document which gives some explanation on
the example and some tips on executing it. This should explain the
inputs, outputs, and any arguments, but not the “calling convention” of
them.
All scripts are expected to be run from the example directory, e.g. for
and example called myexample
you would run a script called
example.py
like:
cd info/examples/myexample
python ./source/myexample.py
All have a similar structure with these directories before execution:
source
Where the source code files are that will run the examples.
input
Any additional input files that will be loaded by the source script files.
env
A specification of the virtual environment that the examples are to be run.
Each example will also have a tasks.py
file which gives some useful
helper commands (i.e. scripts) that can be run by installing the python
package invoke
:
pip install invoke
Once its installed you can see the available commands which we call “targets”:
inv -l
You can run some initialization by running:
inv init
Making Environments for Running Examples¶
Examples and tutorials have a bad reputation for quickly going out of date. While we can’t control whether the content of an example is relevant or not, we should make them easy to execute far into the future.
To do this we have provided a folder with all of the necessary specifications for running the examples which are pinned to the exact version numbers as when they were tested. If you can get these versions of the software then the example should run.
The easiest way to build an environment is to just run this if you cloned the entire repository:
inv env
Environments can be made either with conda
or regular old venv
.
We prefer venv
but sometimes conda
is used because of
particularly complex dependencies.
The pyversion.txt
indicates which version of python the example
should be run with. To get different versions of python we recommend
either conda
or pyenv
.
The pinned specifications should be in the requirements.txt
and
env.pinned.yaml
files for pip
and conda
packages
respectively. If you would like to install dependencies manually read
the documentation for these tools on how to install dependencies with
these files. But typically it should be like this inside a conda
env
for the pinned versions:
pip install -r env/requirements.txt
conda env update --prefix env --file env/env.pinned.yaml
The abstract requirements (from which the pinned specs are compiled) are
in requirements.in
and env.yaml
which you can also try if you
want to get the newest versions of the software. Although we don’t
guarantee the example will work then.
pip install -r env/requirements.in
conda env update --prefix env --file env/env.yaml
Running Jupyter Notebook Tutorials¶
To get started check the instructions in the user manual first.
After you have made your environment, you can install some of the
Jupyter extensions for viewing structures by running the
post_install
target with invoke i.e.:
inv env
conda activate ./_env
inv post-install
jupyter-notebook
Then launch the jupyter notebook:
jupyter-notebook
Executing Examples¶
After you have created the virtual environment and tangled any source files if necessary you should be able to run the example.
As described above you should run scripts from the example dir and not
from the source
(or _tangle_source
) dirs since paths are
typically hardcoded for convenience.
I.e. run scripts like:
python source/script.py
python _tangle_source/script.py
./source/script.sh
Also hardcoded is where any file outputs go which should always be the
_output
directory.
Cleaning Up¶
You can run the targets:
inv clean clean_env
Or just know that any directory starting with an underscore ‘_’ is temporary and may be safely removed.
Tangling Literate Examples¶
In addition to the source
directory if the README.org
has any
source code in it this should be able to be “tangled” out of it into
source code files which are executable. These should end up in the
_tangle_source
folder. These examples are called “literate” because
the explanation of them is in the document alongside the code.
There should only be one copy of source code so that there is never any
discrepencies between them. That is there is never the same script (or
snippets of a script) in both the README.org
and in the source
directory.
You can “tangle” the source code by running:
inv tangle
This expects a local, fairly modern installation of emacs
to be
present. Although hopefully this restriction will be removed in the
future.
Then run them just like the source
scripts.
../_tangle_source/script.py
How to approach learning and using wepy¶
The wepy
project is a framework and library written in python for
performing WE simulations.
That means there is no special language to learn and use other than python itself. This also means you are able to harness the power of the entire language and ecosystem in prototyping methods and analyzing data.
To use prebuilt components all it takes is some basic knowledge of python.
With a little knowledge of object-oriented programming (OOP) you can utilize a collection of functionality to help you focus prototype your complex algorithms rather than worrying about how to save data etc.
The first question is what exactly do you need to write to be able to run a wepy simulation and how to run it.
The software configuration trap¶
Some people may not understand (or agree with) the approach to the
design of wepy
as it diverges from many packages in the field. So we
will provide a summary of the issues leading to the deliberately made
choices.
Many other tools utilize some sort of separate “configuration” file that is used to specify in a declarative manner what the simulation is supposed to do. Typically, these kinds of configurations are supposed to cater to beginners that have no experience in programming. Inevitably the features of the program start to expand and the complexity of configuration files grow until the developers have accidentally created an entire programming language, which invariably is very clumsy to program. Moreover, this language is a completely isolated ecosystem.
wepy
was designed to avoid this issue and from the start and
embraces a well-designed and highly popular programming language with a
huge standard library and massive collection of highle-trusted third
party libraries; Python (version 3 specifically). So while in wepy
you are still “configuring” your simulations by building components
there is no need to limit yourself to a small insular ecosystem in a
sub-par programming language designed by a chemist when punch cards were
still cutting edge.
Many older simulation engines fall into the traps described above simply because there were no obviously dominant high-level languages like python at the time. The implementers of these engines were pioneers and we admire them for the intrepid and forthright spirit that drove them to accomplish there scientific goals given the available infrastructure at the time. However, we try not to mistake particular implementations for the important traditions. As the success of a particular tradition in scientific modelling gains popularity, as molecular dynamics has, the requirements on implementation grows as many more diverse and previously unthunk use cases emerge for the better. We endeavour to highlight these successes as best and as honestly as we can.
As equally as we respect the scientific traditions we should also look to other disciplines such as computer science which have worked just as industriously to provide tools that enable us to actualize our ideas.
Python is the lingua franca of the scientific community as of the time of writing this, and the developers feel this is for the better.
The insular ecosystem problem is essentially solved by the use of a
general-purpose programming language such as Python. For example, a
massive collection of numerical algorithms and linear algebra can be
used in your component as easily as import numpy
. Furthermore,
anyone can use or incorporate wepy into their project with a simple
import wepy
.
Furthermore, wepy
is only as system dependent as far as python is
(and consequently C). It doesn’t rely on any environmental variables for
its functioning or other special operating specific details, (excepting
performance optimizations). Everything is contained within python files.
Some basic know-how of environments is assumed to be able to even
install and configure python, but is completely orthogonal to wepy
itself.
If you wish to use some OS-specific or legacy programs, there is the
python subprocess
module that allows for doing just that, and
perhaps a wrapper package ready to be imported.
To run wepy you simply write a file that runs a simulation and then run it on the command line:
python run_wepy.py
The use of pure python files for configuring may seem complicated to someone that doesn’t know python, but when you are running complex simulations it is comforting to know that there is no magic involved.
If you are dissappointed that wepy does not follow the “Unix” philosophy, there is a simple explanation. The so-called “universal” textual interface is wholly unsuited to storing any kind of molecular simulation data (much less snapshots of simulations) and so right at the start you are stuck with the decision of what your binary format is.
Intended audience and some suggestions
With all of this in mind
wepy
is meant to be a tool for a fairly expert user that is willing to write python code and understand some basic software patterns; that is the primary users ofwepy
are the developers. We want to make “complex things possible” first and secondarily “simple things simple”. As of the1.0
release we believe that “complex things possible” has largely been achieved and now the focus is making particular use-cases more streamlined as well as broadening the number of domains addressed.That said there is no reason that specific applications of wepy could not eventually be provided with some sort of easy-to-use configuration file format, command-line or web application.
We warn anyone that wants to do this to have a very clear idea of the scope of this application and keep the above comments in mind. Some suggestions on tools for such application/configurations are:
using the orchestration sub-packages Snapshot and Configuration objects for setting up and serializing (storage via python pickles) specific simulation initial conditions for short term (as updates to the code may break these objects).
A very simple, well-specified, and human-writable format such as TOML that has simple translations to all programming languages.
A simple and popular template engine such as Jinja for generating python scripts.
Some anti-examples would be:
ad hoc unspecified configuration files
human written XML files
operating system specific environmental variables
metaprogramming: such as python metaclasses
Snapshot and Configuration files
The
orchestration
sub-package, as mentioned, does come with a serialization format for simulation snapshots and configurations. This is an advanced general purpose tool that is primarily meant for managing large numbers of interconnected simulations and for adding in checkpointing capabilities for hardware that tends to crash.The snapshot and configuration files use python pickling to be able easily serialize them which should be used with caution. Pickle files should be read with the exact same version of the software that wrote them, otherwise complete and utter loss of data should be expected. It is possible that updates to the code will not effect the readability, but should never be assumed. So this is a very short term solution to storage.
This sub-package will be discussed later as an advanced topic in depth.
Framework or library?¶
At first a pedantic distinction, but understanding which parts of wepy are the “framework” and which parts are the “library” (and which parts are both) should eliminate some confusion (and hopefully lead to a sympathizing elightenment).
It helps to understand that wepy fulfills a few roles:
wepy provides components that can be used together
wepy defines interfaces that new components may implement so that they can be used with existing components
Because python is dynamically typed (AKA duck typing) interfaces and object identities (types) are not explicitly enforced. In fact the interfaces cannot even be expressed in python syntax. Instead we rely on the following sources to determine interfaces in order of precedence:
the original developers intention
the source code
the source code docstrings
the source code comments
the written documentation
…
a person on the internet
The developers ultimately have the final say because there will always be bugs in implementation and mistakes in documentation.
Developers make mistakes and are willing to fix them, if you think they have made a mistake and there is a bug or problem in an interface please reach out for clarification before assuming it was not already though of. If you are still convinced of the issue, prepare an argument to be made for your case for a change in an interface and prepare for it to be questioned and critiqued in an objective manner.
wepy: the framework
wepy
is a framework, in that, you execute it by configuring an “engine” that then takes your configurations and uses them on your behalf. So your goal as a wepy user is to obtain the necessary components and give them to the engine (the simulation manager) which will run the entire workflow.For the learned software engineers out there
wepy
uses dependency injection (AKA inversion of control).First we assemble our components by whatever means necessary:
from some.sim.engine import SimRunner from some.resampler import SomeResampler runner = SimRunner() resampler = SomeResampler()
Then we configure the engine by passing it the things it needs (injecting the runner and resampler dependencies):
from wepy.sim_manager import Manager # create initial walkers... init_walkers = make_walkers() sim_manager = Manager(init_walkers, runner=runner, resampler=resampler)
This gives the
sim_manager
object control of our components. It expects them to have methods for running simulations and resampling. We instruct thesim_manager
to run a simulation:results = sim_manager.run_simulation(...)
And we get back our results.
In the future examples we will also give the simulation manager a collection “reporters” which will produce rich datasets as “side-effects” of the simulation. In practice, running simulations isn’t very useful without them, but for simplicity we leave them out for now.
So we see that the core framework of wepy is actually very simple and really only consists of the
Manager
object and the interface it defines (by calling specific methods of the components you gave it). The simplicity of this makes it very flexible to use with other components.wepy: the library
It is a library in the sense some of these components come freely available and off the shelf for you to use, perhaps with the tweaking of a few simple parameters.
For example you can import the WExplore resampler simply by writing:
from wepy.resampling.resamplers.wexplore import WExploreResampler
The nonsense example above does this to retrieve some components from some library not included in wepy.
Thus, wepy also includes a library of generally useful components that can be used to configure the simulation manager.
The main categories of library components used by the sim manager are:
runners
resamplers
distance metrics
boundary conditions
reporters
work mappers
There is also some libraries related to storage and analysis of WE simulation data that are not used as injected dependencies of the sim manager:
HDF5 storage
analysis
Some examples of off-the-shelf components are:
These only need to be imported and constructed, e.g.:
from wepy.runners.openmm import OpenMMRunner runner = OpenMMRunner(system, topology, integrator, platform='Reference')
wepy: the library of frameworks
As a reminder classes in OOP define a type of object, similar to how the type
Int
orstr
defines a class of possible values. To create an object from class you are said to construct it.Objects (and classes) are essentially containers for both code and data. The code of an object or class is a method, and the data is an attribute.
Classes can be created in two ways. The developer writes it defining how to construct it and what the methods and attributes are. The second way is to make a class out of another class. Classes made from other classes in OOP is called inheritance and all functionality of a super-class is shared by the sub-class.
Abstract base classes (ABC) are classes that are available for the sole purpose of making other classes from. Sub-classes of ABCs in turn can either be abstract again, or concrete.
A subclass that inherits from an abstract class only becomes concrete (and thus usable like the OpenMMRunner component) when it is customized which are additional code written by the developer.
Customizations can either add orthogonal functionality, such as new methods or attributes, or they can override functionality from a super-class, changing the behavior or type of an existing method or attribute.
The goal of all this is ultimately is two-fold:
reduce the amount of code you need to write
fix problems only once
Lets look at a real example in the
wepy.resampling.distances.receptor
module.First there is the definition of the class
ReceptorDistance
:from wepy.resampling.distances.distance import Distance class ReceptorDistance(Distance): """Common abstract class for receptor-ligand molecular systems.""" def _unaligned_image(self, state): box_lengths, box_angles = box_vectors_to_lengths_angles(state['box_vectors']) grouped_positions = group_pair(state['positions'], box_lengths, self._bs_idxs, self._lig_idxs) # then center them around the binding site centered_positions = center_around(grouped_positions, self._bs_idxs) state_image = centered_positions[self._image_idxs] return state_image def image(self, state): state_image = self._unaligned_image(state) sup_image, _, _ = superimpose(self.ref_image, state_image, idxs=self._image_bs_idxs) return sup_image
We see that ReceptorDistance is inheriting from the
Distance
class, which is defined as such:class Distance(object): """Abstract Base class for Distance classes.""" def __init__(self): """Constructor for Distance class.""" pass ...
All classes ultimately inherit from
object
.Also ignore the
self
keyword for now, except to note that all object methods must have them.The method
__init__
is how this class constructs an object. You can think of this:dist = Distance()
as being equivalent to:
dist = Distance.__init__()
We notice that the
Distance
class defines this methoddistance
:class Distance(object): ... def distance(self, state_a, state_b): ...
But this is not written under
ReceptorDistance
. BecauseReceptorDistance
inherits fromDistance
it also inherits thedistance
method. So while it is not written underReceptorDistance
it still has access to it.We notice that
Distance
also defines the methodimage_distance
:class Distance(object): ... def image_distance(self, image_a, image_b): raise NotImplementedError
But this will always raise the
NotImplementedError
exception, which makes it an unusable function. This is because the method is defined merely as an example of what animage_distance
method should look like; it is an interface definition.Because only the
image_distance
interface is defined and not its implementation the wholeDistance
class is labelled abstract. Despite it having some functioning methods likedistance
.The
ReceptorDistance
class customizesDistance
in a couple ways. It re-implements the__init__
andimage
methods (overriding), adds the_unaligned_image
method, and inherits the abstractimage_distance
. So it has added some valuable methods but is still abstract becauseimage_distance
still raises theNotImplementedError
.Concrete sub-classes of
ReceptorDistance
are defined byUnbindingDistance
andRebindingDistance
.class UnbindingDistance(ReceptorDistance): def image_distance(self, image_a, image_b): lig_rmsd = calc_rmsd(image_a, image_b, idxs=self._image_lig_idxs) return lig_rmsd class RebindingDistance(ReceptorDistance): def image_distance(self, image_a, image_b): state_a_rmsd = calc_rmsd(self.ref_image, image_a, idxs=self._image_lig_idxs) state_b_rmsd = calc_rmsd(self.ref_image, image_b, idxs=self._image_lig_idxs) d = abs(1./state_a_rmsd - 1./state_b_rmsd) return d
In both of these only the
image_distance
method is implemented.Whats important to notice is that both
UnbindingDistance
andRebindingDistance
have the capabilities to run thedistance
,image
, etc. methods that were defined in their common superclasses.So not only is the amount of code written for these classes small and focused on the task at hand (calculating the distance between two walker images), but if there are any bugs in the shared code, say in
_unaligned_image
, then when it is fixed they both will be fixed. They both will also break when something in a superclass breaks, but this can be seen as a good thing because bugs will be found faster.So when you import an abstract base class to use as the foundation to build other classes you are importing a framework from a library.
For example, when we imported
Distance
above for theReceptorDistance
we wanted to build something that looks, talks, and quacks the same as aDistance
class but that adds some new and interesting functionality to it. The methods that are inherited may be expected by other components in a framework or they may be only for internal use only. In any case, inheritance is easy, easily overriden, and can make interfacing more seamless.So in this sense the
wepy
project contains not only the mainsim_manager
framework but a number of sub-frameworks that aid in constructing the main components.
What kind of wepy user are you?¶
Understanding what you want to accomplish with wepy can help you understand which parts to pay attention to and which to ignore.
I’ve outlined some possible goals a user might have with wepy in order of least to most expertise needed:
Recreate results found in a paper published using wepy.
Evaluate the utility of WE to enhance the sampling of my equilibrium OpenMM MD simulations of biomolecules on a small scale.
Run MD-WE simulations on a large scale.
Run a more complex MD simulation involving non-OpenMM dynamics/sampling engines, non-equilibrium simulations, or research, development, and prototyping of novel resampling algorithms.
Change, fix, or contribute a major feature to
wepy
itself.
In order:
Recreate results found in a paper published using wepy
If you are looking at wepy for the first time and just want to see what WE is all about and maybe play with the output to try and understand the kind of data that is produced, you should start with an example that recreates a published result. For that we provide examples and tutorials for you to run and perform some standard analysis with. Getting hands-on experience with the resulting data structures (walker resampling family trees etc.) is a great way to understand WE as it is quite different from normal, linear MD simulations.
Evaluating WE for your system
If you are interested in a WE algorithm (WExplore, REVO, etc.) because you read a paper and thought it could be applied to your system of interest you will want to run it to evaluate if it looks promising. For this you will want to follow a tutorial to get your system set up. For now
wepy
only comes with built-in support for OpenMM MD simulations so the first step is to follow the Openmm documentation to set up an MD simulation. OpenMM has support for most force fields. This is easily the most difficult part of the process. Once you have working MD simulations you will only need create a distance metric that characterizes the type of behavior you want to enhance in your simulations, if there is not already one available in a wepy or contributed library .Running wepy on a large scale
If you have been succesful at running
wepy
but find yourself:overloaded with managing too many simulation results
an excess of copy-pasted and tweaked scripts with increasingly complex filenames or directory structures
want to run multiple simulations that are continuations of each other
find you jobs failing and losing all your progress
You will probably want to start using some more advanced orchestration features of wepy and data aggregation methods in the
WepyHDF5
.Advanced or custom simulation requirements
If you need to:
run simulations with another kind of dynamics engine
implement boundary conditions for non-equilibrium simulations
implement or prototype a new resampler
implement a new reporter
implement a new work mapper for distributed or parallel computing
All of this can (ideally) be done without having modify the core
wepy
code base.First check if there is a similar contributed project that you could use, or contribute to yourself. Otherwise you are free to implement your component however you please; as a standalone importable module or directly in your run script.
If you think think the feature is general-purpose enough to request the devs to implement it you can make a feature request on the issue tracker.
If you need help implementing the feature, read on, or contact the devs for some advice.
If you think that there is something missing in the core library that is necessary for implementing the feature you can make a feature request, although we may end up just pointing you to an existing mechanism.
If you want other people to know about your creation we can add it to the contributed packages list if it is a proper module. If it is a bit rougher but still useful we can add it to the developers resources page.
Contribute to wepy itself
As mentioned in the last section if you need to have some changes made to core
wepy
to implement your new component you can make a feature request or you can submit the merge/pull request yourself!We are also open to the eventual inclusion of popular and mature contributed modules to the wepy core library if you want to fold in the maintenance of those modules to core.
Overview of frameworks¶
Simulation manager¶
The simulation manager framework can be configured with the following components:
initial walkers: the initial walkers (weights and states) to start a simulation with, can be from a single starting point or the final walkers from a previous simulation.
runner: the dynamics (or sampling) engine, which acts on the initial walkers
boundary conditions: specify walker modifications (warping) based on rules which is outside of the runner dynamics
resampler: performs the resampling through cloning and merging
reporters: generate data as side effects based on the behavior of the other components.
work mappers: the mechanism by which the work of the runner is achieved, i.e. achieving parallelism.
The simulation manager can also be made to run simulations by different methods which can be seen in the API documentation.
Here we discuss how the simulation manager actually achieves the work of running a simulation and how the components are actually used.
In your simulation script you will configure a Manager
object which
contains the components for running the simulation, by constructing a
Manager
object.
from wepy.sim_manager import Manager
from my_wepy_components import *
sim_manager = Manager(...)
Once the Manager
has been constructed we run a simulation by
repeatedly calling the run_cycle
method. Before doing this though we
must initialize the contexts for a single “run” of a simulation. This is
achieved through the init
method, which triggers the components
which have runtime setup routines to do that. This is primarily for
reporters to open file handles and initialize run data and for the work
mapper to start worker processes.
Once, the run has been initialized we can call run_cycle
how we
like:
# for the first cycle we use the initial walkers
start_walkers = sim_manager.init_walkers
n_steps = 1000
sim_manager.init()
for i in range (10):
# run a full cycle
end_walkers, components = sim_manager.run_cycle(start_walkers, n_steps, i)
start_walkers = end_walkers
sim_manager.cleanup()
In this example we run 10 cycles of 1000 steps each. We also call the
cleanup
method which allows components to gracefully teardown, such
as closing files, flushing buffers, and stopping processes.
There are a couple of builtin methods to do this for you, but its worth
showing that the run_cycle
method is where the real magic happens.
For example:
n_cycles = 10
steps = [1000 for i in range(n_cycles)]
walkers, components = sim_manager.run_simulation(n_cycles, steps)
Achieves the same as the example above.
The walkers
output of the run_cycle
is what you expect it to be.
The components
output is a collection of the various components that
may have been mutated as well during the simulation.
For example resamplers like WExplore are history dependent and stores copies of walker states (as compressed images) in a tree of regions which is stored in the resampler object.
These components are modified in place when called in run_cycle
(this is not a purely functional process as their is no input of
components to run_cycle
) but we return a reference of them each
cycle so you don’t need to introspect the Manager
object.
Utilizing the components at this level is beyond the scope here and is primarily for supporting orchestration facilities.
Another example is running timed simulations:
end_walkers, _ = sim_manager.run_simulation_by_time(3600, 1000)
This runs simulations for roughly one hour with 1000 steps per cycle. Currently, this is implemented by checking the total runtime at the beginning of a cycle and if the runtime has exceeded this time then the run loop is exited. If you use this option understand that you will need to give yourself enough time over this time to run 1 whole cycle (if you have only 1 second left on the clock the cycle will run and no guessing is done) plus the teardown routines.
The run_cycle
method has several steps and it is very important to
understand the order in which the components are executed to be able to
reason about your simulation results, even if you are not implementing
any components.
Secondarily, understanding which methods of which components are called and with what arguments is the de facto interface definition for those components.
It is also helpful to know what your options are for storing state during a simulation.
For example, should we write data out with a reporter or save it in the resampler object? Should a piece of state be carried in the runner, boundary condition (BC), resampler, or sim manager?
The execution of run_cycle
is as follows:
runner.pre_cycle(walkers, n_segment_steps, cycle_idx)
run_segment(walkers, n_segment_steps)
->work_mapper.map(runner.run_segment)
runner.post_cycle()
boundary_conditions.warp_walkers(walkers, cycle_idx)
(if present)resampler.resample(walkers)
reporter.report(**report)
for all reporters
At a high level there are two categories of components: apparatus and configuration.
The runner, BC, and resampler are part of the apparatus. The work mapper and reporter are part of the configuration.
The apparatus represents state that is meaningful in relation to the content of the simulation (e.g. accumulated region definitions in WExplore). The configuration is only related to how the simulation is run in a particular situation.
For example, if you want to restart a simulation and do another run startng at the end another, it is not enough to only copy over the walkers at the end of the simulation (unless your components are stateless). If you have defined a thousand regions with WExplore then you need to have that information at the start of a run.
The configuration only deals with the reporters and work mappers; and because neither of these can effect the actual content there is no need to keep a copy of them at the end of a simulation in order to be able to restart it. In fact, for reporters it is likely that you will want to avoid this since if paths are the same then you could potentially overwrite data.
Separating apparatus and configuration allows for the snapshotting of simulation state separate from details about how the simulation was actually run. For instance lets say you run one segment of a simulation on a node with 4 GPUs and then some time later you want to continue that run, but you only have access to a node with 2 GPUs, then you only need to reparametrize the configuration to handle that. Another use case is that you can add or remove reporters between runs without effecting the apparatus.
These topics are discussed in more detail in the documentation on orchestration since it uses these concepts for actually producing artifacts for snapshots and configurations. The distinction, however, is still useful here because we clearly see which components effect simulations.
Lets start with the apparatus components since without these you won’t be needing the configuration.
Walkers and WalkerStates
The topic of what a
Walker
object is, is very simple. It is simply a container which holds a state and a weight. The weight is a simple float value, which is assumed to be normalized with the rest of the weights of walkers in an ensemble (a simple list container).The implementation is very simple:
class Walker(object): def __init__(self, state, weight): self.state = state self.weight = weight
You can see that there is really just those two attributes.
The state part of the walker however is a bit trickier to define. This partially stems from the fact that representation of simulation state in various dynamics engine is wildly different and impossible for the simulation manager itself to handle all the variants.
Furthermore, the number of possible applications that require distinct kinds of states is not possible to specify up front in any case. For instance molecular dynamics is fairly uniform in that you typically only have to worry about atomic positions and velocities in the state and cubic box vectors. However, modern enhanced simulations use a wide variety of techniques that add all kinds of additional state such as alchemical lambda variables.
This required the definition of a common general purpose and extensible specification of how to represent them for use in
wepy
.For this simple key-value store semantics was chosen, where keys are strings. With the addition of one method
dict()
which transforms the object into pure python dictionary, and a constructor which takes values as key-word arguments. Anything that provides python like dictionary syntax and thedict()
method can be considered to implement theWalkerState
interface and will be called such even if it doesn’t directly inherit from the actualWalkerState
class.The implementation is very simple:
class WalkerState(object): def __init__(self, **kwargs): self._data = kwargs def __getitem__(self, key): return self._data[key] def dict(self): """Return all key-value pairs as a dictionary.""" return self._data
Where the
__getitem__
magic method implements the behavior for the square bracket access:state = WalkerState(thing='hello', other_thing=np.array([0,1,2,3])) arr = state['other_thing'] state_dict = state.dict()
You can always just dump your state from whatever simulation engine into a
WalkerState
and be on your merry way:state_dict = {'positions' : ..., 'velocities' : ...} state = WalkerState(**state_dict)
The sim manager takes care of copying walkers when it needs to copy them so you don’t have to worry about returning copies or references to internal data such as the
_data
attribute in theWalkerState
class.This interface also supports wrapping state objects from other engines. This may just be a constructor with a positional argument for one of these states:
class MDEngineWalkerState(): def __init__(self, md_state, **kwargs): self._state = md_state self._data = kwargs def __getitem__(self, key): if key == 'positions': return self._state.getPositions() else: return self._data[key]
This approach requires no copying of the original state and makes the state actually accessible and retrievable is some other tool or library specifically needs that class.
From the simulation managers point of view this is all that matters for it to work properly. However, all the other components will expect certain properties to be present. For example, the
WepyHDF5
reporter will expect there to be a ‘positions’ attribute as in the above example.Probably you should have walker states specialize in terms of the class definition for the runner they are being used by rather than the “schema” of which attributes it will contain. This allows you to couple the runner and the state so that you can get some performance optimizations by carrying around the state without having to transform it every time you go between them. For instance, in the
OpenMMRunner
we get the state from theOpenMMState
roughly by:sim = openmm.Simulation(...) sim.context.setState(walker.sim_state)
Instead of:
sim = openmm.Simulation(...) sim.context.setPositions(walker['positions']) sim.context.setVelocities(walker['velocities']) sim.context.setBoxVectors(walker['box_vectors']) ...
Runners
The Runner is the component that actually runs the sampling that the weighted ensemble algorithm will be enhancing, via resampling.
As such this can be any type of stochastic dynamics or sampling algorithm such as Monte Carlo. Dynamics should be stochastic because trajectories need to be able to diverge following cloning events.
That is if you take deterministic dynamics and make a copy of one of those simulations, you will perform the same exact work in duplicate of which there is no point to do in parallell. Furthermore, it will be impossible to enhance sampling from resampling because we need to be able to capitalize on differences that arise between those cloned simulations.
A Runner in wepy is typically a wrapper around some other dynamics engine as they can be of considerable complexity and highly domain specific.
The principle method a Runner must implement is
run_segment
which takes a walker, a definition of how long to run that segment, called thesegment_length
, and possibly a set of key-value based arguments.This function should then return a single walker which has had it’s state updated according to those input parameters.
The principle runner in
wepy
is the OpenMM runner which essentially just does some initialization and then calls:simulation.step(segment_length)
to run the simulation segment.
Within the
Manager.run_cycle
method there is a call to a simulation manager method, also called,run_segment
. This,Manager.run_segment
method in turns calls therunner.run_segment
once for each walker in the current ensemble. This is simply the common semantics ofmap
function which takes a single function and applies it to multiple pieces of data.The behavior of how this is achieved is encapsulated within the
work_mapper
object. Forwepy
awork_mapper
must simply have a method calledmap
that has the same function signature as the python built-inmap
, except that the function to be called is an attribute of the object.Basically, the work mapper is called as such:
new_walkers = list(self.work_mapper.map(walkers, (segment_length for i in range(num_walkers)), )
See the section on work mappers for more details on implementing them.
Two additional methods are also called for the runner in order to get a single call to the runner per cycle which are:
pre_cycle
andpost_cycle
. Call these if you have some state in the runner that needs to be updated outside of therun_segment
calls.Boundary Conditions
Boundary conditions (often abbreviated as BC) are extra conditions that are placed in the simulation that allow for executing extra rules about the transformation of walker states.
This is very useful for doing non-equilibrium simulations where once walkers have reached some predetermined condition or region the simulation is restarted in some original location. This allows for the calculation of rates from simulations.
BCs are strictly optional and conceptually could be implemented within the Runner itself. However, having them separate makes them more composable with different simulations. Furthermore, BCs are useful for reporting information on walkers as a simulation progresses that are not computed in the runner engine.
BCs are applied after runner steps are completed and is called basically as so:
warped_walkers, warp_data, bc_data, progress_data = \ self.boundary_conditions.warp_walkers(walkers, cycle_idx)
The name
warp_walkers
is meant to evoke the sense in which walkers are getting transformed according to something outside of the normal laws of physics the simulations implement. A typical example is non-equilibrium unbinding simulations (see theUnBindingBC
class) where walkers start with a state where a small ligand molecule is bound to a binding site in a protein and sampling proceeds until the molecule has left the binding site and moved away from the protein. At that point the boundary conditions recognize this and “warp” the walker so that it’s state is replaced with the original starting state.These events are recorded in the return
warp_data
object. Which is the first example of a record data type. So lets take a moment to describe those.In addition to the walkers there are a number of different pieces of data that are produced by the BCs and resampler components. These are documented fully in the developer’s architecture guide in terms of their formats. But suffice to say now that they all have a key-value or record oriented data definition that makes it much more convienent to implement storage layers, since they can all be essentially treated the same way except for their names. These records are vitally important to interpreting
wepy
simulation data because walker trajectories are no longer straightforward linear simulations, and may have various warping and merging events that destroy old states.The
warp_data
warping records are especially important because they tell you where and when simulations were respawned in non-equilibrium simulations which tells you how to reconstruct contiguous trajectories as well as how to calculate rates.One other possibility for warp records is that they do not actually “warp” the walker in the sense that they may mutate walker state attributes which are orthogonal to the dynamics engine. This can be used to implement “colored” dynamics where when a walker reaches some boundary an enumerated value (called the color) is changed to indicate the last boundary it crossed was. This color has no effect on MD propogation but is useful for calculating kinetics of the process while running what are essentially equilibrium simulations. Warping events that effect the same variables as the dynamics engine are often called “discontinuities”.
An optional interface a
BoundaryConditions
class can implement to determine whether a record indicates a discontinuity iswarping_discontinuity(warp_record)
which returns a boolean. This is used by some of the analysis routines to automatically obtain continuous trajectories or to show in tree graphs where exactly warping events occured.The other two record types are fairly accessory: BC records and progress records. The BC records are meant to allow for reporting on the changes in state of the boundary conditions as a simulation progresses. I am not aware of any practical use of this, but one could imagine changing the value of a cutoff as a function of simulation time.
The progress records are not critical to the functioning of the simulation but are a way to not waste values which are computed when checking for boundary conditions. Unlike warping and BC records progress records are produced every cycle once for every walker. For example, in
UnbindingBC
the minimum distance of a ligand to the protein is calculated every cycle to check whether any ligand has unbound. Instead of dropping these numbers on the floor we pass them through with the progress data and any reporter that is interested in them can report them.One can also imagine calculating values which are not necessary for making a decision to warp or not here, but we would caution that from a performance perspective that this is not wise since the
warp_walkers
call is blocking the progression of the simulation and creating overhead. The real bottleneck in terms of time is usually the dynamics (especially in the case of MD) and an implementer of any Runner, BC, or Resampler component should aim to make them efficient so as to be able to run as much dynamics as possible. Of course there is a tradeoff here and should be approached from the perspective of improving the performance of the metric you are looking for rather then raw MD throughput. Presumably, the reason you are using WE is that brute-force sampling is not fast enough to begin with.Calculating observables on WE data is very convient using the analysis tools in
wepy
. If you do want to calculate quantities on-the-fly for some reason this should be done in a reporter. This might want to be done if you aren’t storing the entire state on disk because it is too large but you still want to monitor some value that is a funtion of it. E.g. computing the average kinetic energy temperature from the velocities. Typically you don’t store every frame of velocities because it uses too much disk space, but you could compute the temperature in a reporter and just write that single number. Furthermore, while it currently is not the case now, it is possible to completely move reporting out of the critical path of the simulation so they do not block. This is possible because reporting is a pure side-effect of the simulation, but just requires a more complex concurrency architecture and fault tolerance.Resamplers
Resamplers are the heart and soul of
wepy
and are the loci of the actually complex and interesting algorithms.I will eschew a description of what purpose a resampler serves at this point as this is better described in a somewhat formal context. For more information see the resources in the introduction.
In terms of what a resampler component looks like and does can be quite distinct from some of the theoretical formulations. This freedom is the key to the flexibility of using
wepy
for prototyping new resampling algorithms.Minimally all a resampler must do is implement the
resample
method, e.g. the trivialNoResampler
is implemented like this:from wepy.resampling.resamplers.resampler import Resampler class NoResampler(Resampler) def resample(self, walkers, **kwargs): ... resampling_data = self._init_walker_actions(len(walkers)) ... return walkers, resampling_data, [{}]
where we just return the original walkers we were given. The additional return values are records related to the resampling records which report on how the cloning and merging took place (
resampling_data
) and the resampler records which report on state changes of the resampler itself.The resampling records here are just the default ones produced by the
_init_walker_actions
and there is not state for this resampler so we just produe a single empty record for that.The more important record types are the resampling records as they are what lets us reconstruct a family tree of walkers from cloning and merging. The resampler records on the other hand are just for monitoring of the resampler during the simulation and very specific to each resampler. The discussion of the field types and format of the resampling records is a bit involved and largely unnecessary to understand unless you are implementing a very specialized resampler.
If you are just using a resampler off of the shelf just know that these are saved in the
WepyHDF5
format and the various analysis tools will take care of all the mundane details of utilizing them.Reporters
Reporters are the primary mechanism for saving data about simulations.
As shown above you could just run a cycle on your own and introspect the objects and get the information you want. However, this would be specific to the implementation of each component. All reporters that are called from
run_cycle
can expect the same structure of data no matter the component that produced them.Besides the
init()
andcleanup()
methods each reporter must implement thereport
method which takes some key-word arguments. The key-value pairs that the manager passes to the reporters is the same, but each reporter chooses which ones it cares about.This dictionary collectively is called the report. Currently, it has these keys in it:
cycle_idx
n_segment_steps
new_walkers
warp_data
bc_data
progress_data
resampling_data
resampler_data
resampled_walkers
worker_segment_times
cycle_runner_time
cycle_bc_time
cycle_resampling_time
The ‘time’ fields are various timings that are made of the components for some performance reporting, and the rest have been discussed already.
This listing might change more frequently so if you are unsure check the source code.
Also when writing a
report
method always accept extra kwargs to handle new ones, e.g.:from wepy.reporter.reporter import Reporter class MyReporter(Reporter): def report(self, cycle_idx=None, n_segment_steps=None, cycle_resampling_time=None, **kwargs): ...
In addition to the ABC
Reporter
class theFileReporter
andProgressiveFileReporter
are very useful to inherit from as they handle some file path and file mode logic, the latter updates modes to allow for repeated writes to the same file for each cycle of a simulation.For example the
DashboardReporter
need only handle parameters specific to its own function and all the handling of filenames is done by a call to the superclass constructor:from wepy.reporter.reporter import ProgressiveFileReporter class DashboardReporter(ProgressiveFileReporter): def __init__(self, step_time=None, bc_cutoff_distance=None, **kwargs ): # handle filename(s) and mode(s) in the superclass super().__init__(**kwargs) # Dashboard logic ... self.step_time = step_time self.bc_cutoff_distance = bc_cutoff_distance
Work Mappers
The final component is the work mapper. As mentioned in the section on runners this is what actually achieves task parallelism over the walker’s dynamics segments.
The simplest and default mapper is the simple
Mapper
class. Basically, it works by first constructing it with the function you want to map (in the case of the simulation manager it automatically does this with therunner.run_segment
function) and then using a simple for-loop to sequentially compute the segments:class Mapper(object): def init(self, segment_func): self._func = segment_func def map(self, *args): args = [list(arg) for arg in args] results = [] for arg_idx in range(len(args[0])): result = self._func(*[arg[arg_idx] for arg in args]) results.append(result) return results
This is okay for test systems but for real simulations that take a long time we will need to use some sort of parallelism.
Currently, we provide a work mapper that uses a queue to put tasks on (the
run_cycle
plus the arguments) and worker processes fetch tasks off of the queue to perform whenever they are able and done with the next task. This is theWorkerMapper
class, which startsWorker
object processes using the python multiprocessing library.Because, we are using OS processes instead of “threads” it is truly parallel when using the CPython runtime, which uses the infamous Global Interpreter Lock (GIL). The GIL effectively makes it so that a single python process can never be multi-threaded or parallel, but will still let you program with thread semantics and maybe make you believe you are multi-threaded. It may be possible to use another python runtime like PyPy to get around this but this has not been tested.
Another note when using OS threads is that you will need to make sure you are creating processes in a way which is compatible with the dynamics engine runtime. For example, in linux systems you can make processes with a cheap
fork
syscall, or the more robust but more expensivespawn
syscall. When creating processes when a CUDA context has been defined, you must use thespawn
option (at least with OpenMM). This can be set in your run script like so:import multiprocessing as mp # set the process creation method mp.set_start_method('spawn') # useful tip for logging in multiprocessing: mp.log_to_stderr(logging.WARNING)
For different environments and runners you can use different worker types for customization if necessary. This is one case where inheritance is very important since inheriting from the
Process
base class is very important.For example, the OpenMM module defines two workers for either CPU (
OpenMMCPUWorker
) or GPU based workers (OpenMMGPUWorker
). The former allows you to specify the number of threads to use per CPU and the GPU worker just specifies which GPU device index to use.In the simulation managers call to
init
a worker process is created for each device that is present (CPU or GPU) and two queues are initialized, the work queue and the results queue. The worker processes then begin polling the queue for items. At the beginning of a cycle oneTask
object per walker is placed on the work queue and immediately the workers begin popping of tasks. Each worker then computes the walker-task and places the result onto the result queue then polls the queue again for new tasks, until they reach the end of the tasks. At the end of the cycle the main simulation manager process pops off the results from the result queue and structures them as walker states. At the end of a simulation a special “poison pill” is placed on the work queue for each worker which is a signal to shut down.
Resampling Framework¶
See the sub-package documentation: wepy/resampling/__init__.py
Simulation Data Persistence (WepyHDF5) and Analysis¶
We have discussed the components that are necessary to run a simulation using the simulation manager and the interfaces these components must implement.
The other half of the equation is to store the data associated with the
simulation and be able to analyze and transform that data. This is where
the WepyHDF5
format comes into play.
The module wepy.hdf5
has a class WepyHDF5
which defines an
interface for creating, accessing, and adding data to a single HDF5
format file which can be used for any wepy
simulation.
If you are not familiar with HDF5, it is a general purpose binary format
that is used for large amounts of structured numerical data. While
in-depth knowledge of how HDF5 works is not necessary to use the
WepyHDF5
class, it definitely makes sense to at least get an
overview of the performance and memory behaviors. For this I suggest
just going through the documentation for the
h5py library and the book “Python
and HDF5: Unlocking Scientific Data” by Andrew
Collette
who is also the original h5py
author.
WepyHDF5
uses h5py
under the hood and so if there is ever a
functionality that one of it’s methods doesn’t provide you can always
drop down and use it.
The main features of HDF5 are the existence are groups and datasets,
which are roughly equivalent to directories and files in common
hierarchical filesystems. The difference between datasets and files
being that HDF5 datasets must have explicit data shapes and types
(integers, floats, stings, etc.). Groups contain other groups and
datasets, and datasets make up the leaves of the tree. In h5py
groups and datasets also have string paths like files in order to access
them.
The core HDF5 library simply gives these building blocks to the
structure, while the WepyHDF5
class specifies and implements a
“schema” using these building blocks. So a WepyHDF5
is just any file
that has the same structure as one that would be constructed or read by
the WepyHDF5
class. An more in depth (but not formal) description of
this “schema” is given in the module API documentation.
Briefly though, the file is primarily organized by the concept of a run.
Each “run” contains all the data and metadata for a single wepy
simulation, that is after the call to Manager.init()
every
Manager.run_cycle
writes to the same run until we call
Manager.cleanup()
. Or a call to Manager.run_simulation
etc.
A run contains essentially two types of datasets: trajectories and records. Trajectories (including the initial walkers) are the results of the sampling step produced by the runner. A single trajectory is a group containing any number of “fields” which are just a single attribute of a frame of a trajectory. This typically includes the positions, box vectors, velocities (if given) and the temperature, volume, etc. for a typical molecular dynamics simulation. The records are the data produced by the various components like the resampler and boundary conditions (BCs). The meaning of these different record groups is discussed in the documentation for these components. Their storage in the HDF5 is the same however, and works again using any number of fields like the trajectories.
So the basic outline of an HDF5 file is:
runs
run: 0
trajectories
traj: 0
field:
positions
field:
box_vectors
…
resampling records
field:
decision
field:
target_idxs
…
warping records
…
run: 1
…
The primary way in which this file is created is by using the
WepyHDF5Reporter
. If there is only one reporter you should ever use
it is this one! Please see the tutorials and documentation for how to
fully make use of this reporter.
Once you have generated a WepyHDF5
file from a run (or many runs)
you will want to analyze the data. For this the wepy.analysis
sub-package is available along with some basic functions in the
WepyHDF5
API. The analysis package is intended to be limited to
functionality which works directly on the WepyHDF5
file or from a
set of records from a component. This is in order to not bloat wepy
with all manner of domain specific analysis tools that get overly
integrated to our own peculiar data structures. These tools provide a
way to transform a subset of your data into other formats like numpy,
pandas, networkx, and mdtraj. They also provide utilities for giving
different views onto the data so that excessive copying of the
trajectory data is not needed.
The most useful method is the WepyHDF5.compute_observable
method,
which you pass a function to compute some sort of value over all of the
frames of your trajectories. These computed values can either be
returned to be used in some other context or written directly to the
file as a trajectory field. Writing it to the file has the advantage
that later transformation views on the file will always have direct
access to these “observables” fields without having to deal with complex
indexing schemes to use with external data.
The three primary “views” provided by the analysis modules are in the
contig_tree
, network
, and parents
modules.
The contig_tree
module introduces the notion of a contig (a term
borrowed from the genomic assembly community, but totally distinct
here). A contig in this sense is simply the concatenation of multiple
wepy
runs to form a single /contig/uous whole. The contig tree
is a more general expression of this and represents the actual tree (or
forest) of runs that are started from each other. For instance you could
do one run and then restart it in two distinct simulations, in which
case you now have a tree. This tree-like structure makes it difficult to
more difficult to peform sliding window calculations and other things
and so provides this special functionality.
But why add this extra layer of abstraction over top of runs? I don’t have bifurcating simulations so couldn’t we just keep concatenating frames to a single run and just analyze that?
The answer is yes I suppose you could do that if you want. However, this
use-case is not explicitly provided for in wepy
because we see the
unit of “run” as both the data produced and the time, place, and
machine(s) that it was computed on. A run should be produced by the
execution of a single script or job on a timesharing system like SLURM
or Torque. This allows for provenance of the units of execution,
otherwise you would need to keep an index of when which cycles of the
run were executed from which jobs. Furthermore, it supports immutability
of already completed work. Instead of modifying the chunks of data
inside the runs and potentially corrupting them, just keep adding new
runs which don’t touch the other ones. Our typical workflow is to
produce a single run in a single file per job, and to never fiddle with
that file until it has been properly aggregated and archived. There are
tools in wepy
that aid in linking between files and aggregating
files so that a single WepyHDF5
object has access to data to many
other files.
The contig and contig tree are the conceptually complete unit of a “simulation”.
See the tutorials on how to make use of them.
The next data “view” is the wepy.analysis.network.MacroStateNetwork
.
Outside of resampling type enhanced sampling algorithms simulations are
very linear and so you always had one canonical way to go through the
data that makes sense. Of course when trajectories became very long the
practicality of this is challenged and so various techniques for
reducing the dimensions are used such as clustering and things like
Markov State Models (MSMs). These representations are what we call
Conformation State Networks (CSNs) or Macrostate Networks. They are
essentially networks in which the nodes are some sort of “macro-state”
that represents a collection of “micro-states”, and the edges represent
the observed transitions between the macrostates as determined from the
transitions between microstates seen from dynamics. The various names
indicate certain mathematical properties of the values of the edges and
nodes including rates and probabilities, but the structure is the same.
The MacroStateNetwork
class wraps a WepyHDF5
object and holds a
mapping of trajectory frame indices for each “macro-state” in a network.
This mapping can be automatically made by providing a field name from
the trajectories and each unique value will become it’s own macrostate.
Probably this field should be some sort of enumerated type like an
integer or string which can be calculated using the
compute_observable
method. Typically, this will be the result of
some clustering or MSM algorithm (supported by sliding window methods of
the ContigTree
).
The network is implemented as a networkx
directed graph and any of
the multitude of network and graph theory algorithms there can be
leveraged for analyzing your state network. Furthermore, using the
MacroStateNetwork
allows very easy introspection of the microstates
from any single macrostate. Finally, using the MacroStateNetwork
it
is trivial to produce transition probability matrices (edge matrices)
which can be used to calculate committor probabilities etc. from a
network. See the wepy.analysis.transitions
module for relevant
functions.
The network representation of WE data is particularly important because there really is no canonical ordering of frames within the walker cloning & merging family tree and so the natural representation is the state network. However, if you run a simulation with boundary conditions there is one meaningful linear representation which is the trajectory of walker that has crossed a boundary.
We call these linear representations of trajectory data from the entire contig traces. In order to be able to obtain traces we first have to use the resampling records to determine which walkers give rise to walkers in later cycles. Secondarily, we use the wapring records to determine if there were any discontinuous warping events that occur along these traces.
The primary object that abstracts the walker family tree is the
ParentForest
class (in wepy.analysis.parents
) and the Contig
class. See the tutorials for a complete example of how to use this. Some
of the more useful functions here are the
Contig.exit_point_trajectories
which generates a full lineage of
each walker that crosses a boundary. The ParentForest
provides a
networkx
directed graph of the tree which makes it amenable to the
algorithms available there. The parents.ancestors
gives a complete
lineage from any walker.
In addition to these basic views onto the underlying HDF5 dataset there are also a few analysis routines for calculating rates and free energy profiles which are a very common use case for simulations.
First using boundary conditions is often for the purpose of calculating
rates. The wepy.analysis.rates
module covers this.
The wepy.analysis.profiles
module covers generating free-energy
profiles and probability distributions for both the entire simulation
and as a series so you can easily see the convergence of a simulation
with relation to a given projection.
JSON Topology and Converting to Other File Formats¶
wepy
is not a molecular data file reader/writer of which there are
great many of. This topic can cause considerable headaches if not done
properly. For writing to file formats such as PDB, DCD, XTC, and all the
rest we rely on the mdtraj
library to satisfy this need. It probably
wouldn’t be too difficult to make a connector to another library if you
really need it so don’t think this is the only way.
The WepyHDF5
object, and analysis wrappers thereof, provide a number
of methods for generating mdtraj.Trajectory
and mdtraj.Topology
objects from stored data in various ways. See the API reference for a
full listing of options.
Read the mdtraj
documentation to get all of the options, but know
that it is as really simple as:
traj = wepy_hdf5.to_mdtraj(...)
traj.save_pdb('mymolecule.pdb')
It is worth noting that the JSON topology format that is used in
WepyHDF5
was actually taken from the HDF5 file format defined and
implemented in mdtraj
. The actual function for converting
mdtraj.Topology
objects to JSON and back again was a bit hidden so
we extracted it and provide them as utilities in wepy.util.mdtraj
:
json_to_mdtraj_topology
and mdtraj_to_json_topology
.
Another useful trick is that mdtraj
also has a converter to the
OpenMM topology object: mdtraj.Topology.to_openmm
and
mdtraj.Topology.from_openmm
. This comes in handy for serializing
your topologies to JSON after you create them in OpenMM.
There are several shortcomings in this JSON topology format in the opinions of this author, however after surveying all available topology format we have found it to be the most unambiguous and “programmable” format and so rely on it.
Of course different applications will different types “topologies”, no topologies, or altogether different system specifications and so the HDF5 format should not be seen as being tied to this format for molecular systems. It primarily provides a good substrate for generating other files which are needed by other programs.
That said there are no extensive libraries supporting it. However, there really isn’t any need since the parser is in the python standard library:
import json
top = json.loads(json_top_str)
where the top
object is just native python types making it easy to
do basic selections of atoms based on their atom or residue names and
types or the bond connectivity. If you want to do more complex things
like chemoinformatics or structural informatics you will want to cast
this to a purpose built representation. There are JSON parsers in just
about every language and so it is pretty portable in that sense.
We do provide a few useful functions that make working with it a tad
easier which are contained in the wepy.util.json_top
module. The
highlights there are functions for generating pandas.DataFrame
tables for either: atoms, residues, or chains; e.g.
json_top_atom_df
. The other being a function for getting a new
topology from a subset of atoms from the original json_top_subset
.
This is extremely useful for generating files for subsets of your entire
MD system and excluding things like waters.