User's Guide
============

Running Examples & Tutorials
----------------------------

Organization & Conventions
~~~~~~~~~~~~~~~~~~~~~~~~~~

Examples are organized in the subdirectories in this folder. To allow
for effective testing of the examples they follow a specific structure.
If any example does not follow the rules in this document it can (and
should, if you are willing) be filed as a bug against this repository.

Each example has a ``README.org`` document which gives some explanation
of the example and some tips on executing it. This should explain the
inputs, outputs, and any arguments, but not the "calling convention" of
them.

All scripts are expected to be run from the example directory, e.g. for
an example called ``myexample`` you would run its script like:

.. code:: bash

    cd info/examples/myexample
    python ./source/myexample.py

All examples have a similar structure with these directories before
execution:

``source``
    Where the source code files are that will run the examples.

``input``
    Any additional input files that will be loaded by the source script
    files.

``env``
    A specification of the virtual environment in which the examples
    are to be run.

Each example will also have a ``tasks.py`` file which gives some useful
helper commands (i.e. scripts) that can be run by installing the python
package ``invoke``:

.. code:: bash

    pip install invoke

Once it's installed you can see the available commands, which we call
"targets":

.. code:: bash

    inv -l

You can run some initialization by running:

.. code:: bash

    inv init

Making Environments for Running Examples
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Examples and tutorials have a bad reputation for quickly going out of
date. While we can't control whether the content of an example stays
relevant, we should make examples easy to execute far into the future.
To do this we have provided a folder with all of the necessary
specifications for running the examples, which are pinned to the exact
version numbers as when they were tested. If you can get these versions
of the software then the example should run.

The easiest way to build an environment, if you cloned the entire
repository, is to just run:

.. code:: bash

    inv env

Environments can be made either with ``conda`` or regular old ``venv``.
We prefer ``venv``, but sometimes ``conda`` is used because of
particularly complex dependencies. The ``pyversion.txt`` file indicates
which version of python the example should be run with. To get
different versions of python we recommend either ``conda`` or
``pyenv``.

The pinned specifications should be in the ``requirements.txt`` and
``env.pinned.yaml`` files for ``pip`` and ``conda`` packages
respectively. If you would like to install dependencies manually, read
the documentation for these tools on how to install dependencies with
these files. But typically it should be like this inside a ``conda``
env for the pinned versions:

.. code:: bash

    pip install -r env/requirements.txt
    conda env update --prefix env --file env/env.pinned.yaml

The abstract requirements (from which the pinned specs are compiled)
are in ``requirements.in`` and ``env.yaml``, which you can also try if
you want to get the newest versions of the software, although we don't
guarantee the example will work then:

.. code:: bash

    pip install -r env/requirements.in
    conda env update --prefix env --file env/env.yaml

Running Jupyter Notebook Tutorials
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To get started check the instructions in the user manual first.
After you have made your environment, you can install some of the
Jupyter extensions for viewing structures by running the
``post-install`` target with invoke, i.e.:

.. code:: bash

    inv env
    conda activate ./_env
    inv post-install
    jupyter-notebook

Then launch the jupyter notebook:

.. code:: bash

    jupyter-notebook

Executing Examples
~~~~~~~~~~~~~~~~~~

After you have created the virtual environment and tangled any source
files (if necessary) you should be able to run the example.

As described above, you should run scripts from the example dir and not
from the ``source`` (or ``_tangle_source``) dirs, since paths are
typically hardcoded for convenience. I.e. run scripts like:

.. code:: bash

    python source/script.py
    python _tangle_source/script.py
    ./source/script.sh

Also hardcoded is where any file outputs go, which should always be the
``_output`` directory.

Cleaning Up
~~~~~~~~~~~

You can run the targets:

.. code:: bash

    inv clean clean_env

Or just know that any directory starting with an underscore '_' is
temporary and may be safely removed.

Tangling Literate Examples
~~~~~~~~~~~~~~~~~~~~~~~~~~

In addition to the ``source`` directory, if the ``README.org`` has any
source code in it, this should be able to be "tangled" out of it into
source code files which are executable. These should end up in the
``_tangle_source`` folder. These examples are called "literate" because
the explanation of them is in the document alongside the code. There
should only be one copy of source code so that there are never any
discrepancies between them. That is, the same script (or snippets of a
script) never appears in both the ``README.org`` and the ``source``
directory.

You can "tangle" the source code by running:

.. code:: bash

    inv tangle

This expects a local, fairly modern installation of ``emacs`` to be
present, although hopefully this restriction will be removed in the
future.

Then run them just like the ``source`` scripts:

.. code:: bash

    ./_tangle_source/script.py

How to approach learning and using wepy
---------------------------------------

The ``wepy`` project is a framework and library written in python for
performing :abbr:`WE (weighted ensemble)` simulations. That means there
is no special language to learn and use other than python itself. This
also means you are able to harness the power of the entire language and
ecosystem in prototyping methods and analyzing data.

To use prebuilt components all it takes is some basic knowledge of
python. With a little knowledge of object-oriented programming (OOP)
you can utilize a collection of functionality to help you focus on
prototyping your complex algorithms rather than worrying about how to
save data etc.

The first question is what exactly you need to write to be able to run
a wepy simulation, and how to run it.

The software configuration trap
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some people may not understand (or agree with) the approach to the
design of ``wepy``, as it diverges from many packages in the field. So
we will provide a summary of the issues leading to the deliberately
made choices.

Many other tools utilize some sort of separate "configuration" file
that is used to specify in a declarative manner what the simulation is
supposed to do. Typically, these kinds of configurations are supposed
to cater to beginners that have no experience in programming.
Inevitably the features of the program start to expand and the
complexity of the configuration files grows until the developers have
accidentally created an entire programming language, which invariably
is very clumsy to program. Moreover, this language is a completely
isolated ecosystem.

``wepy`` was designed from the start to avoid this issue and embraces a
well-designed and highly popular programming language with a huge
standard library and a massive collection of highly trusted third party
libraries: Python (version 3 specifically). So while in ``wepy`` you
are still "configuring" your simulations by building components, there
is no need to limit yourself to a small insular ecosystem in a sub-par
programming language designed by a chemist when punch cards were still
cutting edge.

Many older simulation engines fall into the traps described above
simply because there were no obviously dominant high-level languages
like python at the time. The implementers of these engines were
pioneers and we admire them for the intrepid and forthright spirit that
drove them to accomplish their scientific goals given the available
infrastructure at the time. However, we try not to mistake particular
implementations for the important traditions. As a particular tradition
in scientific modelling gains popularity, as molecular dynamics has,
the requirements on implementations grow as many more diverse and
previously unthought-of use cases emerge, for the better. We endeavour
to highlight these successes as best and as honestly as we can.

As equally as we respect the scientific traditions, we should also look
to other disciplines such as computer science, which have worked just
as industriously to provide tools that enable us to actualize our
ideas. Python is the *lingua franca* of the scientific community as of
the time of writing this, and the developers feel this is for the
better.

The insular ecosystem problem is essentially solved by the use of a
general-purpose programming language such as Python. For example, a
massive collection of numerical algorithms and linear algebra can be
used in your component as easily as ``import numpy``. Furthermore,
anyone can use or incorporate wepy into their project with a simple
``import wepy``.

Furthermore, ``wepy`` is only as system dependent as python is (and
consequently C). It doesn't rely on any environment variables for its
functioning or other operating-system-specific details (excepting
performance optimizations). Everything is contained within python
files. Some basic know-how of environments is assumed to be able to
even install and configure python, but that is completely orthogonal to
wepy itself. If you wish to use some OS-specific or legacy programs,
there is the python ``subprocess`` module that allows for doing just
that, and perhaps a wrapper package ready to be imported.

To run wepy you simply write a file that runs a simulation and then run
it on the command line:

.. code:: bash

    python run_wepy.py

The use of pure python files for configuring may seem complicated to
someone that doesn't know python, but when you are running complex
simulations it is comforting to know that there is no magic involved.

If you are disappointed that wepy does not follow the "Unix"
philosophy, there is a simple explanation. The so-called "universal"
textual interface is wholly unsuited to storing any kind of molecular
simulation data (much less snapshots of simulations), and so right at
the start you are stuck with the decision of what your binary format
is.
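Returning to the ``run_wepy.py`` file above: it is just an ordinary
python script. A minimal sketch might look like the following, where
every name except ``Manager`` is a placeholder for components you would
choose or write yourself (they are all introduced later in this guide):

.. code:: python

    # run_wepy.py -- a minimal, hypothetical "configuration" script;
    # the 'my_components' module is a stand-in for your own code
    from wepy.sim_manager import Manager

    from my_components import make_walkers, make_runner, make_resampler

    init_walkers = make_walkers()

    sim_manager = Manager(init_walkers,
                          runner=make_runner(),
                          resampler=make_resampler())

    # run 10 cycles of 1000 dynamics steps each
    n_cycles = 10
    walkers, components = sim_manager.run_simulation(
        n_cycles,
        [1000 for _ in range(n_cycles)],
    )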
#. Intended audience and some suggestions

   With all of this in mind, ``wepy`` is meant to be a tool for a
   fairly expert user that is willing to write python code and
   understand some basic software patterns; that is, the primary users
   of ``wepy`` are the developers. We want to make "complex things
   possible" first and secondarily "simple things simple". As of the
   ``1.0`` release we believe that "complex things possible" has
   largely been achieved, and now the focus is on making particular
   use-cases more streamlined as well as broadening the number of
   domains addressed.

   That said, there is no reason that specific applications of wepy
   could not eventually be provided with some sort of easy-to-use
   configuration file format, command-line or web application. We warn
   anyone that wants to do this to have a very clear idea of the scope
   of this application and to keep the above comments in mind.

   Some suggestions on tools for such applications/configurations are:

   - using the orchestration sub-package's Snapshot and Configuration
     objects for setting up and serializing (storage via python
     pickles) specific simulation initial conditions for the short term
     (as updates to the code may break these objects).
   - A very simple, well-specified, and human-writable format such as
     `TOML <https://toml.io>`__ that has simple translations to all
     programming languages.
   - A simple and popular template engine such as `Jinja
     <https://jinja.palletsprojects.com>`__ for generating python
     scripts.

   Some anti-examples would be:

   - ad hoc unspecified configuration files
   - human written XML files
   - operating system specific environment variables
   - metaprogramming: such as python metaclasses

#. Snapshot and Configuration files

   The ``orchestration`` sub-package, as mentioned, does come with a
   serialization format for simulation snapshots and configurations.
   This is an advanced general purpose tool that is primarily meant for
   managing large numbers of interconnected simulations and for adding
   checkpointing capabilities for hardware that tends to crash.

   The snapshot and configuration files use python pickling to be able
   to easily serialize them, which should be used with caution. Pickle
   files should be read with the exact same version of the software
   that wrote them, otherwise complete and utter loss of data should be
   expected. It is possible that updates to the code will not affect
   readability, but this should never be assumed. So this is a very
   short term solution to storage. This sub-package will be discussed
   later as an advanced topic in depth.

Framework or library?
~~~~~~~~~~~~~~~~~~~~~

At first a pedantic distinction, but understanding which parts of wepy
are the "framework" and which parts are the "library" (and which parts
are both) should eliminate some confusion (and hopefully lead to a
sympathetic enlightenment).

It helps to understand that wepy fulfills a few roles:

- wepy provides components that can be used together
- wepy defines **interfaces** that new components may implement so that
  they can be used with existing components

Because python is dynamically typed (AKA duck typing), interfaces and
object identities (types) are not explicitly enforced. In fact the
interfaces cannot even be expressed in python syntax.
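To illustrate what duck typing means in practice, here is a tiny toy
(no wepy imports; all names are made up): the caller only cares that
the object responds to the method it calls, not what type it is.

.. code:: python

    class EnglishGreeter:
        def greet(self):
            return "hello"

    class FrenchGreeter:  # no shared base class with EnglishGreeter
        def greet(self):
            return "bonjour"

    def run(greeter):
        # analogous to how the Manager calls, e.g., resampler.resample(...)
        print(greeter.greet())

    run(EnglishGreeter())   # hello
    run(FrenchGreeter())    # bonjour

This is why any object with the right methods can stand in for a wepy
component, whether or not it inherits from a wepy class.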
Instead we rely on the following sources to determine interfaces, in
order of precedence:

- the original developers' intention
- the source code
- the source code docstrings
- the source code comments
- the written documentation
- …
- a person on the internet

The developers ultimately have the final say because there will always
be bugs in implementations and mistakes in documentation. Developers
make mistakes and are willing to fix them; if you think they have made
a mistake and there is a bug or problem in an interface, please reach
out for clarification before assuming it was not already thought of. If
you are still convinced of the issue, prepare an argument to be made
for your case for a change in an interface, and prepare for it to be
questioned and critiqued in an objective manner.

#. wepy: the framework

   ``wepy`` is a framework in that you execute it by configuring an
   "engine" that then takes your configurations and uses them on your
   behalf. So your goal as a wepy user is to obtain the necessary
   components and give them to the engine (the simulation manager),
   which will run the entire workflow. For the learned software
   engineers out there, ``wepy`` uses dependency injection (AKA
   inversion of control).

   First we assemble our components by whatever means necessary:

   .. code:: python

       from some.sim.engine import SimRunner
       from some.resampler import SomeResampler

       runner = SimRunner()
       resampler = SomeResampler()

   Then we configure the engine by passing it the things it needs
   (injecting the runner and resampler dependencies):

   .. code:: python

       from wepy.sim_manager import Manager

       # create initial walkers...
       init_walkers = make_walkers()

       sim_manager = Manager(init_walkers,
                             runner=runner,
                             resampler=resampler)

   This gives the ``sim_manager`` object control of our components. It
   expects them to have methods for running simulations and resampling.
   We instruct the ``sim_manager`` to run a simulation:

   .. code:: python

       results = sim_manager.run_simulation(...)

   And we get back our results. In the future examples we will also
   give the simulation manager a collection of "reporters" which will
   produce rich datasets as "side-effects" of the simulation. In
   practice, running simulations isn't very useful without them, but
   for simplicity we leave them out for now.

   So we see that the core framework of wepy is actually very simple
   and really only consists of the ``Manager`` object and the interface
   it defines (by calling specific methods of the components you gave
   it). The simplicity of this makes it very flexible to use with other
   components.
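   For reference, a fuller wiring might sketch out like this, showing
   where the components described in the rest of this guide plug in.
   This is a hypothetical sketch: all the lowercase names are
   placeholders for components you would construct yourself.

   .. code:: python

       from wepy.sim_manager import Manager

       sim_manager = Manager(init_walkers,
                             runner=runner,
                             resampler=resampler,
                             # the remaining components are optional
                             boundary_conditions=bc,
                             reporters=[reporter],
                             work_mapper=work_mapper)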
#. wepy: the library

   It is a library in the sense that some of these components come
   freely available and off the shelf for you to use, perhaps with the
   tweaking of a few simple parameters. For example, you can import the
   WExplore resampler simply by writing:

   .. code:: python

       from wepy.resampling.resamplers.wexplore import WExploreResampler

   The nonsense example above did the same to retrieve some components
   from a library not included in wepy. Thus, wepy also includes a
   library of generally useful components that can be used to configure
   the simulation manager. The main categories of library components
   used by the sim manager are:

   - runners
   - resamplers
   - distance metrics
   - boundary conditions
   - reporters
   - work mappers

   There are also some libraries related to storage and analysis of WE
   simulation data that are not used as injected dependencies of the
   sim manager:

   - HDF5 storage
   - analysis

   Some examples of off-the-shelf components are:

   * :class:`wepy.runners.openmm.OpenMMRunner`
   * :class:`wepy.resampling.resamplers.revo.REVOResampler`
   * :class:`wepy.resampling.distances.receptor.UnbindingDistance`
   * :class:`wepy.boundary_conditions.receptor.UnbindingBC`
   * :class:`wepy.reporter.hdf5.WepyHDF5Reporter`
   * :class:`wepy.reporter.dashboard.DashboardReporter`

   These only need to be imported and constructed, e.g.:

   .. code:: python

       from wepy.runners.openmm import OpenMMRunner

       runner = OpenMMRunner(system, topology, integrator,
                             platform='Reference')

#. wepy: the library of frameworks

   As a reminder, classes in OOP define a type of object, similar to
   how the types ``int`` or ``str`` define a *class* of possible
   values. To create an object from a class you are said to
   **construct** it. Objects (and classes) are essentially containers
   for both code and data. The code of an object or class is a method,
   and the data is an attribute.

   Classes can be created in two ways. The first is that the developer
   writes it, defining how to construct it and what the methods and
   attributes are. The second way is to make a class out of another
   class. Making classes from other classes in OOP is called
   **inheritance**, and all functionality of a super-class is shared by
   the sub-class. **Abstract base classes (ABCs)** are classes that
   exist for the sole purpose of making other classes from them.
   Sub-classes of ABCs in turn can either be abstract again, or
   concrete. A subclass that inherits from an abstract class only
   becomes concrete (and thus usable, like the OpenMMRunner component)
   when it is **customized**, which is additional code written by the
   developer. Customizations can either add orthogonal functionality,
   such as new methods or attributes, or they can **override**
   functionality from a super-class, changing the behavior or type of
   an existing method or attribute.

   The goal of all this is ultimately two-fold:

   - reduce the amount of code you need to write
   - fix problems only once

   Let's look at a real example in the
   ``wepy.resampling.distances.receptor`` module. First there is the
   definition of the class ``ReceptorDistance``:

   .. code:: python

       from wepy.resampling.distances.distance import Distance

       # the geometry helpers used below (box_vectors_to_lengths_angles,
       # group_pair, center_around, superimpose) are imported from
       # elsewhere in the real module

       class ReceptorDistance(Distance):
           """Common abstract class for receptor-ligand molecular systems."""

           def _unaligned_image(self, state):

               box_lengths, box_angles = box_vectors_to_lengths_angles(
                   state['box_vectors'])

               grouped_positions = group_pair(state['positions'], box_lengths,
                                              self._bs_idxs, self._lig_idxs)

               # then center them around the binding site
               centered_positions = center_around(grouped_positions,
                                                  self._bs_idxs)

               state_image = centered_positions[self._image_idxs]

               return state_image

           def image(self, state):

               state_image = self._unaligned_image(state)

               sup_image, _, _ = superimpose(self.ref_image, state_image,
                                             idxs=self._image_bs_idxs)

               return sup_image

   We see that ``ReceptorDistance`` is inheriting from the ``Distance``
   class, which is defined as such:

   .. code:: python

       class Distance(object):
           """Abstract Base class for Distance classes."""

           def __init__(self):
               """Constructor for Distance class."""
               pass

           ...

   All classes ultimately inherit from ``object``. Also ignore the
   ``self`` keyword for now, except to note that all object methods
   must have it. The method ``__init__`` is how this class constructs
   an object. You can think of this:

   .. code:: python

       dist = Distance()

   as being (roughly) equivalent to:

   .. code:: python

       dist = Distance.__init__()

   We notice that the ``Distance`` class defines this method
   ``distance``:

   .. code:: python

       class Distance(object):
           ...

           def distance(self, state_a, state_b):
               ...

   But this is not written under ``ReceptorDistance``. Because
   ``ReceptorDistance`` inherits from ``Distance`` it also inherits the
   ``distance`` method. So while it is not written under
   ``ReceptorDistance`` it still has access to it.

   We notice that ``Distance`` also defines the method
   ``image_distance``:

   .. code:: python

       class Distance(object):
           ...
           def image_distance(self, image_a, image_b):
               raise NotImplementedError

   But this will always raise the ``NotImplementedError`` exception,
   which makes it an unusable function. This is because the method is
   defined merely as an example of what an ``image_distance`` method
   should look like; it is an interface definition. Because only the
   ``image_distance`` interface is defined, and not its implementation,
   the whole ``Distance`` class is labelled abstract, despite it having
   some functioning methods like ``distance``.

   The ``ReceptorDistance`` class customizes ``Distance`` in a couple
   of ways. It re-implements the ``__init__`` and ``image`` methods
   (overriding), adds the ``_unaligned_image`` method, and inherits the
   abstract ``image_distance``. So it has added some valuable methods
   but is still abstract, because ``image_distance`` still raises the
   ``NotImplementedError``.

   Concrete sub-classes of ``ReceptorDistance`` are defined by
   ``UnbindingDistance`` and ``RebindingDistance``:

   .. code:: python

       class UnbindingDistance(ReceptorDistance):

           def image_distance(self, image_a, image_b):

               lig_rmsd = calc_rmsd(image_a, image_b,
                                    idxs=self._image_lig_idxs)

               return lig_rmsd

       class RebindingDistance(ReceptorDistance):

           def image_distance(self, image_a, image_b):

               state_a_rmsd = calc_rmsd(self.ref_image, image_a,
                                        idxs=self._image_lig_idxs)
               state_b_rmsd = calc_rmsd(self.ref_image, image_b,
                                        idxs=self._image_lig_idxs)

               d = abs(1. / state_a_rmsd - 1. / state_b_rmsd)

               return d

   In both of these only the ``image_distance`` method is implemented.
   What's important to notice is that both ``UnbindingDistance`` and
   ``RebindingDistance`` have the capability to run the ``distance``,
   ``image``, etc. methods that were defined in their common
   superclasses. So not only is the amount of code written for these
   classes small and focused on the task at hand (calculating the
   distance between two walker images), but if there are any bugs in
   the shared code, say in ``_unaligned_image``, then when it is fixed
   they both will be fixed. They both will also break when something in
   a superclass breaks, but this can be seen as a good thing because
   bugs will be found faster.

   So when you import an abstract base class to use as the foundation
   to build other classes, you are importing a framework from a
   library. For example, when we imported ``Distance`` above for
   ``ReceptorDistance`` we wanted to build something that looks, talks,
   and quacks the same as a ``Distance`` class but that adds some new
   and interesting functionality to it. The methods that are inherited
   may be expected by other components in a framework, or they may be
   for internal use only. In any case, inheritance is easy, easily
   overridden, and can make interfacing more seamless.

   So in this sense the ``wepy`` project contains not only the main
   ``sim_manager`` framework but a number of sub-frameworks that aid in
   constructing the main components.
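   To see the whole abstract/concrete pattern in one self-contained
   (and purely hypothetical) toy, here is a runnable mirror of the
   structure above, with scalar "states" standing in for molecular
   ones:

   .. code:: python

       class ToyDistance(object):
           """Abstract: image_distance is only an interface definition."""

           def image(self, state):
               # default image: the state itself
               return state

           def image_distance(self, image_a, image_b):
               raise NotImplementedError

           def distance(self, state_a, state_b):
               # a concrete method that relies on the abstract one
               return self.image_distance(self.image(state_a),
                                          self.image(state_b))

       class AbsToyDistance(ToyDistance):
           """Concrete: only image_distance needed to be written."""

           def image_distance(self, image_a, image_b):
               return abs(image_a - image_b)

       # the inherited distance() works through the override:
       print(AbsToyDistance().distance(1.0, 4.0))   # 3.0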
What kind of wepy user are you?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Understanding what you want to accomplish with wepy can help you
understand which parts to pay attention to and which to ignore. I've
outlined some possible goals a user might have with wepy, in order of
least to most expertise needed:

#. Recreate results found in a paper published using wepy.
#. Evaluate the utility of WE to enhance the sampling of my equilibrium
   OpenMM MD simulations of biomolecules on a small scale.
#. Run MD-WE simulations on a large scale.
#. Run a more complex MD simulation involving non-OpenMM
   dynamics/sampling engines, non-equilibrium simulations, or research,
   development, and prototyping of novel resampling algorithms.
#. Change, fix, or contribute a major feature to ``wepy`` itself.

In order:

#. Recreate results found in a paper published using wepy

   If you are looking at wepy for the first time and just want to see
   what WE is all about, and maybe play with the output to try and
   understand the kind of data that is produced, you should start with
   an example that recreates a published result. For that we provide
   examples and tutorials for you to run and perform some standard
   analysis with. Getting hands-on experience with the resulting data
   structures (walker resampling family trees etc.) is a great way to
   understand WE, as it is quite different from normal, linear MD
   simulations.

#. Evaluating WE for your system

   If you are interested in a WE algorithm (WExplore, REVO, etc.)
   because you read a paper and thought it could be applied to your
   system of interest, you will want to run it to evaluate whether it
   looks promising. For this you will want to follow a tutorial to get
   your system set up. For now ``wepy`` only comes with built-in
   support for OpenMM MD simulations, so the first step is to follow
   the OpenMM `documentation <http://docs.openmm.org>`__ to set up an
   MD simulation. OpenMM has support for most force fields. This is
   easily the most difficult part of the process. Once you have working
   MD simulations you will only need to create a distance metric that
   characterizes the type of behavior you want to enhance in your
   simulations, if there is not already one available in wepy or a
   :ref:`contributed library`.

#. Running wepy on a large scale

   If you have been successful at running ``wepy`` but find yourself:

   - overloaded with managing too many simulation results
   - with an excess of copy-pasted and tweaked scripts with
     increasingly complex filenames or directory structures
   - wanting to run multiple simulations that are continuations of each
     other
   - finding your jobs failing and losing all your progress

   you will probably want to start using some of the more advanced
   **orchestration** features of wepy and the data aggregation methods
   in ``WepyHDF5``.

#. Advanced or custom simulation requirements

   If you need to:

   - run simulations with another kind of dynamics engine
   - implement boundary conditions for non-equilibrium simulations
   - implement or prototype a new resampler
   - implement a new reporter
   - implement a new work mapper for distributed or parallel computing

   all of this can (ideally) be done without having to modify the core
   ``wepy`` code base. First check if there is a similar contributed
   project that you could use, or contribute to yourself. Otherwise you
   are free to implement your component however you please, as a
   standalone importable module or directly in your run script.

   If you think the feature is general-purpose enough to request that
   the devs implement it, you can make a feature request on the issue
   tracker. If you need help implementing the feature, read on, or
   contact the devs for some advice. If you think that there is
   something missing in the core library that is necessary for
   implementing the feature, you can make a feature request, although
   we may end up just pointing you to an existing mechanism.

   If you want other people to know about your creation we can add it
   to the contributed packages list if it is a proper module. If it is
   a bit rougher but still useful we can add it to the developer
   resources page.
#. Contribute to wepy itself

   As mentioned in the last section, if you need to have some changes
   made to core ``wepy`` to implement your new component, you can make
   a feature request or you can submit the merge/pull request yourself!
   We are also open to the eventual inclusion of popular and mature
   contributed modules into the wepy core library, if you want to fold
   the maintenance of those modules into core.

Overview of frameworks
----------------------

Simulation manager
~~~~~~~~~~~~~~~~~~

The simulation manager framework can be configured with the following
components:

- **initial walkers**: the initial walkers (weights and states) to
  start a simulation with; can be from a single starting point or the
  final walkers from a previous simulation.
- **runner**: the dynamics (or sampling) engine, which acts on the
  initial walkers
- **boundary conditions**: specify walker modifications (warping) based
  on rules, outside of the runner dynamics
- **resampler**: performs the resampling through cloning and merging
- **reporters**: generate data as side effects based on the behavior of
  the other components.
- **work mappers**: the mechanism by which the work of the runner is
  achieved, i.e. achieving parallelism.

The simulation manager can also be made to run simulations by different
methods, which can be seen in the API documentation. Here we discuss
how the simulation manager actually achieves the work of running a
simulation and how the components are actually used.

In your simulation script you will configure a ``Manager`` object which
contains the components for running the simulation, by constructing a
``Manager`` object:

.. code:: python

    from wepy.sim_manager import Manager

    from my_wepy_components import *

    sim_manager = Manager(...)

Once the ``Manager`` has been constructed we run a simulation by
repeatedly calling the ``run_cycle`` method. Before doing this, though,
we must initialize the contexts for a single "run" of a simulation.
This is achieved through the ``init`` method, which triggers the
components which have runtime setup routines to do that. This is
primarily for reporters to open file handles and initialize run data,
and for the work mapper to start worker processes. Once the **run** has
been initialized we can call ``run_cycle`` how we like:

.. code:: python

    # for the first cycle we use the initial walkers
    start_walkers = sim_manager.init_walkers

    n_steps = 1000

    sim_manager.init()

    for i in range(10):

        # run a full cycle
        end_walkers, components = sim_manager.run_cycle(start_walkers,
                                                        n_steps, i)

        start_walkers = end_walkers

    sim_manager.cleanup()

In this example we run 10 cycles of 1000 steps each. We also call the
``cleanup`` method, which allows components to gracefully tear down,
such as closing files, flushing buffers, and stopping processes.

There are a couple of builtin methods to do this for you, but it's
worth showing that the ``run_cycle`` method is where the real magic
happens. For example:

.. code:: python

    n_cycles = 10
    steps = [1000 for i in range(n_cycles)]

    walkers, components = sim_manager.run_simulation(n_cycles, steps)

achieves the same as the example above.

The ``walkers`` output of ``run_cycle`` is what you expect it to be.
The ``components`` output is a collection of the various components
that may have been mutated during the simulation. For example,
resamplers like WExplore are history dependent and store copies of
walker states (as compressed images) in a tree of regions, which is
kept in the resampler object.
These components are modified in place when called in ``run_cycle``
(this is not a purely functional process, as there is no input of
components to ``run_cycle``), but we return a reference to them each
cycle so you don't need to introspect the ``Manager`` object. Utilizing
the components at this level is beyond the scope here and is primarily
for supporting orchestration facilities.

Another example is running timed simulations:

.. code:: python

    end_walkers, _ = sim_manager.run_simulation_by_time(3600, 1000)

This runs simulations for **roughly** one hour with 1000 steps per
cycle. Currently, this is implemented by checking the total runtime at
the beginning of a cycle: if the runtime has exceeded this time, the
run loop is exited. If you use this option, understand that you will
need to budget enough time beyond this limit to run one whole cycle (if
you have only 1 second left on the clock the cycle will still run; no
guessing is done) plus the teardown routines.

The ``run_cycle`` method has several steps and it is very important to
understand the order in which the components are executed to be able to
reason about your simulation results, even if you are not implementing
any components. Secondarily, understanding which methods of which
components are called, and with what arguments, is the *de facto*
interface definition for those components.

It is also helpful to know what your options are for storing state
during a simulation. For example, should we write data out with a
reporter or save it in the resampler object? Should a piece of state be
carried in the runner, boundary condition (BC), resampler, or sim
manager?

The execution of ``run_cycle`` is as follows:

#. ``runner.pre_cycle(walkers, n_segment_steps, cycle_idx)``
#. ``run_segment(walkers, n_segment_steps)`` ->
   ``work_mapper.map(runner.run_segment)``
#. ``runner.post_cycle()``
#. ``boundary_conditions.warp_walkers(walkers, cycle_idx)`` (if
   present)
#. ``resampler.resample(walkers)``
#. ``reporter.report(**report)`` for all reporters

At a high level there are two categories of components: **apparatus**
and **configuration**. The runner, BC, and resampler are part of the
apparatus; the work mapper and reporters are part of the configuration.
The apparatus represents state that is meaningful in relation to the
content of the simulation (e.g. accumulated region definitions in
WExplore). The configuration is only related to *how* the simulation is
run in a particular situation.

For example, if you want to restart a simulation and do another run
starting at the end of another, it is not enough to only copy over the
walkers at the end of the simulation (unless your components are
stateless). If you have defined a thousand regions with WExplore, then
you need to have that information at the start of a run.

The configuration only deals with the reporters and work mappers; and
because neither of these can affect the actual content, there is no
need to keep a copy of them at the end of a simulation in order to be
able to restart it. In fact, for reporters it is likely that you will
want to avoid this, since if paths are the same you could potentially
overwrite data.

Separating apparatus and configuration allows for the snapshotting of
simulation state separate from details about how the simulation was
actually run.
For instance, let's say you run one segment of a simulation on a node
with 4 GPUs and then some time later you want to continue that run, but
you only have access to a node with 2 GPUs; then you only need to
reparametrize the configuration to handle that. Another use case is
that you can add or remove reporters between runs without affecting the
apparatus.

These topics are discussed in more detail in the documentation on
orchestration, since it uses these concepts for actually producing
artifacts for snapshots and configurations. The distinction, however,
is still useful here because we clearly see which components affect
simulations.

Let's start with the apparatus components, since without these you
won't be needing the configuration.

#. Walkers and WalkerStates

   What a ``Walker`` object is, is very simple. It is a container which
   holds a state and a weight. The weight is a simple float value,
   which is assumed to be normalized with the rest of the weights of
   the walkers in an ensemble (a simple list container). The
   implementation is very simple:

   .. code:: python

       class Walker(object):

           def __init__(self, state, weight):
               self.state = state
               self.weight = weight

   You can see that there really are just those two attributes.

   The state part of the walker, however, is a bit trickier to define.
   This partially stems from the fact that the representation of
   simulation state in the various dynamics engines is wildly
   different, and it is impossible for the simulation manager itself to
   handle all the variants. Furthermore, the number of possible
   applications that require distinct kinds of states is not possible
   to specify up front in any case. For instance, molecular dynamics is
   fairly uniform in that you typically only have to worry about atomic
   positions and velocities in the state, and cubic box vectors.
   However, modern enhanced simulations use a wide variety of
   techniques that add all kinds of additional state, such as
   alchemical lambda variables. This required the definition of a
   common, general purpose, and extensible specification of how to
   represent states for use in ``wepy``.

   For this, simple key-value store semantics were chosen, where keys
   are strings, with the addition of one method, ``dict()``, which
   transforms the object into a pure python dictionary, and a
   constructor which takes values as key-word arguments. Anything that
   provides python-like dictionary syntax and the ``dict()`` method can
   be considered to implement the ``WalkerState`` interface, and will
   be called such even if it doesn't directly inherit from the actual
   ``WalkerState`` class. The implementation is very simple:

   .. code:: python

       class WalkerState(object):

           def __init__(self, **kwargs):
               self._data = kwargs

           def __getitem__(self, key):
               return self._data[key]

           def dict(self):
               """Return all key-value pairs as a dictionary."""
               return self._data

   where the ``__getitem__`` magic method implements the behavior for
   the square bracket access:

   .. code:: python

       state = WalkerState(thing='hello', other_thing=np.array([0, 1, 2, 3]))

       arr = state['other_thing']

       state_dict = state.dict()

   You can always just dump your state from whatever simulation engine
   into a ``WalkerState`` and be on your merry way:

   .. code:: python

       state_dict = {'positions' : ..., 'velocities' : ...}
       state = WalkerState(**state_dict)

   The sim manager takes care of copying walkers when it needs to, so
   you don't have to worry about returning copies or references to
   internal data, such as the ``_data`` attribute in the
   ``WalkerState`` class.
   This interface also supports wrapping state objects from other
   engines. This may be as simple as a constructor with a positional
   argument for one of these states:

   .. code:: python

       class MDEngineWalkerState():

           def __init__(self, md_state, **kwargs):
               self._state = md_state
               self._data = kwargs

           def __getitem__(self, key):

               if key == 'positions':
                   return self._state.getPositions()
               else:
                   return self._data[key]

   This approach requires no copying of the original state, and keeps
   the wrapped state accessible and retrievable in case some other tool
   or library specifically needs that class.

   From the simulation manager's point of view this is all that matters
   for it to work properly. However, all the other components will
   expect certain properties to be present. For example, the
   ``WepyHDF5`` reporter will expect there to be a 'positions'
   attribute, as in the above example.

   You should probably have walker states specialize on the runner they
   are used by, rather than on the "schema" of attributes they contain.
   This allows you to couple the runner and the state so that you can
   get some performance optimizations by carrying around the state
   without having to transform it every time you go between them. For
   instance, in the ``OpenMMRunner`` we set the state from the
   ``OpenMMState`` roughly by:

   .. code:: python

       sim = openmm.Simulation(...)
       sim.context.setState(walker.sim_state)

   instead of:

   .. code:: python

       sim = openmm.Simulation(...)
       sim.context.setPositions(walker['positions'])
       sim.context.setVelocities(walker['velocities'])
       sim.context.setBoxVectors(walker['box_vectors'])
       ...

#. Runners

   The Runner is the component that actually runs the sampling that the
   weighted ensemble algorithm will be enhancing, via resampling. As
   such, this can be any type of stochastic dynamics or sampling
   algorithm, such as Monte Carlo. Dynamics should be stochastic
   because trajectories need to be able to diverge following cloning
   events. That is, if you take deterministic dynamics and make a copy
   of one of those simulations, you will perform the exact same work in
   duplicate, which there is no point in doing in parallel.
   Furthermore, it would be impossible to enhance sampling through
   resampling, because we need to be able to capitalize on differences
   that arise between those cloned simulations.

   A Runner in wepy is typically a wrapper around some other dynamics
   engine, as they can be of considerable complexity and highly domain
   specific. The principal method a Runner must implement is
   ``run_segment``, which takes a walker, a definition of how long to
   run that segment, called the ``segment_length``, and possibly a set
   of key-value based arguments. This function should then return a
   single walker which has had its state updated according to those
   input parameters.

   The principal runner in ``wepy`` is the OpenMM runner, which
   essentially just does some initialization and then calls:

   .. code:: python

       simulation.step(segment_length)

   to run the simulation segment.
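   To make the ``run_segment`` shape concrete, here is a toy
   (hypothetical) runner that just does a 1D random walk on a scalar
   'position' in the state; it is a sketch of the interface, not a real
   wepy runner:

   .. code:: python

       import random

       # the Walker and WalkerState classes shown above
       from wepy.walker import Walker, WalkerState

       class ToyRandomWalkRunner(object):
           """Toy runner: each step adds Gaussian noise to 'position'."""

           def run_segment(self, walker, segment_length, **kwargs):

               x = walker.state['position']
               for _ in range(segment_length):
                   x += random.gauss(0.0, 1.0)

               new_state = WalkerState(position=x)

               # same weight: runners propagate states, they never
               # change weights (that is the resampler's job)
               return Walker(new_state, walker.weight)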
   Within the ``Manager.run_cycle`` method there is a call to a
   simulation manager method also called ``run_segment``. This
   ``Manager.run_segment`` method in turn calls ``runner.run_segment``
   once for each walker in the current ensemble. This is simply the
   common semantics of the ``map`` function, which takes a single
   function and applies it to multiple pieces of data. The behavior of
   how this is achieved is encapsulated within the ``work_mapper``
   object.

   For ``wepy``, a ``work_mapper`` must simply have a method called
   ``map`` that has the same function signature as the python built-in
   ``map``, except that the function to be called is an attribute of
   the object. Basically, the work mapper is called as such:

   .. code:: python

       new_walkers = list(self.work_mapper.map(
           walkers,
           (segment_length for i in range(num_walkers)),
       ))

   See the section on work mappers for more details on implementing
   them.

   Two additional methods are also called on the runner, in order to
   get a single call to the runner per cycle: ``pre_cycle`` and
   ``post_cycle``. Implement these if you have some state in the runner
   that needs to be updated outside of the ``run_segment`` calls.

#. Boundary Conditions

   Boundary conditions (often abbreviated as BC) are extra conditions
   placed on the simulation that allow for executing extra rules about
   the transformation of walker states. This is very useful for doing
   non-equilibrium simulations where, once walkers have reached some
   predetermined condition or region, the simulation is restarted in
   some original location. This allows for the calculation of rates
   from simulations. BCs are strictly optional and conceptually could
   be implemented within the Runner itself. However, having them
   separate makes them more composable with different simulations.
   Furthermore, BCs are useful for reporting information on walkers as
   a simulation progresses that is not computed in the runner engine.

   BCs are applied after the runner steps are completed and are called
   basically as so:

   .. code:: python

       warped_walkers, warp_data, bc_data, progress_data = \
           self.boundary_conditions.warp_walkers(walkers, cycle_idx)

   The name ``warp_walkers`` is meant to evoke the sense in which
   walkers are getting transformed according to something outside of
   the normal laws of physics the simulations implement. A typical
   example is non-equilibrium unbinding simulations (see the
   ``UnbindingBC`` class), where walkers start with a state in which a
   small ligand molecule is bound to a binding site of a protein, and
   sampling proceeds until the molecule has left the binding site and
   moved away from the protein. At that point the boundary conditions
   recognize this and "warp" the walker so that its state is replaced
   with the original starting state. These events are recorded in the
   returned ``warp_data`` object, which is the first example of a
   record data type, so let's take a moment to describe those.

   In addition to the walkers, there are a number of different pieces
   of data produced by the BC and resampler components. These are
   documented fully in the developer's architecture guide in terms of
   their formats. But suffice it to say now that they all have a
   key-value or record oriented data definition, which makes it much
   more convenient to implement storage layers, since they can all be
   essentially treated the same way except for their names.

   These records are vitally important to interpreting ``wepy``
   simulation data, because walker trajectories are no longer
   straightforward linear simulations and may have various warping and
   merging events that destroy old states. The ``warp_data`` warping
   records are especially important because they tell you where and
   when simulations were respawned in non-equilibrium simulations,
   which tells you how to reconstruct contiguous trajectories as well
   as how to calculate rates.
   One other possibility for warp records is that they do not actually
   "warp" the walker, in the sense that they may mutate walker state
   attributes which are orthogonal to the dynamics engine. This can be
   used to implement "colored" dynamics where, when a walker reaches
   some boundary, an enumerated value (called the color) is changed to
   indicate which boundary it last crossed. This color has no effect on
   MD propagation but is useful for calculating kinetics of the process
   while running what are essentially equilibrium simulations.

   Warping events that affect the same variables as the dynamics engine
   are often called "discontinuities". An optional interface a
   ``BoundaryConditions`` class can implement to determine whether a
   record indicates a discontinuity is
   ``warping_discontinuity(warp_record)``, which returns a boolean.
   This is used by some of the analysis routines to automatically
   obtain continuous trajectories or to show in tree graphs where
   exactly warping events occurred.

   The other two record types are fairly accessory: BC records and
   progress records. The BC records are meant to allow for reporting on
   the changes in state of the boundary conditions as a simulation
   progresses. I am not aware of any practical use of this, but one
   could imagine changing the value of a cutoff as a function of
   simulation time.

   The progress records are not critical to the functioning of the
   simulation but are a way to not waste values which are computed when
   checking for boundary conditions. Unlike warping and BC records,
   progress records are produced every cycle, once for every walker.
   For example, in ``UnbindingBC`` the minimum distance of the ligand
   to the protein is calculated every cycle to check whether the ligand
   has unbound. Instead of dropping these numbers on the floor, we pass
   them through with the progress data, and any reporter that is
   interested in them can report them.

   One can also imagine calculating values here which are not necessary
   for making a decision to warp or not, but we would caution that from
   a performance perspective this is not wise, since the
   ``warp_walkers`` call blocks the progression of the simulation and
   creates overhead. The real bottleneck in terms of time is usually
   the dynamics (especially in the case of MD), and an implementer of
   any Runner, BC, or Resampler component should aim to make them
   efficient so as to be able to run as much dynamics as possible. Of
   course there is a tradeoff here, and it should be approached from
   the perspective of improving the performance of the metric you are
   looking for rather than raw MD throughput. Presumably, the reason
   you are using WE is that brute-force sampling is not fast enough to
   begin with.

   Calculating observables on WE data is very convenient using the
   analysis tools in ``wepy``. If you do want to calculate quantities
   on-the-fly for some reason, this should be done in a reporter. You
   might want to do this if you aren't storing the entire state on disk
   because it is too large, but you still want to monitor some value
   that is a function of it; e.g. computing the temperature from the
   average kinetic energy of the velocities. Typically you don't store
   every frame of velocities because it uses too much disk space, but
   you could compute the temperature in a reporter and just write that
   single number. Furthermore, while it is not currently the case, it
   is possible to completely move reporting out of the critical path of
   the simulation so that it does not block. This is possible because
   reporting is a pure side-effect of the simulation, but it just
   requires a more complex concurrency architecture and fault
   tolerance.
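   To fix the shape of the BC interface in your head, here is a toy
   (hypothetical) boundary condition that never warps anything and only
   emits a per-walker progress value. The exact record formats are
   described in the architecture guide, so simple placeholders are used
   here:

   .. code:: python

       class ToyNoWarpBC(object):
           """Toy BC: no warping, just a progress value per walker."""

           def warp_walkers(self, walkers, cycle_idx):

               warp_data = []    # no warping events this cycle
               bc_data = []      # the BC state never changes

               # one progress value per walker, e.g. a distance; here
               # we just report the walker weights
               progress_data = {'weight': [w.weight for w in walkers]}

               return walkers, warp_data, bc_data, progress_data

           def warping_discontinuity(self, warp_record):
               # no warp ever happens, so none is discontinuous
               return False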
#. Resamplers

   Resamplers are the heart and soul of ``wepy`` and are the loci of
   the actually complex and interesting algorithms. I will eschew a
   description of what purpose a resampler serves at this point, as
   this is better described in a somewhat formal context; for more
   information see the resources in the introduction. What a resampler
   component looks like and does can be quite distinct from some of the
   theoretical formulations. This freedom is the key to the flexibility
   of using ``wepy`` for prototyping new resampling algorithms.

   Minimally, all a resampler must do is implement the ``resample``
   method. E.g. the trivial ``NoResampler`` is implemented like this:

   .. code:: python

       from wepy.resampling.resamplers.resampler import Resampler

       class NoResampler(Resampler):

           def resample(self, walkers, **kwargs):

               ...

               resampling_data = self._init_walker_actions(len(walkers))

               ...

               return walkers, resampling_data, [{}]

   where we just return the original walkers we were given. The
   additional return values are the resampling records, which report on
   how the cloning and merging took place (``resampling_data``), and
   the resampler records, which report on state changes of the
   resampler itself. The resampling records here are just the default
   ones produced by ``_init_walker_actions``, and there is no state for
   this resampler so we just produce a single empty record for that.

   The more important record type is the resampling records, as they
   are what lets us reconstruct a family tree of walkers through
   cloning and merging. The resampler records, on the other hand, are
   just for monitoring the resampler during the simulation and are very
   specific to each resampler.

   The discussion of the field types and format of the resampling
   records is a bit involved and largely unnecessary to understand
   unless you are implementing a very specialized resampler. If you are
   just using a resampler off the shelf, know that these are saved in
   the ``WepyHDF5`` format and the various analysis tools will take
   care of all the mundane details of utilizing them.
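   For a feel of what a real decision looks like, here is a toy
   (hypothetical) resampler that clones the heaviest walker and merges
   the two lightest, conserving walker number and total weight. It
   deliberately ignores the record formats (returning empty records),
   so it is a sketch of the math, not a drop-in component:

   .. code:: python

       import random

       from wepy.walker import Walker

       class ToyCloneMergeResampler(object):

           def resample(self, walkers, **kwargs):

               walkers = sorted(walkers, key=lambda w: w.weight)

               # merge the two lightest walkers: the surviving state is
               # chosen with probability proportional to its weight
               a, b = walkers[0], walkers[1]
               keep = a if random.random() < a.weight / (a.weight + b.weight) \
                      else b
               merged = Walker(keep.state, a.weight + b.weight)

               # clone the heaviest walker into two half-weight copies
               heavy = walkers[-1]
               clones = [Walker(heavy.state, heavy.weight / 2)
                         for _ in range(2)]

               new_walkers = [merged] + walkers[2:-1] + clones

               # real resamplers also return resampling/resampler
               # records; see the NoResampler example above
               return new_walkers, [], [{}]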
#. Reporters

   Reporters are the primary mechanism for saving data about
   simulations. As shown above, you could just run a cycle on your own,
   introspect the objects, and get the information you want. However,
   this would be specific to the implementation of each component. All
   reporters that are called from ``run_cycle`` can expect the same
   structure of data no matter the component that produced it.

   Besides the ``init()`` and ``cleanup()`` methods, each reporter must
   implement the ``report`` method, which takes some key-word
   arguments. The key-value pairs that the manager passes to the
   reporters are the same for all of them, but each reporter chooses
   which ones it cares about. This dictionary collectively is called
   the report. Currently, it has these keys in it:

   - ``cycle_idx``
   - ``n_segment_steps``
   - ``new_walkers``
   - ``warp_data``
   - ``bc_data``
   - ``progress_data``
   - ``resampling_data``
   - ``resampler_data``
   - ``resampled_walkers``
   - ``worker_segment_times``
   - ``cycle_runner_time``
   - ``cycle_bc_time``
   - ``cycle_resampling_time``

   The 'time' fields are various timings made of the components for
   some performance reporting; the rest have been discussed already.
   This listing may change fairly frequently, so if you are unsure
   check the source code.

   Also, when writing a ``report`` method always accept extra kwargs to
   handle new ones, e.g.:

   .. code:: python

       from wepy.reporter.reporter import Reporter

       class MyReporter(Reporter):

           def report(self, cycle_idx=None, n_segment_steps=None,
                      cycle_resampling_time=None,
                      **kwargs):
               ...

   In addition to the ABC ``Reporter`` class, the ``FileReporter`` and
   ``ProgressiveFileReporter`` are very useful to inherit from, as they
   handle some file path and file mode logic; the latter updates modes
   to allow for repeated writes to the same file for each cycle of a
   simulation. For example, the ``DashboardReporter`` need only handle
   parameters specific to its own function, and all the handling of
   filenames is done by a call to the superclass constructor:

   .. code:: python

       from wepy.reporter.reporter import ProgressiveFileReporter

       class DashboardReporter(ProgressiveFileReporter):

           def __init__(self,
                        step_time=None,
                        bc_cutoff_distance=None,
                        **kwargs):

               # handle filename(s) and mode(s) in the superclass
               super().__init__(**kwargs)

               # Dashboard logic ...
               self.step_time = step_time
               self.bc_cutoff_distance = bc_cutoff_distance

#. Work Mappers

   The final component is the work mapper. As mentioned in the section
   on runners, this is what actually achieves task parallelism over the
   walkers' dynamics segments.

   The simplest, and default, mapper is the simple ``Mapper`` class.
   Basically, it works by first being initialized with the function you
   want to map (in the case of the simulation manager it automatically
   does this with the ``runner.run_segment`` function) and then using a
   simple for-loop to sequentially compute the segments:

   .. code:: python

       class Mapper(object):

           def init(self, segment_func):
               self._func = segment_func

           def map(self, *args):

               args = [list(arg) for arg in args]

               results = []
               for arg_idx in range(len(args[0])):
                   result = self._func(*[arg[arg_idx] for arg in args])
                   results.append(result)

               return results
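   As a quick (hypothetical) usage check of the sketch above, the
   ``map`` semantics over two argument sequences look like this:

   .. code:: python

       m = Mapper()

       # stand-in for runner.run_segment: just pair up the arguments
       m.init(lambda walker, n_steps: (walker, n_steps))

       print(m.map(['w0', 'w1'], [10, 10]))
       # -> [('w0', 10), ('w1', 10)]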
   This is okay for test systems, but for real simulations that take a
   long time we will need to use some sort of parallelism. Currently,
   we provide a work mapper that uses a queue that tasks (the segment
   function plus its arguments) are put on, from which worker processes
   fetch tasks whenever they are free. This is the ``WorkerMapper``
   class, which starts ``Worker`` object processes using the python
   multiprocessing library. Because we are using OS processes instead
   of "threads", this is truly parallel even on the CPython runtime,
   with its infamous Global Interpreter Lock (GIL). The GIL effectively
   makes it so that a single python process can never execute threads
   in parallel, but will still let you program with thread semantics
   and maybe make you believe you are multi-threaded. It may be
   possible to use another python runtime like PyPy to get around this,
   but this has not been tested.

   Another note when using OS processes is that you will need to make
   sure you are creating processes in a way which is compatible with
   the dynamics engine runtime. For example, on linux systems you can
   make processes with a cheap ``fork``, or the more robust but more
   expensive ``spawn`` method. When creating processes after a CUDA
   context has been defined, you must use the ``spawn`` option (at
   least with OpenMM). This can be set in your run script like so:

   .. code:: python

       import logging
       import multiprocessing as mp

       # set the process creation method
       mp.set_start_method('spawn')

       # useful tip for logging in multiprocessing:
       mp.log_to_stderr(logging.WARNING)

   For different environments and runners you can use different worker
   types for customization, if necessary. This is one case where
   inheritance really matters, since workers must inherit from the
   ``Process`` base class. For example, the OpenMM module defines two
   workers, for CPU (``OpenMMCPUWorker``) and GPU (``OpenMMGPUWorker``)
   based workers. The former allows you to specify the number of
   threads to use per CPU, and the GPU worker specifies which GPU
   device index to use.

   In the simulation manager's call to ``init``, a worker process is
   created for each device that is present (CPU or GPU) and two queues
   are initialized: the work queue and the results queue. The worker
   processes then begin polling the work queue for items. At the
   beginning of a cycle, one ``Task`` object per walker is placed on
   the work queue and immediately the workers begin popping off tasks.
   Each worker then computes the walker-task, places the result onto
   the result queue, and polls the queue again for new tasks, until
   they reach the end of the tasks. At the end of the cycle the main
   simulation manager process pops the results off the result queue and
   structures them as walker states. At the end of a simulation a
   special "poison pill" is placed on the work queue for each worker,
   which is a signal to shut down.

Resampling Framework
~~~~~~~~~~~~~~~~~~~~

See the sub-package documentation: ``wepy/resampling/__init__.py``

Simulation Data Persistence (WepyHDF5) and Analysis
---------------------------------------------------

We have discussed the components that are necessary to run a simulation
using the simulation manager and the interfaces these components must
implement. The other half of the equation is to store the data
associated with the simulation and to be able to analyze and transform
that data. This is where the ``WepyHDF5`` format comes into play.

The module ``wepy.hdf5`` has a class ``WepyHDF5`` which defines an
interface for creating, accessing, and adding data to a single HDF5
format file which can be used for any ``wepy`` simulation. If you are
not familiar with HDF5, it is a general purpose binary format for large
amounts of structured numerical data. While in-depth knowledge of how
HDF5 works is not necessary to use the ``WepyHDF5`` class, it
definitely makes sense to at least get an overview of the performance
and memory behaviors. For this I suggest going through the
documentation for the `h5py <https://www.h5py.org>`__ library and the
book "Python and HDF5: Unlocking Scientific Data" by Andrew Collette,
who is also the original ``h5py`` author. ``WepyHDF5`` uses ``h5py``
under the hood, so if there is ever a functionality that one of its
methods doesn't provide, you can always drop down and use ``h5py``
directly.

The main features of HDF5 are groups and datasets, which are roughly
equivalent to directories and files in common hierarchical filesystems.
The difference between datasets and files is that HDF5 datasets must
have explicit data shapes and types (integers, floats, strings, etc.).
Groups contain other groups and datasets, and datasets make up the
leaves of the tree. In ``h5py``, groups and datasets also have string
paths, like files, in order to access them.
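For example, you can always inspect a ``WepyHDF5`` file with bare
``h5py`` (the filename here is hypothetical; the paths follow the
run/trajectory layout outlined below):

.. code:: python

    import h5py

    with h5py.File('results.wepy.h5', 'r') as f:

        # print the path of every group and dataset in the file
        f.visit(print)

        # read one trajectory field as a numpy array
        positions = f['runs/0/trajectories/0/positions'][:]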

The core HDF5 library simply provides these building blocks, while the ``WepyHDF5`` class specifies and implements a "schema" using them. So a ``WepyHDF5`` file is just any file that has the same structure as one that would be constructed or read by the ``WepyHDF5`` class. A more in-depth (but not formal) description of this "schema" is given in the module API documentation.

Briefly though, the file is primarily organized around the concept of a run. Each "run" contains all the data and metadata for a single ``wepy`` simulation; that is, after the call to ``Manager.init()``, every ``Manager.run_cycle`` call writes to the same run until we call ``Manager.cleanup()`` (and likewise for a call to ``Manager.run_simulation``, etc.).

A run contains essentially two types of datasets: trajectories and records. Trajectories (including the initial walkers) are the results of the sampling step produced by the runner. A single trajectory is a group containing any number of "fields", each of which is a single attribute of a frame of a trajectory. For a typical molecular dynamics simulation this includes the positions, box vectors, velocities (if given), and the temperature, volume, etc. The records are the data produced by the various components like the resampler and boundary conditions (BCs). The meaning of these different record groups is discussed in the documentation for those components. Their storage in the HDF5 file is the same, however, and again works using any number of fields like the trajectories. So the basic outline of an HDF5 file is:

- runs

  - run: 0

    - trajectories

      - traj: 0

        - field: ``positions``
        - field: ``box_vectors``
        - …

    - resampling records

      - field: ``decision``
      - field: ``target_idxs``
      - …

    - warping records

      - …

  - run: 1

    - …

The primary way in which this file is created is by using the ``WepyHDF5Reporter``. If there is only one reporter you ever use, it should be this one! Please see the tutorials and documentation for how to fully make use of this reporter.

Once you have generated a ``WepyHDF5`` file from a run (or many runs) you will want to analyze the data. For this the ``wepy.analysis`` sub-package is available, along with some basic functions in the ``WepyHDF5`` API. The analysis package is intentionally limited to functionality which works directly on the ``WepyHDF5`` file or on a set of records from a component. This is in order to not bloat ``wepy`` with all manner of domain-specific analysis tools that become overly integrated with our own peculiar data structures.

These tools provide a way to transform a subset of your data into other formats like numpy, pandas, networkx, and mdtraj. They also provide utilities for giving different views onto the data so that excessive copying of the trajectory data is not needed. The most useful method is ``WepyHDF5.compute_observable``, to which you pass a function that computes some value over all of the frames of your trajectories. These computed values can either be returned to be used in some other context or written directly to the file as a trajectory field. Writing to the file has the advantage that later transformation views on the file will always have direct access to these "observables" fields, without having to deal with complex indexing schemes for external data.
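
As a rough sketch of this pattern (the exact ``compute_observable`` signature, including how fields are requested and how results are saved back to the file, should be checked against the API reference; the field name and the commented usage here are only illustrative):

.. code:: python

   import numpy as np

   def radius_of_gyration(fields_d, *args):
       # assume fields_d maps requested field names to arrays over all
       # frames, e.g. positions with shape (n_frames, n_atoms, 3)
       positions = fields_d['positions']

       # squared distance of each atom from the frame centroid
       centroids = positions.mean(axis=1, keepdims=True)
       sq_dists = ((positions - centroids) ** 2).sum(axis=(1, 2))

       # one (mass-unweighted) radius of gyration per frame
       return np.sqrt(sq_dists / positions.shape[1])

   # hypothetical usage on an open WepyHDF5 object:
   # rgs = wepy_h5.compute_observable(radius_of_gyration, ['positions'])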

The three primary "views" provided by the analysis modules are in the ``contig_tree``, ``network``, and ``parents`` modules.

The ``contig_tree`` module introduces the notion of a **contig** (a term borrowed from the genome-assembly community, but totally distinct here). A contig in this sense is simply the concatenation of multiple ``wepy`` runs to form a single *contig*\ uous whole. The **contig tree** is a more general expression of this and represents the actual tree (or forest) of runs that are started from each other. For instance, you could do one run and then restart it in two distinct simulations, in which case you now have a tree. This tree-like structure makes it more difficult to perform sliding-window calculations and similar analyses, and so this module provides special functionality for them.

But why add this extra layer of abstraction on top of runs? If I don't have bifurcating simulations, couldn't we just keep concatenating frames onto a single run and analyze that? The answer is yes, I suppose you could do that if you want. However, this use case is not explicitly provided for in ``wepy`` because we see the unit of a "run" as both the data produced and the time, place, and machine(s) it was computed on. A run should be produced by the execution of a single script or job on a timesharing system like SLURM or Torque. This preserves the provenance of the units of execution; otherwise you would need to keep an index of which cycles of the run were executed by which jobs. Furthermore, it supports immutability of already completed work. Instead of modifying the chunks of data inside the runs and potentially corrupting them, just keep adding new runs which don't touch the other ones. Our typical workflow is to produce a single run in a single file per job, and to never fiddle with that file until it has been properly aggregated and archived. There are tools in ``wepy`` that aid in linking between files and aggregating files, so that a single ``WepyHDF5`` object has access to the data of many other files. The contig and contig tree are the conceptually complete unit of a "simulation". See the tutorials on how to make use of them.

The next data "view" is the ``wepy.analysis.network.MacroStateNetwork``. Outside of resampling-type enhanced sampling algorithms, simulations are very linear, and so there is always one canonical way to go through the data that makes sense. Of course, when trajectories become very long the practicality of this is challenged, and so various techniques for reducing the dimensionality are used, such as clustering and Markov State Models (MSMs). These representations are what we call Conformation State Networks (CSNs) or Macrostate Networks. They are essentially networks in which the nodes are some sort of "macro-state" representing a collection of "micro-states", and the edges represent the observed transitions between the macrostates as determined from the transitions between microstates seen in dynamics. The various names indicate certain mathematical properties of the values on the edges and nodes, including rates and probabilities, but the structure is the same.

The ``MacroStateNetwork`` class wraps a ``WepyHDF5`` object and holds a mapping of trajectory frame indices for each "macro-state" in a network. This mapping can be made automatically by providing a field name from the trajectories, and each unique value will become its own macrostate. This field should probably be some sort of enumerated type like an integer or string, which can be calculated using the ``compute_observable`` method. Typically, it will be the result of some clustering or MSM algorithm (supported by the sliding-window methods of the ``ContigTree``).
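
A hedged sketch of constructing such a network, assuming the constructor accepts the ``WepyHDF5`` object and the name of an assignment field (the argument name ``assg_field_key`` and the ``observables/cluster_ids`` field are illustrative guesses; check the API reference for the real signature):

.. code:: python

   from wepy.hdf5 import WepyHDF5
   from wepy.analysis.network import MacroStateNetwork

   # 'results.wepy.h5' is a hypothetical file name
   wepy_h5 = WepyHDF5('results.wepy.h5', mode='r')

   # each unique value of the assignment field becomes a macrostate node
   msn = MacroStateNetwork(wepy_h5,
                           assg_field_key='observables/cluster_ids')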

The network is implemented as a ``networkx`` directed graph, and any of the multitude of network and graph-theory algorithms there can be leveraged for analyzing your state network. Furthermore, the ``MacroStateNetwork`` allows very easy introspection of the microstates of any single macrostate. Finally, with the ``MacroStateNetwork`` it is trivial to produce transition probability matrices (edge matrices) which can be used to calculate committor probabilities etc. from a network. See the ``wepy.analysis.transitions`` module for the relevant functions.

The network representation of WE data is particularly important because there really is no canonical ordering of frames within the walker cloning & merging family tree, and so the natural representation is the state network. However, if you run a simulation with boundary conditions there is one meaningful linear representation: the trajectory of a walker that has crossed a boundary. We call these linear representations of trajectory data from the entire contig **traces**. In order to obtain traces we first have to use the resampling records to determine which walkers give rise to walkers in later cycles. Secondarily, we use the warping records to determine whether any discontinuous warping events occurred along these traces. The primary objects that abstract the walker family tree are the ``ParentForest`` class (in ``wepy.analysis.parents``) and the ``Contig`` class. See the tutorials for a complete example of how to use them. Among the more useful functions here is ``Contig.exit_point_trajectories``, which generates the full lineage of each walker that crosses a boundary. The ``ParentForest`` provides a ``networkx`` directed graph of the tree, which makes it amenable to the algorithms available there, and ``parents.ancestors`` gives the complete lineage of any walker.

In addition to these basic views onto the underlying HDF5 dataset, there are also a few analysis routines for calculating rates and free-energy profiles, which are very common use cases for simulations. Boundary conditions are often used precisely for the purpose of calculating rates; the ``wepy.analysis.rates`` module covers this. The ``wepy.analysis.profiles`` module covers generating free-energy profiles and probability distributions, both for the entire simulation and as a series, so you can easily see the convergence of a simulation with respect to a given projection.
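
To make the underlying arithmetic of such a profile concrete, here is a minimal sketch in plain ``numpy`` (this illustrates the weighted-histogram idea only and is not the ``wepy.analysis.profiles`` API; the arrays are random stand-ins for per-frame observable values and walker weights):

.. code:: python

   import numpy as np

   # stand-ins for per-frame observable values and walker weights
   values = np.random.uniform(0.0, 10.0, size=1000)
   weights = np.random.dirichlet(np.ones(1000))

   # the weighted histogram approximates the probability distribution
   hist, edges = np.histogram(values, bins=50, weights=weights,
                              density=True)

   # free energy up to an additive constant, in units of kT;
   # empty bins end up at +inf
   with np.errstate(divide='ignore'):
       free_energy = -np.log(hist)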

JSON Topology and Converting to Other File Formats
---------------------------------------------------

``wepy`` is not a molecular data file reader/writer, of which there are a great many. This topic can cause considerable headaches if not done properly. For writing to file formats such as PDB, DCD, XTC, and all the rest, we rely on the ``mdtraj`` library to satisfy this need. It probably wouldn't be too difficult to make a connector to another library if you really need one, so don't think this is the only way. The ``WepyHDF5`` object, and the analysis wrappers thereof, provide a number of methods for generating ``mdtraj.Trajectory`` and ``mdtraj.Topology`` objects from stored data in various ways. See the API reference for a full listing of the options. Read the ``mdtraj`` documentation to get all of the details, but know that it is really as simple as:

.. code:: python

   traj = wepy_hdf5.to_mdtraj(...)
   traj.save_pdb('mymolecule.pdb')

It is worth noting that the JSON topology format used in ``WepyHDF5`` was actually taken from the HDF5 file format defined and implemented in ``mdtraj``. The actual function for converting ``mdtraj.Topology`` objects to JSON and back again was a bit hidden, so we extracted it and provide the pair as utilities in ``wepy.util.mdtraj``: ``json_to_mdtraj_topology`` and ``mdtraj_to_json_topology``. Another useful trick is that ``mdtraj`` also has converters to and from the OpenMM topology object: ``mdtraj.Topology.to_openmm`` and ``mdtraj.Topology.from_openmm``. This comes in handy for serializing your topologies to JSON after you create them in OpenMM.

This JSON topology format has several shortcomings in the opinion of this author; however, after surveying the available topology formats we have found it to be the most unambiguous and "programmable" one, and so we rely on it. Of course, different applications will have different kinds of "topologies", no topologies, or altogether different system specifications, and so the HDF5 format should not be seen as tied to this format for molecular systems. It primarily provides a good substrate for generating other files which are needed by other programs.

That said, there are no extensive libraries supporting this format. However, there really isn't any need, since the parser is in the python standard library:

.. code:: python

   import json

   top = json.loads(json_top_str)

Here the ``top`` object is just native python types, making it easy to do basic selections of atoms based on their atom or residue names and types, or on the bond connectivity. If you want to do more complex things like chemoinformatics or structural informatics, you will want to cast this into a purpose-built representation. There are JSON parsers in just about every language, so it is quite portable in that sense.

We do provide a few useful functions that make working with it a tad easier, contained in the ``wepy.util.json_top`` module. The highlights are functions for generating ``pandas.DataFrame`` tables for atoms, residues, or chains, e.g. ``json_top_atom_df``, and a function for getting a new topology from a subset of the atoms of the original, ``json_top_subset``. The latter is extremely useful for generating files for subsets of your entire MD system, excluding things like waters.
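
For instance, a hedged sketch of that subsetting workflow (it assumes ``json_top_atom_df`` returns one row per atom with a residue-name column and that ``json_top_subset`` takes the topology string plus atom indices; the column name ``residue_name`` and the water residue name ``HOH`` are guesses, so check the API reference):

.. code:: python

   from wepy.util.json_top import json_top_atom_df, json_top_subset

   # tabulate the atoms of the topology
   atom_df = json_top_atom_df(json_top_str)

   # select the indices of all non-water atoms
   sel_idxs = atom_df[atom_df['residue_name'] != 'HOH'].index.values

   # make a smaller topology containing only the selected atoms
   subset_top = json_top_subset(json_top_str, sel_idxs)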