HDF5 Reporter Data Structure

Record Groups

The protocol by which non-trajectory data is reported by the resampler and boundary conditions (BC) is unified, which makes it simpler to save in formats like HDF5.

The resampler and BC both have multiple record groups:

  • resampler

    • resampling

    • resampler

  • BC

    • warping

    • progress

    • boundary conditions

A record group can be thought of as a single table in a relational database. Each record group corresponds to a class of events that occur and each record in a record group corresponds to one event.

Record groups can be continual or sporadic.

A continual record is recorded exactly once per cycle and reports on the cycle itself.

A sporadic record can be reported zero or more times per cycle and responds to the event type defined by its record group.

  • continual

    • progress

  • sporadic

    • resampler

    • resampling

    • warping

    • boundary conditions

As you can see, currently most record groups are sporadic. This distinction is only used internally by the WepyHDF5 class to decide how records are stored, but it is useful to keep in mind during data analysis as well.
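As a minimal sketch of the difference in plain Python (illustrative only, not the actual WepyHDF5 layout):

# continual: exactly one record per cycle, so the cycle index is implicit in position
progress = [0.12, 0.31, 0.27]  # hypothetical progress values for cycles 0, 1, 2

# sporadic: zero or more records per cycle, so each record carries its cycle index
warping = [(1, 'walker 3 warped'), (1, 'walker 0 warped'), (4, 'walker 2 warped')]

# gathering the sporadic records for a given cycle requires filtering
cycle_1_warps = [rec for cycle_idx, rec in warping if cycle_idx == 1]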

Resampling Records

The 'resampling' records are probably the most important records for wepy because they record the cloning and merging of walkers.

Without the 'resampling' records your wepy simulation would be wasted, since you would no longer know the history of any given frame. You would just have a bag full of unconnected pictures.

A 'resampling' record is generated for each “assignment” event of a walker during resampling, and minimally should contain two fields: 'decision_id' and 'target_idxs'.

The 'decision_id' is an integer corresponding to an enumeration of the possible decisions that can be made as to the fate of the walker during resampling. While technically these decisions are also modular, it is likely that 99.9% of all users will use the CloneMergeDecision.

Detailed knowledge of this formalism is not usually needed to write resamplers that behave well (which is another topic), and the next few paragraphs can be safely skipped.

The enumerated decisions in this are:

  • NOTHING = 1

  • CLONE = 2

  • SQUASH = 3

  • KEEP_MERGE = 4
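For concreteness, this enumeration could be written as a Python IntEnum (a sketch only; wepy's Decision classes define their own enumerations):

from enum import IntEnum

# sketch of the clone-merge decision enumeration (values match the list above)
class CloneMergeDecisionEnum(IntEnum):
    NOTHING = 1     # don't clone or merge this walker
    CLONE = 2       # clone this walker into multiple children
    SQUASH = 3      # merge this walker away, donating its weight
    KEEP_MERGE = 4  # keep this walker's state in a merge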

The NOTHING decision means don’t clone or merge this walker.

CLONE means clone this walker.

SQUASH and KEEP_MERGE are related in that both involve merging.

A single merge involves a set of walkers that will be merged together; there must be at least 2 walkers in this “merge group”.

From the merge group only a single state will be preserved in the single resulting walker, while the weight of the final walker will be the sum of the weights of all the walkers in the group.

The state of the final walker will be drawn from the set of walkers in the merge group based on the behavior of the resampler (usually a choice weighted by their weights), but will always be identical to one of the walkers. The walker with the chosen state is the KEEP_MERGE walker. The rest are the SQUASH walkers.

The second field, 'target_idxs', determines which walkers will be merged with which other walkers, and is a tuple of integers indicating the location, or slot, where the resulting walker(s) will be placed.

A ‘slot’ is simply an available position in the lineup of walkers that will be simulated in a single cycle of WE. The number of slots is the number of walkers that will be simulated in the next cycle.

As an aside: In general the number of walkers used in a WE simulation is not specified (other than there needs to be more than 1). You can have a constant number of walkers, or a dynamic one with the number fluctuating during the simulation.

If you have too small a number of walkers then you will have a relatively sparse coverage of the sample space.

If you have too many the cycle throughput will be very slow.

Additionally, simulations run on GPUs will want the number of walkers each cycle to be a multiple of the number of GPUs, or some GPUs will lie idle once the task queue of walker runner segments is depleted.

So typically there is some constraint on the number of slots available in the next WE cycle. The constraint is decided on and enforced by the resampler, so if there is a mismatch between the resampling records and the walkers actually produced, the wepy simulation manager will not complain.

WARNING: Currently the WepyHDF5 storage backend and reporter do not support a dynamic number of walkers, although technically none of the other code has any problem with it.

The 'target_idxs' value for NOTHING and KEEP_MERGE is a 1-tuple of the integer index of the slot where the resultant walker will be placed.

The 'target_idxs' for CLONE is an n-tuple of integer indices of slots where n is the number of children of the clone and n must be at least 2 (or it would’ve been a NOTHING).

The 'target_idxs' of SQUASH is also a 1-tuple like NOTHING, except that since a SQUASH has no child it indicates the KEEP_MERGE walker its weight is added to. Note that this slot index is the one that the KEEP_MERGE record itself specifies, not the slot the KEEP_MERGE walker previously occupied (as that index is of no consequence to the current collection of walkers).

Thus a KEEP_MERGE walker defines a single merge group, and the members of that merge group are given by the SQUASH records that target it.

Critically, the 'step_idx' and 'walker_idx' (the slot index of the walker in the last cycle) fields should also be supplied so that lineage histories can be generated.
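To make these semantics concrete, here is a hypothetical set of 'resampling' records for a single resampling step with four walkers and four slots (the values are purely illustrative):

# walker 0 clones into slots 0 and 1; walker 2 is squashed into walker 3,
# whose merged walker lands in slot 2; walker 1 passes through to slot 3
resampling_records = [
    {'decision_id': 2, 'target_idxs': (0, 1), 'step_idx': 0, 'walker_idx': 0},  # CLONE
    {'decision_id': 1, 'target_idxs': (3,),   'step_idx': 0, 'walker_idx': 1},  # NOTHING
    {'decision_id': 3, 'target_idxs': (2,),   'step_idx': 0, 'walker_idx': 2},  # SQUASH into slot 2
    {'decision_id': 4, 'target_idxs': (2,),   'step_idx': 0, 'walker_idx': 3},  # KEEP_MERGE in slot 2
]

Every slot (0 through 3) is claimed exactly once, so the next cycle again runs four walkers.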

In addition to the Decision class record fields, any amount of other data can be attached to these records to report on a resampling event.

For example in the WExplore resampler the region the walker was assigned to is also given.

Warping Records

The next most important record group is the warping records.

These are of course only relevant if you are using boundary conditions, but among the three BC record groups these are the principal one.

Warping records determine the action that was taken on a walker after it met the criteria for a boundary condition event.

Minimally a warping record should specify the 'walker_idx' that was acted on and, if any warping event can be discontinuous, the 'weight' of the walker so that this can be accounted for in analysis.

The rest of the specification for boundary conditions does not have a protocol similar to the one for cloning and merging records and is left up to the developer of the class to decide.

For simple boundary conditions where there is only one possible outcome, an additional field is not even necessary.

The colored trajectories example provides one possibility. In this case you could have a field called 'color' giving the new “color” of the walker, which indicates the last boundary it crossed and could be a string or an integer enumeration.
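A sketch of what the class constants for such a warping record group might look like (the 'color' field and its integer encoding are hypothetical, following the colored trajectories example):

import numpy as np

# minimal warping record group spec with a hypothetical 'color' field
WARPING_FIELDS = ('walker_idx', 'weight', 'color')
WARPING_SHAPES = ((1,), (1,), (1,))
WARPING_DTYPES = (np.int_, np.float64, np.int_)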

Boundary Condition Records

This and all the remaining record groups are entirely optional.

A single boundary condition record reports on the event of a change in the state of the boundary condition object.

For example, a record might be generated if the cutoff value for a ligand unbinding boundary condition changes during a simulation.
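A hypothetical spec for such a record group, following that cutoff example (the field name is illustrative):

import numpy as np

# a record is generated whenever the cutoff changes, storing the new value
BC_FIELDS = ('cutoff_distance',)
BC_SHAPES = ((1,),)
BC_DTYPES = (np.float64,)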

Resampler Records

These records report on events that change the state of the resampler.

For example in WExplore a single record is generated every time a new region/image is defined giving details on the values that triggered this event as well as the image that was created.

This interpretation is semantically useful, but in practice this record group could also report on collective attributes of the walkers, such as all-to-all distances or histograms of the current batch of walkers.

It's up to the writer of the resampler to decide.

Progress Records

Progress records are provided mainly as a convenience to get on-line data analysis of walkers during a simulation.

For instance in ligand unbinding the progress may be the distance to the cutoff, or RMSD to the original state.
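A minimal sketch of such a progress record group spec (the field name is hypothetical):

import numpy as np

# one per-walker value reported every cycle; Ellipsis allows the number
# of walkers to differ between cycles
PROGRESS_FIELDS = ('min_distance',)
PROGRESS_SHAPES = (Ellipsis,)
PROGRESS_DTYPES = (np.float64,)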

The astute observer may note that these calculations could also be implemented in a reporter.

There are a few tradeoffs for that approach though.

One, the value may have already been calculated in the process of evaluating walkers for warping, and calculating it twice is potentially unacceptably wasteful (although one might imagine complex systems where reporters perform their actions asynchronously while the simulation manager moves on to new cycles).

Second, the flow of data would be forked. For example, the WepyHDF5Reporter assumes that all the data it reports on is contained in the records returned by the runner, resampler, and boundary conditions, and it cannot know about another reporter. Nor is it easy or wise to have two reporters acting on the same database.

Perhaps such analysis could be implemented as analysis submodules in the WepyHDF5Reporter to keep a single stream of data; if you think so, go ahead and make a pull request.

Specifying Record Group Fields

Each record group should have three class constants defined for it.

This is not strictly necessary from the perspective of either the simulation manager or the primary consumer of these records, the WepyHDF5Reporter, but it is very good practice: it will help catch bugs and will clarify the results your BC or resampler produces for those inspecting them.

The three definitions are:

  • field names

  • shapes

  • dtypes

Each should be defined as a class constant prefixed by the name of the record group followed by the definition type; for example, the resampling record group of WExplore looks like this:

import numpy as np
from wepy.resampling.decisions.clone_merge import MultiCloneMergeDecision

DECISION = MultiCloneMergeDecision

# extend the decision's own field spec with WExplore-specific fields
RESAMPLING_FIELDS = DECISION.FIELDS + ('step_idx', 'walker_idx', 'region_assignment',)
RESAMPLING_SHAPES = DECISION.SHAPES + ((1,), (1,), Ellipsis,)
# np.int was removed from NumPy; use np.int_ (or plain int) instead
RESAMPLING_DTYPES = DECISION.DTYPES + (np.int_, np.int_, np.int_,)

For the “fields”, each value is the name of the field and should be a string. In the example we are using the fields defined by the MultiCloneMergeDecision class.

The shapes are the expected shapes of a single element of the field. Three types of values are accepted here:

  A. A tuple of ints that specify the shape of the field element array.

  B. Ellipsis, indicating that the field is variable length and limited to being a rank-1 array (e.g. (3,) or (1,)).

  C. None, indicating that the shape will not be known until the first instance of this field is seen at runtime. Any field that is returned by a record-producing method but not specified here will automatically be interpreted as None.

Note that the shapes must be tuples and not bare integers for rank-1 arrays.

It is suggested that you use option A if possible. Option B will use a special HDF5 datatype for variable-length datasets, which can only be 1-dimensional and is much less efficient to store.

Option C is not advisable but is there because I know people will be lazy and not want to define all these things. By defining things ahead of time you will reduce errors by catching differences between what you expect a field to look like and what you actually receive at runtime.

If you are actually saving the wrong thing and don't specify the shape and dtype, then you may run weeks of simulations and never realize you didn't save what you intended.

The dtypes have similar options but there is no Ellipsis option.

Each non-None dtype should be a numpy dtype object. This is necessary for serializing the datatype to HDF5 (using the numpy.dtype.descr attribute).
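For example, the descr attribute renders a dtype as a plain, serializable description:

import numpy as np

# descr gives a list-of-tuples description of the dtype
print(np.dtype(np.float64).descr)  # [('', '<f8')] on little-endian machines
print(np.dtype(np.int32).descr)    # [('', '<i4')] likewise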

Record Fields

One additional class constant can be defined to make analysis in the future easier.

When accessing records from a WepyHDF5 object you can automatically generate pandas.DataFrames from the records, which will select a subset of the fields for a record group. This is because large arrays don't fit well into tables!

So you can define a subset of fields to be used as a nice “table” report that could be serialized to CSV. For instance in WExplore’s resampler record group we leave out the multidimensional 'image' field:

import numpy as np

RESAMPLER_FIELDS = ('branching_level', 'distance', 'new_leaf_id', 'image')
RESAMPLER_SHAPES = ((1,), (1,), Ellipsis, Ellipsis)
# np.int and np.float were removed from NumPy; use np.int_ and np.float64
RESAMPLER_DTYPES = (np.int_, np.float64, np.int_, None)

# fields that can be used for a table-like representation
RESAMPLER_RECORD_FIELDS = ('branching_level', 'distance', 'new_leaf_id')

Again, it's not necessary, but it's there to use.
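As a usage sketch, assuming a finished run stored in 'results.wepy.h5' and a dataframe accessor named resampler_records_dataframe (check your wepy version's WepyHDF5 API for the exact method names):

from wepy.hdf5 import WepyHDF5

# open the file read-only and build a table from the RECORD_FIELDS subset for run 0
wepy_h5 = WepyHDF5('results.wepy.h5', mode='r')
with wepy_h5:
    df = wepy_h5.resampler_records_dataframe([0])

print(df.head())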