wepy.hdf5 module

Primary wepy simulation database driver and access API using the HDF5 format.

The HDF5 Format Specification

As part of the wepy framework this module provides a fully-featured API for creating and accessing data generated in weighted ensemble simulations run with wepy.

The need for a special purpose format is manifold, but the primary reason is the nonlinear branching structure of walker trajectories coupled with weights.

That is, in standard simulations the data is organized as independent linear trajectories of frames, each related linearly to the one before it and after it.

In weighted ensemble, due to the resampling (i.e. cloning and merging) of walkers, a single frame may have multiple ‘child’ frames.

This is the primary motivation for this format.

However, in practice it solves several other issues and itself is a more general and flexible format than for just weighted ensemble simulations.

Concretely the WepyHDF5 format is simply an informally described schema that is commensurable with the HDF5 constructs of hierarchical groups (similar to unix filesystem directories) arranged as a tree with datasets as the leaves.

The hierarchy is fairly deep and so we will progress downwards from the top and describe each broad section in turn breaking it down when necessary.

Runs

The meat of the format is contained within the runs group:

  • runs

    • 0

    • 1

    • 2

Under the runs group are a series of groups for each run. Runs are named according to the order in which they were added to the database.

Within a run (say ‘0’ from above) we have a number of items:

  • 0

    • init_walkers

    • trajectories

    • decision

    • resampling

    • resampler

    • warping

    • progress

    • boundary_conditions

Trajectories

The ‘trajectories’ group is where the data for the frames of the walker trajectories is stored.

Even though the tree-like trajectories of weighted ensemble data may be well suited to having a tree-like storage topology we have opted to use something more familiar to the field, and have used a collection of linear “trajectories”.

This way of breaking up the trajectory data coupled with proper records of resampling (see below) allows for the imposition of a tree structure without committing to that as the data storage topology.

This allows the WepyHDF5 format to be easily used as a container format for collections of linear trajectories. While this is not supported in any real capacity it is one small step to convergence. We feel that a format that contains multiple trajectories is important for situations like weighted ensemble where trajectories are interdependent. The transition to a storage format like HDF5 however opens up many possibilities for new features for trajectories that have not occurred despite several attempts to forge new formats based on HDF5 (TODO: get references right; see work in mdtraj and MDHDF5).

Perhaps these formats have not caught on because the existing formats (e.g. XTC, DCD) for simple linear trajectories are good enough and there is little motivation to migrate.

However, the WepyHDF5 format (and the related sub-formats described below, e.g. record groups and the trajectory format) covers with ease both a new use case that cannot be achieved with the old formats and the old use cases.

Once users see the power of using a format like HDF5 from using wepy they may continue to use it for simpler simulations.

In any case the ‘trajectories’ in the group for weighted ensemble simulations should be thought of only as containers and not literally as trajectories. That is, frame 4 does not necessarily follow from frame 3. One may think of them more as “lanes” or “slots” for trajectory data that needs to be stitched together with the appropriate resampling records.

The routines and methods for generating contiguous trajectories from the data in WepyHDF5 are given through the ‘analysis’ module, which generates “traces” through the dataset.
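To make the idea of stitching slots together concrete, here is a minimal sketch that follows one walker's lineage backwards through the slots. It does not use the wepy ‘analysis’ module; the function `trace_back` and the `parents` table are hypothetical, simplified stand-ins for what the resampling records encode.

```python
def trace_back(parents, final_slot):
    """Follow a walker's lineage backwards through the slots.

    parents[c][s] is the slot in cycle c that the walker occupying
    slot s in cycle c + 1 descends from (a hypothetical, simplified
    stand-in for the resampling records).

    Returns a list of (cycle_idx, slot_idx) pairs ordered from the
    first cycle to the last.
    """
    n_cycles = len(parents) + 1
    slot = final_slot
    trace = [(n_cycles - 1, slot)]
    # Walk from the second-to-last cycle down to cycle 0.
    for cycle_idx in range(n_cycles - 2, -1, -1):
        slot = parents[cycle_idx][slot]
        trace.append((cycle_idx, slot))
    trace.reverse()
    return trace


# Three walkers, three cycles. After cycle 0 the walker in slot 0 is
# cloned into slot 1 (the walker previously in slot 1 was merged away).
parents = [
    [0, 0, 2],  # cycle 0 -> cycle 1
    [0, 1, 2],  # cycle 1 -> cycle 2
]
print(trace_back(parents, 1))  # [(0, 0), (1, 1), (2, 1)]
```

The resulting list of (cycle, slot) pairs shows that the contiguous history of the walker ending in slot 1 starts in slot 0, which is why slot 1 cannot be read as a trajectory on its own.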

With this in mind we will describe the sub-format of a trajectory now.

The ‘trajectories’ group is similar to the ‘runs’ group in that it has sub-groups whose names are numbers. These numbers, however, do not reflect the order in which the sub-groups were created; each is the index of a trajectory, and the trajectories are typically laid out all at once.

For a wepy simulation with a constant number of walkers you will only ever need as many trajectories/slots as there are walkers. So if you have 8 walkers then you will have trajectories 0 through 7. Concretely:

  • runs

    • 0

      • trajectories

        • 0

        • 1

        • 2

        • 3

        • 4

        • 5

        • 6

        • 7
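As a small sketch of this layout, the following builds the group paths of the trajectory slots for one run. The helper `trajectory_group_paths` is hypothetical and not part of wepy; only the group names ‘runs’ and ‘trajectories’ are taken from the format described here.

```python
def trajectory_group_paths(run_idx, n_walkers):
    """Return the group path of every trajectory slot in a run."""
    return ["runs/{}/trajectories/{}".format(run_idx, traj_idx)
            for traj_idx in range(n_walkers)]


paths = trajectory_group_paths(0, 8)
print(paths[0])   # runs/0/trajectories/0
print(paths[-1])  # runs/0/trajectories/7
```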

If we look at trajectory 0 we might see the following groups within:

  • positions

  • box_vectors

  • velocities

  • weights

This is what you would expect for a constant pressure molecular dynamics simulation, where you have the positions of the atoms, the box size, and the velocities of the atoms.

The particulars of which “fields” a trajectory has in general are not important, but this common use-case is directly supported in the WepyHDF5 format.

In any such simulation, however, the ‘weights’ field will appear since this is the weight of the walker of this frame and is a value important to weighted ensemble and not the underlying dynamics.

The naive approach to these fields is that each is a dataset of dimension (n_frames, feature_vector_shape[0], …) where the first dimension is the cycle_idx and the rest of the dimensions are determined by the atomic feature vector for each field for a single frame.

For example, the positions for a molecular simulation with 100 atoms with x, y, and z coordinates that ran for 1000 cycles would be a dataset of the shape (1000, 100, 3). Similarly the box vectors would be (1000, 3, 3) and the weights would be (1000, 1).
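The shape rule in this example can be written down in a few lines. This is only an illustrative sketch; `field_dataset_shape` is a hypothetical helper, not a wepy function.

```python
def field_dataset_shape(n_cycles, feature_shape):
    """Shape of a non-sparse field dataset: one feature vector per cycle."""
    return (n_cycles,) + tuple(feature_shape)


print(field_dataset_shape(1000, (100, 3)))  # (1000, 100, 3)  positions
print(field_dataset_shape(1000, (3, 3)))    # (1000, 3, 3)    box_vectors
print(field_dataset_shape(1000, (1,)))      # (1000, 1)       weights
```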

This uniformity vastly simplifies accessing and adding new variables, and requires that individual state values in walkers always be arrays with shapes, even when they are single values (e.g. energy). The exception is the weight, which is handled separately.

However, this situation is actually more complex to allow for special features.

First of all is the presence of compound fields which allow nesting of multiple groups.

The above “trajectory fields” would have identifiers such as the literal strings ‘positions’ and ‘box_vectors’, while a compound field would have an identifier ‘observables/rmsd’ or ‘alt_reps/binding_site’.

Use of trajectory field names using the ‘/’ path separator will automatically make a field a group and the last element of the field name the dataset. So for the observables example we might have:

  • 0

    • observables

      • rmsd

      • sasa

Where the rmsd would be accessed as a trajectory field of trajectory 0 as ‘observables/rmsd’ and the solvent accessible surface area as ‘observables/sasa’.
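The way a ‘/’ in a field name turns into nested groups can be sketched with plain dictionaries standing in for HDF5 groups. The function `set_field` and the placeholder dataset strings are hypothetical and only illustrate the naming rule, not how wepy writes the data.

```python
def set_field(traj_grp, field_path, dataset):
    """Store dataset under field_path, creating a nested group for
    every path element before the last one."""
    *group_names, field_name = field_path.split("/")
    grp = traj_grp
    for name in group_names:
        grp = grp.setdefault(name, {})
    grp[field_name] = dataset


traj_0 = {}  # a dict standing in for the group of trajectory 0
set_field(traj_0, "positions", "<positions dataset>")
set_field(traj_0, "observables/rmsd", "<rmsd dataset>")
set_field(traj_0, "observables/sasa", "<sasa dataset>")

print(sorted(traj_0))                 # ['observables', 'positions']
print(sorted(traj_0["observables"]))  # ['rmsd', 'sasa']
```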

This example introduces how the WepyHDF5 format is not only useful for storing data produced by simulation but also in the analysis of that data and computation of by-frame quantities.

The ‘observables’ compound group key prefix is special and will be used in the ‘compute_observables’ method.

The other special compound group key prefix is ‘alt_reps’, which is used for particle simulations to store “alternate representations” of the positions. This is useful in cooperation with the next feature of wepy trajectory fields to allow for more economical storage of data.

The next feature (and complication of the format) is the allowance for sparse fields. As the fields were introduced we said that they should have as many feature vectors as there are frames for the simulation. In the example however, you will notice that storing both the full atomic positions and velocities for a long simulation requires a heavy storage burden.

So perhaps you only want to store the velocities (or forces) every 100 frames so that you are able to restart a simulation from midway through. This is achieved through sparse fields.

A sparse field is no longer a dataset but a group with two items:

  • _sparse_idxs

  • data

The ‘_sparse_idxs’ dataset is simply a dataset of integers that assigns each element of the ‘data’ dataset to a frame index. Using the above example, if we run a simulation for 1000 frames with 100 atoms and save the velocities every 100 frames, we would have a ‘velocities/data’ dataset of shape (10, 100, 3), which is 100 times less data than if the velocities were saved every frame.

While this complicates the storage format, use of the proper API methods should make it transparent whether you are retrieving a sparse field or not.
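The pairing of ‘_sparse_idxs’ with ‘data’ can be sketched as follows, with plain lists standing in for the two datasets. Both helper functions are hypothetical, and the choice of saving frames 0, 100, 200, ... is an assumption made for the example.

```python
def sparse_frame_idxs(n_frames, stride):
    """Frame indices that get saved when storing every stride-th frame."""
    return list(range(0, n_frames, stride))


def get_sparse_frame(sparse_idxs, data, frame_idx):
    """Return the stored value for frame_idx, or None if that frame
    was not saved."""
    try:
        return data[sparse_idxs.index(frame_idx)]
    except ValueError:
        return None


sparse_idxs = sparse_frame_idxs(1000, 100)
# Placeholder strings stand in for the (100, 3) velocity arrays.
data = ["velocities@{}".format(i) for i in sparse_idxs]

print(len(sparse_idxs))                          # 10
print(get_sparse_frame(sparse_idxs, data, 300))  # velocities@300
print(get_sparse_frame(sparse_idxs, data, 301))  # None
```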

As alluded to above, sparse fields can be used for more than just accessory fields. In many simulations, such as fully atomistic simulations of proteins in solvent, we often don’t care about the dynamics of most of the atoms in the simulation and so would like to not have to save them.

The ‘alt_reps’ compound field is meant to solve this. For example, the WepyHDF5Reporter supports a special option to save only a subset of the atoms in the main ‘positions’ field but also to save the full atomic system as an alternate representation, which is the field name ‘alt_reps/all_atoms’. This way you can still save the full system every once in a while but be economical in what positions you save every single frame.
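A rough sketch of this saving strategy is given below. It is not the WepyHDF5Reporter implementation; the function `frame_fields` and its `all_atoms_stride` parameter are hypothetical, and only the field names ‘positions’ and ‘alt_reps/all_atoms’ come from the description above.

```python
def frame_fields(full_positions, main_rep_idxs, cycle_idx, all_atoms_stride):
    """Choose what to save for one frame.

    The main 'positions' field always holds the subset of atoms in
    main_rep_idxs; the full system is added as a sparse alternate
    representation every all_atoms_stride cycles (an assumed policy).
    """
    fields = {"positions": [full_positions[i] for i in main_rep_idxs]}
    if cycle_idx % all_atoms_stride == 0:
        fields["alt_reps/all_atoms"] = list(full_positions)
    return fields


full_positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
                  (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
main_rep_idxs = [0, 2]

print(sorted(frame_fields(full_positions, main_rep_idxs, 0, 100)))
# ['alt_reps/all_atoms', 'positions']
print(sorted(frame_fields(full_positions, main_rep_idxs, 1, 100)))
# ['positions']
```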

Note that there really isn’t a way to achieve this with other formats. You either make a completely new trajectory with only the atoms of interest and now you are duplicating those in two places, or you duplicate and then filter your full system’s trajectory file and rely on some sort of index to always live with it in the filesystem, which is a very precarious scenario. The situation is particularly hopeless for weighted ensemble trajectories.

Init Walkers

The data stored in the ‘trajectories’ section is the data that is returned after running dynamics in a cycle. Since we view the WepyHDF5 as a completely self-contained format for simulations it seems negligent to rely on outside sources (such as the filesystem) for the initial structures that seeded the simulations. These states (and weights) can be stored in this group.

The format of this group is identical to the one for trajectories except that there is only one frame for each slot and so the shape of the datasets for each field is just the shape of the feature vector.

Record Groups

TODO: add reference to reference groups

The last five items are what are called ‘record groups’ and all follow the same format.

Each record group itself contains a number of datasets, where the names of the datasets correspond to the ‘field names’ from the record group specification. So each record group is simply a key-value store where the values must be datasets.

For instance the fields in the ‘resampling’ (which is particularly important as it encodes the branching structure) record group for a WExplore resampler simulation are:

  • step_idx

  • walker_idx

  • decision_id

  • target_idxs

  • region_assignment

Where the ‘step_idx’ is an integer specifying which step of resampling within the cycle the resampling action took place (the cycle index is metadata for the group). The ‘walker_idx’ is the index of the walker that this action was assigned to. The ‘decision_id’ is an integer that is related to an enumeration of decision types that encodes which discrete action is to be taken for this resampling event (the enumeration is in the ‘decision’ item of the run groups). The ‘target_idxs’ is a variable length 1-D array of integers which assigns the results of the action to specific target ‘slots’ (which was discussed for the ‘trajectories’ run group). And the ‘region_assignment’ is specific to WExplore which reports on which region the walker was in at that time, and is a variable length 1-D array of integers.

Additionally, record groups are broken into two types:

  • continual

  • sporadic

Continual records occur once per cycle and so there is no extra indexing necessary.

Sporadic records can happen multiple or zero times per cycle and so require a special index for them which is contained in the extra dataset ‘_cycle_idxs’.
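The role of the ‘_cycle_idxs’ index can be sketched by grouping the rows of one sporadic field by cycle. Plain lists stand in for the datasets, and `records_by_cycle` is a hypothetical helper rather than a wepy method.

```python
from collections import defaultdict


def records_by_cycle(cycle_idxs, values):
    """Group the rows of a sporadic record field by their cycle index."""
    grouped = defaultdict(list)
    for cycle_idx, value in zip(cycle_idxs, values):
        grouped[cycle_idx].append(value)
    return dict(grouped)


# Two records in cycle 0, none in cycle 1, three in cycle 2.
cycle_idxs = [0, 0, 2, 2, 2]
walker_idx = [1, 4, 0, 3, 5]

grouped = records_by_cycle(cycle_idxs, walker_idx)
print(grouped)             # {0: [1, 4], 2: [0, 3, 5]}
print(grouped.get(1, []))  # []
```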

It is worth noting that the underlying methods for each record group are general. So while these are the official wepy record groups that are supported, if there is a use-case that demands a new record group, adding one is a fairly straightforward task from a developer’s perspective.

wepy.hdf5.TOPOLOGY = 'topology'

Default header apparatus dataset. The molecular topology dataset.

wepy.hdf5.SETTINGS = '_settings'

Name of the settings group in the header group.

wepy.hdf5.RUNS = 'runs'

The group name for runs.

wepy.hdf5.RUN_IDX = 'run_idx'

Metadata field for run groups for the run index within this file.

wepy.hdf5.RUN_START_SNAPSHOT_HASH = 'start_snapshot_hash'

Metadata field for a run that corresponds to the hash of the starting simulation snapshot in orchestration.

wepy.hdf5.RUN_END_SNAPSHOT_HASH = 'end_snapshot_hash'

Metadata field for a run that corresponds to the hash of the ending simulation snapshot in orchestration.

wepy.hdf5.TRAJ_IDX = 'traj_idx'

Metadata field for trajectory groups for the trajectory index in that run.

wepy.hdf5.CYCLE_IDX = 'cycle_idx'

String for setting the names of cycle indices in records and miscellaneous situations.

wepy.hdf5.SPARSE_FIELDS = 'sparse_fields'

Settings field name for sparse field trajectory field flags.

wepy.hdf5.N_ATOMS = 'n_atoms'

Settings field name group for the number of atoms in the default positions field.

wepy.hdf5.N_DIMS_STR = 'n_dims'

Settings field name for positions field spatial dimensions.

wepy.hdf5.MAIN_REP_IDXS = 'main_rep_idxs'

Settings field name for the indices of the full apparatus topology in the default positions trajectory field.

wepy.hdf5.ALT_REPS_IDXS = 'alt_reps_idxs'

Settings field name for the different ‘alt_reps’. The indices of the atoms from the full apparatus topology for each.

wepy.hdf5.FIELD_FEATURE_SHAPES_STR = 'field_feature_shapes'

Settings field name for the trajectory field shapes.

wepy.hdf5.FIELD_FEATURE_DTYPES_STR = 'field_feature_dtypes'

Settings field name for the trajectory field data types.

wepy.hdf5.UNITS = 'units'

Settings field name for the units of the trajectory fields.

wepy.hdf5.RECORD_FIELDS = 'record_fields'

Settings field name for the record fields that are to be included in the truncated listing of record group fields.

wepy.hdf5.CONTINUATIONS = 'continuations'

Settings field name for the continuations relationships between runs.

wepy.hdf5.TRAJECTORIES = 'trajectories'

Run field name for the trajectories group.

wepy.hdf5.INIT_WALKERS = 'init_walkers'

Run field name for the initial walkers group.

wepy.hdf5.DECISION = 'decision'

Run field name for the decision enumeration group.

wepy.hdf5.RESAMPLING = 'resampling'

Record group run field name for the resampling records.

wepy.hdf5.RESAMPLER = 'resampler'

Record group run field name for the resampler records.

wepy.hdf5.WARPING = 'warping'

Record group run field name for the warping records.

wepy.hdf5.PROGRESS = 'progress'

Record group run field name for the progress records.

wepy.hdf5.BC = 'boundary_conditions'

Record group run field name for the boundary conditions records.

wepy.hdf5.NONE_STR = 'None'

String signifying a field of unspecified shape. Used for serializing the None python object.

wepy.hdf5.CYCLE_IDXS = '_cycle_idxs'

Group name for the cycle indices of sporadic records.

wepy.hdf5.SPORADIC_RECORDS = ('resampler', 'warping', 'resampling', 'boundary_conditions')

Enumeration of the record groups that are sporadic.

wepy.hdf5.N_DIMS = 3

Number of dimensions for the default positions.

wepy.hdf5.WEIGHTS = 'weights'

The field name for the frame weights.

wepy.hdf5.POSITIONS = 'positions'

The field name for the default positions.

wepy.hdf5.BOX_VECTORS = 'box_vectors'

The field name for the default box vectors.

wepy.hdf5.VELOCITIES = 'velocities'

The field name for the default velocities.

wepy.hdf5.FORCES = 'forces'

The field name for the default forces.

wepy.hdf5.TIME = 'time'

The field name for the default time.

wepy.hdf5.KINETIC_ENERGY = 'kinetic_energy'

The field name for the default kinetic energy.

wepy.hdf5.POTENTIAL_ENERGY = 'potential_energy'

The field name for the default potential energy.

wepy.hdf5.BOX_VOLUME = 'box_volume'

The field name for the default box volume.

wepy.hdf5.PARAMETERS = 'parameters'

The field name for the default parameters.

wepy.hdf5.PARAMETER_DERIVATIVES = 'parameter_derivatives'

The field name for the default parameter derivatives.

wepy.hdf5.ALT_REPS = 'alt_reps'

The field name for the default compound field alternate representations.

wepy.hdf5.OBSERVABLES = 'observables'

The field name for the default compound field observables.

wepy.hdf5.WEIGHT_SHAPE = (1,)

Weights feature vector shape.

wepy.hdf5.WEIGHT_DTYPE

alias of float

wepy.hdf5.FIELD_FEATURE_SHAPES = (('time', (1,)), ('box_vectors', (3, 3)), ('box_volume', (1,)), ('kinetic_energy', (1,)), ('potential_energy', (1,)))

Default shapes for the default fields.

wepy.hdf5.FIELD_FEATURE_DTYPES = (('positions', <class 'float'>), ('velocities', <class 'float'>), ('forces', <class 'float'>), ('time', <class 'float'>), ('box_vectors', <class 'float'>), ('box_volume', <class 'float'>), ('kinetic_energy', <class 'float'>), ('potential_energy', <class 'float'>))

Default data types for the default fields.

wepy.hdf5.POSITIONS_LIKE_FIELDS = ('velocities', 'forces')

Default trajectory fields which are the same shape as the main positions field.
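One way these defaults could combine is sketched below: the positions-like fields take the shape of the positions field, which depends on the number of atoms and spatial dimensions. The constants are copied from the listing in this module; the function `default_feature_shapes` is hypothetical and wepy's own handling may differ.

```python
N_DIMS = 3
FIELD_FEATURE_SHAPES = (('time', (1,)), ('box_vectors', (3, 3)),
                        ('box_volume', (1,)), ('kinetic_energy', (1,)),
                        ('potential_energy', (1,)))
POSITIONS_LIKE_FIELDS = ('velocities', 'forces')


def default_feature_shapes(n_atoms, n_dims=N_DIMS):
    """Build a field -> feature shape mapping (illustrative only)."""
    shapes = dict(FIELD_FEATURE_SHAPES)
    positions_shape = (n_atoms, n_dims)
    shapes['positions'] = positions_shape
    # Positions-like fields share the shape of the positions field.
    for field in POSITIONS_LIKE_FIELDS:
        shapes[field] = positions_shape
    return shapes


shapes = default_feature_shapes(100)
print(shapes['positions'])    # (100, 3)
print(shapes['velocities'])   # (100, 3)
print(shapes['box_vectors'])  # (3, 3)
```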

wepy.hdf5.DATA = 'data'

Name of the dataset in sparse trajectory fields.

wepy.hdf5.SPARSE_IDXS = '_sparse_idxs'

Name of the dataset that indexes sparse trajectory fields.

wepy.hdf5._iter_field_paths(grp)[source]

Return all subgroup field name paths from a group.

Useful for compound fields. For example if you have the group observables with multiple subfields:

  • observables

    • rmsd

    • sasa

Passing the h5py group ‘observables’ will return the full field names for each subfield:

  • ‘observables/rmsd’

  • ‘observables/sasa’

Parameters

grp (h5py.Group) – The group to enumerate subfield names for.

Returns

subfield_names – The full names for the subfields of the group.

Return type

list of str
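The enumeration can be sketched with nested dictionaries standing in for h5py groups. This `iter_field_paths` is a hypothetical re-implementation for illustration and does not use h5py or reproduce the actual function.

```python
def iter_field_paths(grp, prefix=""):
    """Return the full '/'-separated path of every leaf in a nested dict."""
    paths = []
    for name, item in grp.items():
        path = prefix + name
        if isinstance(item, dict):
            # A nested dict plays the role of a subgroup.
            paths.extend(iter_field_paths(item, prefix=path + "/"))
        else:
            paths.append(path)
    return paths


traj = {"observables": {"rmsd": "<dataset>", "sasa": "<dataset>"}}
print(iter_field_paths(traj))  # ['observables/rmsd', 'observables/sasa']
```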

class wepy.hdf5.WepyHDF5(filename, mode='x', topology=None, units=None, sparse_fields=None, feature_shapes=None, feature_dtypes=None, n_dims=None, alt_reps=None, main_rep_idxs=None, swmr_mode=False, expert_mode=False)[source]

Bases: object

Wrapper for h5py interface to an HDF5 file object for creation and access of WepyHDF5 data.

This is the primary implementation of the API for creating, accessing, and modifying data in an HDF5 file that conforms to the WepyHDF5 specification.

Constructor for the WepyHDF5 class.

Initialize a new Wepy HDF5 file. This will create an h5py.File object.

The File will be closed after construction by default.

mode:

  • r – Readonly, file must exist

  • r+ – Read/write, file must exist

  • w – Create file, truncate if exists

  • x or w- – Create file, fail if exists

  • a – Read/write if exists, create otherwise

Parameters
  • filename (str) – File path

  • mode (str) – Mode specification for opening the HDF5 file.

  • topology (str) – JSON string representing topology of system being simulated.

  • units (dict of str : str, optional) – Mapping of trajectory field names to string specs for units.

  • sparse_fields (list of str, optional) – List of trajectory fields that should be initialized as sparse.

  • feature_shapes (dict of str : shape_spec, optional) – Mapping of trajectory fields to their shape spec for initialization.

  • feature_dtypes (dict of str : dtype_spec, optional) – Mapping of trajectory fields to their dtype spec for initialization.

  • n_dims (int, default: 3) – Set the number of spatial dimensions for the default positions trajectory field.

  • alt_reps (dict of str : list of int, optional) – Specifies that there will be ‘alt_reps’ of positions each named by the keys of this mapping and containing the indices in each value list.

  • main_rep_idxs (list of int, optional) – The indices of atom positions to save as the main ‘positions’ trajectory field. Defaults to all atoms.

  • expert_mode (bool) – If True no initialization is performed other than the setting of the filename. Useful mainly for debugging.

Warns

If initialization data was given but the file was opened in a read mode.

MODES = ('r', 'r+', 'w', 'w-', 'x', 'a')

The recognized modes for opening the WepyHDF5 file.
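As a rough illustration of how the mode strings split into read-only and writable modes, consider the sketch below. The two tuples mirror the MODES and WRITE_MODES attributes of this class; the function `check_mode` is hypothetical.

```python
MODES = ('r', 'r+', 'w', 'w-', 'x', 'a')
WRITE_MODES = ('r+', 'w', 'w-', 'x', 'a')


def check_mode(mode):
    """Return True if mode permits writing; reject unknown modes."""
    if mode not in MODES:
        raise ValueError("Unrecognized mode: {!r}".format(mode))
    return mode in WRITE_MODES


print(check_mode('r'))   # False
print(check_mode('r+'))  # True
print(check_mode('x'))   # True
```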

WRITE_MODES = ('r+', 'w', 'w-', 'x', 'a')

property swmr_mode

_create_init()[source]

Creation mode constructor.

Completely overwrite the data in the file. Reinitialize the values and set with the new ones if given.

_read_write_init()[source]

Read-write mode constructor.

_add_init()[source]

The addition mode constructor.

Create the dataset if it doesn’t exist and put it in r+ mode, otherwise, just open in r+ mode.

_read_init()[source]

Read mode constructor.

_set_default_init_field_attributes(n_dims=None)[source]

Sets the feature_shapes and feature_dtypes to be the default for this module. These will be used to initialize field datasets when none are given during construction (i.e. for sparse values).

Parameters

n_dims (int) –

_get_field_path_grp(run_idx, traj_idx, field_path)[source]

Given a field path for the trajectory returns the group the field’s dataset goes in and the key for the field name in that group.

The field path for a simple field is just the name of the field and for a compound field it is the compound field group name with the subfield separated by a ‘/’ like ‘observables/observable1’ where ‘observables’ is the compound field group and ‘observable1’ is the subfield name.

Parameters
  • run_idx (int) –

  • traj_idx (int) –

  • field_path (str) –

Returns

  • group (h5py.Group)

  • field_name (str)

_init_continuations()[source]

This will either create a dataset in the settings for the continuations or if continuations already exist it will reinitialize them and delete the data that exists there.

Returns

continuation_dset

Return type

h5py.Dataset

_add_run_init(run_idx, continue_run=None)[source]

Routines for creating a run, which include updating and setting object global variables and increasing the counter for the number of runs.

Parameters
  • run_idx (int) –

  • continue_run (int) – Index of the run to continue.

_add_init_walkers(init_walkers_grp, init_walkers)[source]

Adds the run field group for the initial walkers.

Parameters
  • init_walkers_grp (h5py.Group) – The group to add the walker data to.

  • init_walkers (list of objects implementing the Walker interface) – The walkers to save in the group

_init_run_sporadic_record_grp(run_idx, run_record_key, fields)[source]

Initialize a sporadic record group for a run.

Parameters
  • run_idx (int) –

  • run_record_key (str) – The record group name.

  • fields (list of field specs) – Each field spec is a 3-tuple of (field_name : str, field_shape : shape_spec, field_dtype : dtype_spec)

Returns

record_group

Return type

h5py.Group

_init_run_continual_record_grp(run_idx, run_record_key, fields)[source]

Initialize a continual record group for a run.

Parameters
  • run_idx (int) –

  • run_record_key (str) – The record group name.

  • fields (list of field specs) – Each field spec is a 3-tuple of (field_name : str, field_shape : shape_spec, field_dtype : dtype_spec)

Returns

record_group

Return type

h5py.Group

_init_run_records_field(run_idx, run_record_key, field_name, field_shape, field_dtype)[source]

Initialize a single field for a run record group.

Parameters
  • run_idx (int) –

  • run_record_key (str) – The name of the record group.

  • field_name (str) – The name of the field in the record group.

  • field_shape (tuple of int) – The shape of the dataset for the field.

  • field_dtype (dtype_spec) – An h5py recognized data type.

Returns

dataset

Return type

h5py.Dataset

_is_sporadic_records(run_record_key)[source]

Tests whether a record group is sporadic or not.

Parameters

run_record_key (str) – Record group name.

Returns

is_sporadic – True if the record group is sporadic False if not.

Return type

bool

_init_traj_field(run_idx, traj_idx, field_path, feature_shape, dtype)[source]

Initialize a trajectory field.

Initialize a data field in the trajectory to be empty but resizeable.

Parameters
  • run_idx (int) –

  • traj_idx (int) –

  • field_path (str) – Field name specification.

  • feature_shape (shape_spec) – Specification of shape of a feature vector of the field.

  • dtype (dtype_spec) – Specification of the feature vector datatype.

_init_contiguous_traj_field(run_idx, traj_idx, field_path, shape, dtype)[source]

Initialize a contiguous (non-sparse) trajectory field.

Parameters
  • run_idx (int) –

  • traj_idx (int) –

  • field_path (str) – Field name specification.

  • feature_shape (tuple of int) – Shape of the feature vector of the field.

  • dtype (dtype_spec) – H5py recognized datatype

_init_sparse_traj_field(run_idx, traj_idx, field_path, shape, dtype)[source]

Initialize a sparse trajectory field.

Parameters
  • run_idx (int) –

  • traj_idx (int) –

  • field_path (str) – Field name specification.

  • feature_shape (shape_spec) – Specification for the shape of the feature.

  • dtype (dtype_spec) – Specification for the dtype of the feature.

_init_traj_fields(run_idx, traj_idx, field_paths, field_feature_shapes, field_feature_dtypes)[source]

Initialize a number of fields for a trajectory.

Parameters
  • run_idx (int) –

  • traj_idx (int) –

  • field_paths (list of str) – List of field names.

  • field_feature_shapes (list of shape_specs) –

  • field_feature_dtypes (list of dtype_specs) –

_add_traj_field_data(run_idx, traj_idx, field_path, field_data, sparse_idxs=None)[source]

Add a trajectory field to a trajectory.

If the sparse indices are given the field will be created as a sparse field otherwise a normal one.

Parameters
  • run_idx (int) –

  • traj_idx (int) –

  • field_path (str) – Field name.

  • field_data (numpy.array) – The data array to set for the field.

  • sparse_idxs (arraylike of int of shape (1,)) – List of cycle indices that the data corresponds to.

_extend_contiguous_traj_field(run_idx, traj_idx, field_path, field_data)[source]

Add multiple new frames worth of data to the end of an existing contiguous (non-sparse) trajectory field.

Parameters
  • run_idx (int) –

  • traj_idx (int) –

  • field_path (str) – Field name

  • field_data (numpy.array) – The frames of data to add.

_extend_sparse_traj_field(run_idx, traj_idx, field_path, values, sparse_idxs)[source]

Add multiple new frames worth of data to the end of an existing sparse trajectory field.

Parameters
  • run_idx (int) –

  • traj_idx (int) –

  • field_path (str) – Field name

  • values (numpy.array) – The frames of data to add.

  • sparse_idxs (list of int) – The cycle indices the values correspond to.

_add_sparse_field_flag(field_path)[source]

Register a trajectory field as sparse in the header settings.

Parameters

field_path (str) – Name of the trajectory field you want to flag as sparse

_add_field_feature_shape(field_path, field_feature_shape)[source]

Add the shape to the header settings for a trajectory field.

Parameters
  • field_path (str) – The name of the trajectory field you want to set for.

  • field_feature_shape (shape_spec) – The shape spec to serialize as a dataset.

_add_field_feature_dtype(field_path, field_feature_dtype)[source]

Add the data type to the header settings for a trajectory field.

Parameters
  • field_path (str) – The name of the trajectory field you want to set for.

  • field_feature_dtype (dtype_spec) – The dtype spec to serialize as a dataset.

_set_field_feature_shape(field_path, field_feature_shape)[source]

Add the trajectory field shape to header settings or set the value.

Parameters
  • field_path (str) – The name of the trajectory field you want to set for.

  • field_feature_shape (shape_spec) – The shape spec to serialize as a dataset.

_set_field_feature_dtype(field_path, field_feature_dtype)[source]

Add the trajectory field dtype to header settings or set the value.

Parameters
  • field_path (str) – The name of the trajectory field you want to set for.

  • field_feature_dtype (dtype_spec) – The dtype spec to serialize as a dataset.

_extend_run_record_data_field(run_idx, run_record_key, field_name, field_data)[source]

Primitive record append method.

Adds data for a single field dataset in a run records group. This is done without paying attention to whether it is sporadic or continual and is supposed to be only the data write method.

Parameters
  • run_idx (int) –

  • run_record_key (str) – Name of the record group.

  • field_name (str) – Name of the field in the record group to add to.

  • field_data (arraylike) – The data to add to the field.

_run_record_namedtuple(run_record_key)[source]

Generate a namedtuple record type for a record group.

The class name will be formatted like ‘{}_Record’ where the {} will be replaced with the name of the record group.

Parameters

run_record_key (str) – Name of the record group

Returns

RecordType – The record type to generate records for this record group.

Return type

namedtuple
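The naming convention can be sketched with the standard library. Unlike the method documented here, this hypothetical `run_record_namedtuple` takes the field names explicitly instead of reading them from the file, and the record values are made up for illustration.

```python
from collections import namedtuple


def run_record_namedtuple(run_record_key, field_names):
    """Create a record type named '<record group>_Record'."""
    return namedtuple("{}_Record".format(run_record_key), field_names)


ResamplingRecord = run_record_namedtuple(
    "resampling",
    ["step_idx", "walker_idx", "decision_id", "target_idxs"])

# Illustrative values only; decision ids depend on the resampler.
rec = ResamplingRecord(step_idx=0, walker_idx=3,
                       decision_id=2, target_idxs=(3, 5))
print(type(rec).__name__)  # resampling_Record
print(rec.target_idxs)     # (3, 5)
```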

_convert_record_field_to_table_column(run_idx, run_record_key, record_field)[source]

Converts a dataset of feature vectors to more palatable values for use in external datasets.

For single value feature vectors it unwraps them into single values.

For 1-D feature vectors it casts them as tuples.

Anything of higher rank will raise an error.

Parameters
  • run_idx (int) –

  • run_record_key (str) – Name of the record group

  • record_field (str) – Name of the field of the record group

Returns

record_dset – Table-ified values

Return type

list

Raises

TypeError – If the field feature vector shape rank is greater than 1.
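The three cases described for this conversion can be sketched on plain nested lists. The function `to_table_column` is a hypothetical stand-in that works on Python lists rather than HDF5 datasets.

```python
def to_table_column(record_values):
    """Convert a list of feature vectors into table-friendly values."""
    column = []
    for feature in record_values:
        # A nested sequence means the feature vector has rank > 1.
        if any(isinstance(x, (list, tuple)) for x in feature):
            raise TypeError(
                "feature vectors of rank greater than 1 are not supported")
        if len(feature) == 1:
            column.append(feature[0])      # unwrap single values
        else:
            column.append(tuple(feature))  # cast 1-D vectors to tuples
    return column


print(to_table_column([[0], [1], [2]]))      # [0, 1, 2]
print(to_table_column([[3, 5], [1, 2, 4]]))  # [(3, 5), (1, 2, 4)]
```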

_convert_record_fields_to_table_columns(run_idx, run_record_key)[source]

Convert record group data to truncated namedtuple records.

This uses the specified record fields from the header settings to choose which record group fields to apply this to.

Does no checking to make sure the fields are “table-ifiable”. If a field is not it will raise a TypeError.

Parameters
  • run_idx (int) –

  • run_record_key (str) – The name of the record group

Returns

table_fields – Mapping of the record group field to the table-ified values.

Return type

dict of str : list

_make_records(run_record_key, cycle_idxs, fields)[source]

Generate a list of proper (namedtuple) records for a record group.

Parameters
  • run_record_key (str) – Name of the record group

  • cycle_idxs (list of int) – The cycle indices you want to get records for.

  • fields (list of str) – The fields to make record entries for.

Returns

records

Return type

list of namedtuple objects

_run_records_sporadic(run_idxs, run_record_key)[source]

Generate records for a sporadic record group for a multi-run contig.

If multiple run indices are given assumes that these are a contig (e.g. the second run index is a continuation of the first and so on). This method is considered low-level and does no checking to make sure this is true.

The cycle indices of records from “continuation” runs will be modified so that the records will be indexed as if they are a single run.

Uses the record fields settings to decide which fields to use.

Parameters
  • run_idxs (list of int) – The indices of the runs in the order they are in the contig

  • run_record_key (str) – Name of the record group

Returns

records

Return type

list of namedtuple objects
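The re-indexing of cycle indices across a contig might look like the sketch below. It assumes the offset applied to each continuation run is the total number of cycles in the runs before it; that rule and the helper `contig_cycle_idxs` are assumptions, not taken from the wepy source.

```python
def contig_cycle_idxs(run_cycle_idxs, run_n_cycles):
    """Re-index per-run record cycle indices as if the runs were one run.

    run_cycle_idxs: one list of record cycle indices per run, in contig order.
    run_n_cycles:   the number of cycles in each run, in the same order.
    """
    offset = 0
    contig = []
    for cycle_idxs, n_cycles in zip(run_cycle_idxs, run_n_cycles):
        contig.extend(offset + c for c in cycle_idxs)
        offset += n_cycles  # the next run starts where this one ended
    return contig


# Run 0 has 5 cycles with records at cycles 0 and 3;
# run 1 continues it with records at its own cycles 1 and 2.
print(contig_cycle_idxs([[0, 3], [1, 2]], [5, 4]))  # [0, 3, 6, 7]
```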

_run_records_continual(run_idxs, run_record_key)[source]

Generate records for a continual record group for a multi-run contig.

If multiple run indices are given assumes that these are a contig (e.g. the second run index is a continuation of the first and so on). This method is considered low-level and does no checking to make sure this is true.

The cycle indices of records from “continuation” runs are shifted so that the records are indexed as if they came from a single run.

Uses the record fields settings to decide which fields to use.

Parameters
  • run_idxs (list of int) – The indices of the runs in the order they are in the contig

  • run_record_key (str) – Name of the record group

Returns

records

Return type

list of namedtuple objects

_get_contiguous_traj_field(run_idx, traj_idx, field_path, frames=None)[source]

Access actual data for a trajectory field.

Parameters
  • run_idx (int) –

  • traj_idx (int) –

  • field_path (str) – Trajectory field name to access

  • frames (list of int, optional) – The indices of the frames to return if you don’t want all of them.

Returns

field_data – The data requested for the field.

Return type

arraylike

_get_sparse_traj_field(run_idx, traj_idx, field_path, frames=None, masked=True)[source]

Access actual data for a trajectory field.

Parameters
  • run_idx (int) –

  • traj_idx (int) –

  • field_path (str) – Trajectory field name to access

  • frames (list of int, optional) – The indices of the frames to return if you don’t want all of them.

  • masked (bool) – If True returns the array data as numpy masked array, and only the available values if False.

Returns

field_data – The data requested for the field.

Return type

arraylike
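
The difference between the two settings of `masked` can be pictured with plain lists. This is only a sketch: the real method is described as returning a numpy masked array, while here a `(data, mask)` pair of lists stands in for it, and all names and values are hypothetical.

```python
# Hypothetical sparse field: values exist only for cycles 0, 2 and 5
# of a trajectory that has 6 frames in total.
n_frames = 6
sparse_cycle_idxs = [0, 2, 5]
sparse_values = [10.0, 12.0, 15.0]


def get_sparse_field(masked=True):
    """Return (data, mask) over all frames if masked, else only stored values."""
    if not masked:
        return list(sparse_values)
    data = [0.0] * n_frames   # filler values, hidden by the mask
    mask = [True] * n_frames  # True marks a frame with no stored value
    for cycle_idx, value in zip(sparse_cycle_idxs, sparse_values):
        data[cycle_idx] = value
        mask[cycle_idx] = False
    return data, mask


data, mask = get_sparse_field(masked=True)
compact = get_sparse_field(masked=False)
```

The mask follows the numpy convention in which True means the entry is missing.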

_add_run_field(run_idx, field_path, data, sparse_idxs=None, force=False)[source]

Add a trajectory field to all trajectories in a run.

By enforcing adding it to all trajectories at one time we promote in-run consistency.

Parameters
  • run_idx (int) –

  • field_path (str) – Name to set the trajectory field as. Can be compound.

  • data (arraylike of shape (n_trajectories, n_cycles, feature_vector_shape[0],..)) – The data for all trajectories to be added.

  • sparse_idxs (list of int) – If the data you are adding is sparse specify which cycles to apply them to.

  • force (bool) – If True, no checking for constraints will be done.

_add_field(field_path, data, sparse_idxs=None, force=False)[source]

Add a trajectory field to all runs in a file.

Parameters
  • field_path (str) – Name of trajectory field

  • data (list of arraylike) – Each element of this list corresponds to a single run. The elements of which are arraylikes of shape (n_trajectories, n_cycles, feature_vector_shape[0],…) for each run.

  • sparse_idxs (list of list of int) – The list of cycle indices to set for the sparse fields. If None, no trajectories are set as sparse.

property filename

The path to the underlying HDF5 file.

open(mode=None)[source]

Open the underlying HDF5 file for access.

Parameters

mode (str) – Valid mode spec. If given, opens the HDF5 file in this mode; otherwise uses the existing mode.

close()[source]

Close the underlying HDF5 file.

property mode

The WepyHDF5 mode this object was created with.

set_mode(mode)[source]

Set the mode for opening the file with.

property h5_mode

The h5py.File mode the HDF5 file currently has.

_set_h5_mode(h5_mode)[source]

Set the mode to open the HDF5 file with.

This really shouldn’t be set without using the main wepy mode as they need to be aligned.

property h5

The underlying h5py.File object.

run(run_idx)[source]

Get the h5py.Group for a run.

Parameters

run_idx (int) –

Returns

run_group

Return type

h5py.Group

traj(run_idx, traj_idx)[source]

Get an h5py.Group trajectory group.

Parameters
  • run_idx (int) –

  • traj_idx (int) –

Returns

traj_group

Return type

h5py.Group

run_trajs(run_idx)[source]

Get the trajectories group for a run.

Parameters

run_idx (int) –

Returns

trajectories_grp

Return type

h5py.Group

property runs

The runs group.

run_grp(run_idx)[source]

A group for a single run.

run_start_snapshot_hash(run_idx)[source]

Hash identifier for the starting snapshot of a run from orchestration.

run_end_snapshot_hash(run_idx)[source]

Hash identifier for the ending snapshot of a run from orchestration.

set_run_start_snapshot_hash(run_idx, snaphash)[source]

Set the starting snapshot hash identifier for a run from orchestration.

set_run_end_snapshot_hash(run_idx, snaphash)[source]

Set the ending snapshot hash identifier for a run from orchestration.

property settings_grp

The header settings group.

decision_grp(run_idx)[source]

Get the decision enumeration group for a run.

Parameters

run_idx (int) –

Returns

decision_grp

Return type

h5py.Group

init_walkers_grp(run_idx)[source]

Get the group for the initial walkers for a run.

Parameters

run_idx (int) –

Returns

init_walkers_grp

Return type

h5py.Group

records_grp(run_idx, run_record_key)[source]

Get a record group h5py.Group for a run.

Parameters
  • run_idx (int) –

  • run_record_key (str) – Name of the record group

Returns

run_record_group

Return type

h5py.Group

resampling_grp(run_idx)[source]

Get this record group for a run.

Parameters

run_idx (int) –

Returns

run_record_group

Return type

h5py.Group

resampler_grp(run_idx)[source]

Get this record group for a run.

Parameters

run_idx (int) –

Returns

run_record_group

Return type

h5py.Group

warping_grp(run_idx)[source]

Get this record group for a run.

Parameters

run_idx (int) –

Returns

run_record_group

Return type

h5py.Group

bc_grp(run_idx)[source]

Get this record group for a run.

Parameters

run_idx (int) –

Returns

run_record_group

Return type

h5py.Group

progress_grp(run_idx)[source]

Get this record group for a run.

Parameters

run_idx (int) –

Returns

run_record_group

Return type

h5py.Group

iter_runs(idxs=False, run_sel=None)[source]

Generator for iterating through the runs of a file.

Parameters
  • idxs (bool) – If True yields the run index in addition to the group.

  • run_sel (list of int, optional) – If not None should be a list of the runs you want to iterate over.

Yields
  • run_idx (int, if idxs is True)

  • run_group (h5py.Group)

iter_trajs(idxs=False, traj_sel=None)[source]

Generator for iterating over trajectories in a file.

Parameters
  • idxs (bool) – If True returns a tuple of the run index and trajectory index in addition to the trajectory group.

  • traj_sel (list of tuple of int, optional) – If not None, a list of (run_idx, traj_idx) tuples selecting which trajectories to iterate over.

Yields
  • traj_id (tuple of int, if idxs is True) – A tuple of (run_idx, traj_idx) for the group

  • trajectory (h5py.Group)

iter_run_trajs(run_idx, idxs=False)[source]

Iterate over the trajectories of a run.

Parameters
  • run_idx (int) –

  • idxs (bool) – If True returns a tuple of the run index and trajectory index in addition to the trajectory group.

Returns

iter_trajs_generator

Return type

generator for the iter_trajs method

property defined_traj_field_names

A list of the field names, as defined in the settings, that all trajectories in the file have.

property observable_field_names

Returns a list of the names of the observables that all trajectories have.

Raises an inconsistency error if it encounters observable fields that do not occur in all trajectories.

_check_traj_field_consistency(field_names)[source]

Checks that every trajectory has the given fields across the entire dataset.

Parameters

field_names (list of str) – The field names to check for.

Returns

consistent – True if all trajs have the fields, False otherwise

Return type

bool

property record_fields

The record fields for each record group which are selected for inclusion in the truncated records.

These are the fields which are considered to be table-ified.

Returns

record_fields – Mapping of record group name to a list of the record group fields.

Return type

dict of str : list of str

property sparse_fields

The trajectory fields that are sparse.

property main_rep_idxs

The indices of the atoms included from the full topology in the default ‘positions’ trajectory

property alt_reps_idxs

Mapping of the names of the alt reps to the indices of the atoms from the topology that they include in their datasets.

property alt_reps

Names of the alt reps.

property field_feature_shapes

Mapping of the names of the trajectory fields to their feature vector shapes.

property field_feature_dtypes

Mapping of the names of the trajectory fields to their feature vector numpy dtypes.

property continuations

The continuation relationships in this file.

property metadata

File metadata (h5py.attrs).

decision_enum(run_idx)[source]

Mapping of decision enumerated names to their integer representations.

Parameters

run_idx (int) –

Returns

decision_enum – Mapping of the decision ID string to the integer representation.

Return type

dict of str : int

See also

WepyHDF5.decision_value_names

for the reverse mapping

decision_value_names(run_idx)[source]

Mapping of the integer values for decisions to the decision ID strings.

Parameters

run_idx (int) –

Returns

decision_enum – Mapping of the decision integer to the decision ID string representation.

Return type

dict of int : str

See also

WepyHDF5.decision_enum

for the reverse mapping
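
The relationship between the two mappings is a plain dictionary inversion, sketched below. The decision names and integer values are hypothetical; the real ones depend on the resampler used for the run.

```python
# Hypothetical decision enumeration; real names and values depend on the resampler.
decision_enum = {"NOTHING": 1, "CLONE": 2, "SQUASH": 3, "KEEP_MERGE": 4}

# Invert the mapping to go from integer values back to decision ID strings.
decision_value_names = {value: name for name, value in decision_enum.items()}
```

The inversion is lossless only because each decision has a distinct integer value.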

get_topology(alt_rep='positions')[source]

Get the JSON topology for a particular representation of the positions.

By default gives the topology for the main ‘positions’ field (when alt_rep is ‘positions’). To get the full topology the file was initialized with, set alt_rep to None. Topologies for alternative representations (subfields of ‘alt_reps’) can be obtained by passing in the key for that alt_rep, for example ‘all_atoms’ for the field in alt_reps called ‘all_atoms’.

Parameters

alt_rep (str) – The base name of the alternate representation, or ‘positions’, or None.

Returns

topology – The JSON topology string for the representation.

Return type

str

property topology

The topology for the full simulated system.

May not be the main representation in the POSITIONS field; for that use the get_topology method.

Returns

topology – The JSON topology string for the full representation.

Return type

str

get_mdtraj_topology(alt_rep='positions')[source]

Get an mdtraj.Topology object for a system representation.

By default gives the topology for the main ‘positions’ field (when alt_rep is ‘positions’). To get the full topology the file was initialized with, set alt_rep to None. Topologies for alternative representations (subfields of ‘alt_reps’) can be obtained by passing in the key for that alt_rep, for example ‘all_atoms’ for the field in alt_reps called ‘all_atoms’.

Parameters

alt_rep (str) – The base name of the alternate representation, or ‘positions’, or None.

Returns

topology – The mdtraj.Topology object for the requested representation.

Return type

mdtraj.Topology

initial_walker_fields(run_idx, fields, walker_idxs=None)[source]

Get fields from the initial walkers of the simulation.

Parameters
  • run_idx (int) – Run to get initial walkers for.

  • fields (list of str) – Names of the fields you want to retrieve.

  • walker_idxs (None or list of int) – If None returns all of the walkers fields, otherwise a list of ints that are a selection from those walkers.

Returns

walker_fields – Dictionary mapping fields to the values for all walkers. Frames will be either in counting order if no indices were requested or the order of the walker indices as given.

Return type

dict of str : array of shape

initial_walkers_to_mdtraj(run_idx, walker_idxs=None, alt_rep='positions')[source]

Generate an mdtraj Trajectory from the initial walkers of a run.

Uses the default fields for positions (unless an alternate representation is specified) and box vectors which are assumed to be present in the trajectory fields.

The time value for the mdtraj trajectory is set to the cycle indices for each trace frame.

This is useful for converting WepyHDF5 data to common molecular dynamics data formats accessible through the mdtraj library.

Parameters
  • run_idx (int) – Run to get initial walkers for.

  • walker_idxs (None or list of int) – If None returns all of the walkers fields, otherwise a list of ints that are a selection from those walkers.

  • alt_rep (None or str) – If None uses default ‘positions’ representation otherwise chooses the representation from the ‘alt_reps’ compound field.

Returns

traj

Return type

mdtraj.Trajectory

property num_atoms

The number of atoms in the full topology representation.

property num_dims

The number of spatial dimensions in the positions and alt_reps trajectory fields.

property num_runs

The number of runs in the file.

property num_trajs

The total number of trajectories in the entire file.

num_init_walkers(run_idx)[source]

The number of initial walkers for a run.

Parameters

run_idx (int) –

Returns

n_walkers

Return type

int

num_walkers(run_idx, cycle_idx)[source]

Get the number of walkers at a given cycle in a run.

Parameters
  • run_idx (int) –

  • cycle_idx (int) –

Returns

n_walkers

Return type

int

num_run_trajs(run_idx)[source]

The number of trajectories in a run.

Parameters

run_idx (int) –

Returns

n_trajs

Return type

int

num_run_cycles(run_idx)[source]

The number of cycles in a run.

Parameters

run_idx (int) –

Returns

n_cycles

Return type

int

num_traj_frames(run_idx, traj_idx)[source]

The number of frames in a given trajectory.

Parameters
  • run_idx (int) –

  • traj_idx (int) –

Returns

n_frames

Return type

int

property run_idxs

The indices of the runs in the file.

run_traj_idxs(run_idx)[source]

The indices of trajectories in a run.

Parameters

run_idx (int) –

Returns

traj_idxs

Return type

list of int

run_traj_idx_tuples(runs=None)[source]

Get identifier tuples (run_idx, traj_idx) for all trajectories in all runs.

Parameters

runs (list of int, optional) – If not None, a list of run indices to restrict to.

Returns

run_traj_tuples – A listing of all trajectories by their identifying tuple of (run_idx, traj_idx).

Return type

list of tuple of int
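
A small sketch of how such identifier tuples can be enumerated, using a dict of trajectory counts in place of the HDF5 file. The dict `run_num_trajs` and its contents are hypothetical, and the function below is a standalone stand-in for the method.

```python
# Hypothetical layout: run index -> number of trajectories in that run.
run_num_trajs = {0: 3, 1: 2}


def run_traj_idx_tuples(runs=None):
    """List (run_idx, traj_idx) for every trajectory, optionally for some runs only."""
    run_idxs = sorted(run_num_trajs) if runs is None else runs
    return [(run_idx, traj_idx)
            for run_idx in run_idxs
            for traj_idx in range(run_num_trajs[run_idx])]


all_tuples = run_traj_idx_tuples()
run1_tuples = run_traj_idx_tuples(runs=[1])
```

Tuples of this form match the (run_idx, traj_idx) selection format described for `traj_sel` in `iter_trajs`.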

get_traj_field_cycle_idxs(run_idx, traj_idx, field_path)[source]

Returns the cycle indices for a sparse trajectory field.

Parameters
  • run_idx (int) –

  • traj_idx (int) –

  • field_path (str) – Name of the trajectory field

Returns

cycle_idxs

Return type

arraylike of int

next_run_idx()[source]

The index of the next run if it were to be added.

Because runs are named as the integer value of the order they were added this gives the index of the next run that would be added.

Returns

next_run_idx

Return type

int

next_run_traj_idx(run_idx)[source]

The index of the next trajectory for this run.

Parameters

run_idx (int) –

Returns

next_traj_idx

Return type

int

is_run_contig(run_idxs)[source]

Check whether a given list of run indices forms a valid contig.

Parameters

run_idxs (list of int) – The run indices that would make up the contig in order.

Returns

is_contig

Return type

bool
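
One way to picture the check, assuming continuations are stored as (continuation_run, base_run) pairs in the same order as the arguments of `add_continuation`. The set `continuations` and the standalone function below are hypothetical; how wepy stores and checks these internally is not shown here.

```python
# Hypothetical continuation pairs in (continuation_run, base_run) order,
# matching the argument order of add_continuation.
continuations = {(1, 0), (2, 1), (4, 3)}


def is_run_contig(run_idxs):
    """True if each run in the list continues the run listed just before it."""
    return all((cont, base) in continuations
               for base, cont in zip(run_idxs, run_idxs[1:]))


chain_ok = is_run_contig([0, 1, 2])
gap = is_run_contig([0, 2])
```

In this sketch a single run trivially counts as a contig, since there are no adjacent pairs to check.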

clone(path, mode='x')[source]

Clone the header information of this file into another file.

Clones this WepyHDF5 file without any of the actual runs and run data. This includes the topology, units, sparse_fields, feature shapes and dtypes, alt_reps, and main representation information.

This method will flush the buffers for this file.

Does not preserve metadata pertaining to inter-run relationships like continuations.

Parameters
  • path (str) – File path to save the new file.

  • mode (str) – The mode to open the new file with.

Returns

new_file – The handle to the new file. It will be closed.

Return type

h5py.File

Add a run from another file to this one as an HDF5 external link.

Parameters
  • filepath (str) – File path to the HDF5 file that the run is on.

  • run_idx (int) – The run index from the target file you want to link.

  • continue_run (int, optional) – The run from the linking WepyHDF5 file you want the target linked run to continue.

  • kwargs (dict) – Adds metadata (h5py.attrs) to the linked run.

Returns

linked_run_idx – The index of the linked run in the linking file.

Return type

int

Link all runs from another WepyHDF5 file.

This preserves continuations within that file. This will open the file if not already opened.

Parameters

wepy_h5_path (str) – Filepath to the file you want to link runs from.

Returns

new_run_idxs – The new run idxs from the linking file.

Return type

list of int

extract_run(filepath, run_idx, continue_run=None, run_slice=None, **kwargs)[source]

Add a run from another file to this one by copying it and truncating it if necessary.

Parameters
  • filepath (str) – File path to the HDF5 file that the run is on.

  • run_idx (int) – The run index from the target file you want to extract.

  • continue_run (int, optional) – The run in this WepyHDF5 file that the extracted run should continue.

  • run_slice

  • kwargs (dict) – Adds metadata (h5py.attrs) to the extracted run.

Returns

linked_run_idx – The index of the extracted run in this file.

Return type

int

extract_file_runs(wepy_h5_path, run_slices=None)[source]

Extract (copying and truncating appropriately) all runs from another WepyHDF5 file.

This preserves continuations within that file. This will open the file if not already opened.

Parameters

wepy_h5_path (str) – Filepath to the file you want to extract runs from.

Returns

new_run_idxs – The indices of the newly added runs in this file.

Return type

list of int

join(other_h5)[source]

Given another WepyHDF5 file object does a left join on this file, renumbering the runs starting from this file.

This function uses the H5O function for copying. Data will be copied not linked.

Parameters

other_h5 (h5py.File) – File handle to the file you want to join to this one.

add_metadata(key, value)[source]

Add metadata for the whole file.

Parameters
  • key (str) –

  • value (h5py value) – h5py valid metadata value.

init_record_fields(run_record_key, record_fields)[source]

Initialize the settings record fields for a record group in the settings group.

Save which fields of a run record group’s datasets are to be included in the table-like representation. This allows large and small datasets for records to be stored together, while still allowing a more compact single table-like representation to be produced for serialization.

Parameters
  • run_record_key (str) – Name of the record group you want to set this for.

  • record_fields (list of str) – Names of the fields you want to set as record fields.

init_resampling_record_fields(resampler)[source]

Initialize the record fields for this record group.

Parameters

resampler (object implementing the Resampler interface) – The resampler which contains the data for which record fields to set.

init_resampler_record_fields(resampler)[source]

Initialize the record fields for this record group.

Parameters

resampler (object implementing the Resampler interface) – The resampler which contains the data for which record fields to set.

init_bc_record_fields(bc)[source]

Initialize the record fields for this record group.

Parameters

bc (object implementing the BoundaryConditions interface) – The boundary conditions object which contains the data for which record fields to set.

init_warping_record_fields(bc)[source]

Initialize the record fields for this record group.

Parameters

bc (object implementing the BoundaryConditions interface) – The boundary conditions object which contains the data for which record fields to set.

init_progress_record_fields(bc)[source]

Initialize the record fields for this record group.

Parameters

bc (object implementing the BoundaryConditions interface) – The boundary conditions object which contains the data for which record fields to set.

add_continuation(continuation_run, base_run)[source]

Add a continuation between runs.

Parameters
  • continuation_run (int) – The run index of the run that will be continuing another

  • base_run (int) – The run that is being continued.

new_run(init_walkers, continue_run=None, **kwargs)[source]

Initialize a new run.

Parameters
  • init_walkers (list of objects implementing the Walker interface) – The walkers that will be the start of this run.

  • continue_run (int, optional) – If this run is a continuation of another, the index of the run it is continuing.

  • kwargs (dict) – Metadata to set for the run.

Returns

run_grp – The group of the newly created run.

Return type

h5py.Group

init_run_resampling(run_idx, resampler)[source]

Initialize data for resampling records.

Initializes the run record group as well as settings for the fields.

This method also creates the decision group for the run.

Parameters
  • run_idx (int) –

  • resampler (object implementing the Resampler interface) – The resampler which contains the data for which record fields to set.

Returns

record_grp

Return type

h5py.Group

init_run_resampling_decision(run_idx, resampler)[source]

Initialize the decision group for the run resampling records.

Parameters
  • run_idx (int) –

  • resampler (object implementing the Resampler interface) – The resampler which contains the data for which record fields to set.

init_run_resampler(run_idx, resampler)[source]

Initialize data for this record group in a run.

Initializes the run record group as well as settings for the fields.

Parameters
  • run_idx (int) –

  • resampler (object implementing the Resampler interface) – The resampler which contains the data for which record fields to set.

Returns

record_grp

Return type

h5py.Group

init_run_warping(run_idx, bc)[source]

Initialize data for this record group in a run.

Initializes the run record group as well as settings for the fields.

Parameters
  • run_idx (int) –

  • bc (object implementing the BoundaryConditions interface) – The boundary conditions object which contains the data for which record fields to set.

Returns

record_grp

Return type

h5py.Group

init_run_progress(run_idx, bc)[source]

Initialize data for this record group in a run.

Initializes the run record group as well as settings for the fields.

Parameters
  • run_idx (int) –

  • bc (object implementing the BoundaryConditions interface) – The boundary conditions object which contains the data for which record fields to set.

Returns

record_grp

Return type

h5py.Group

init_run_bc(run_idx, bc)[source]

Initialize data for this record group in a run.

Initializes the run record group as well as settings for the fields.

Parameters
  • run_idx (int) –

  • bc (object implementing the BoundaryConditions interface) – The boundary conditions object which contains the data for which record fields to set.

Returns

record_grp

Return type

h5py.Group

init_run_fields_resampling(run_idx, fields)[source]

Initialize the field datasets for this record group.

Parameters
  • run_idx (int) –

  • fields (list of str) – Names of the fields to initialize

Returns

record_grp

Return type

h5py.Group

init_run_fields_resampling_decision(run_idx, decision_enum_dict)[source]

Initialize the decision group for this run.

Parameters
  • run_idx (int) –

  • decision_enum_dict (dict of str : int) – Mapping of decision ID strings to integer representation.

init_run_fields_resampler(run_idx, fields)[source]

Initialize the field datasets for this record group.

Parameters
  • run_idx (int) –

  • fields (list of str) – Names of the fields to initialize

Returns

record_grp

Return type

h5py.Group

init_run_fields_warping(run_idx, fields)[source]

Initialize the field datasets for this record group.

Parameters
  • run_idx (int) –

  • fields (list of str) – Names of the fields to initialize

Returns

record_grp

Return type

h5py.Group

init_run_fields_progress(run_idx, fields)[source]

Initialize the field datasets for this record group.

Parameters
  • run_idx (int) –

  • fields (list of str) – Names of the fields to initialize

Returns

record_grp

Return type

h5py.Group

init_run_fields_bc(run_idx, fields)[source]

Initialize the field datasets for this record group.

Parameters
  • run_idx (int) –

  • fields (list of str) – Names of the fields to initialize

Returns

record_grp

Return type

h5py.Group

init_run_record_grp(run_idx, run_record_key, fields)[source]

Initialize a record group for a run.

Parameters
  • run_idx (int) –

  • run_record_key (str) – The name of the record group.

  • fields (list of str) – The names of the fields to set for the record group.

add_traj(run_idx, data, weights=None, sparse_idxs=None, metadata=None)[source]

Add a full trajectory to a run.

Parameters
  • run_idx (int) –

  • data (dict of str : arraylike) – Mapping of trajectory fields to the data for them to add.

  • weights (1-D arraylike of float) – The weights of each frame. If None defaults all frames to 1.0.

  • sparse_idxs (list of int) – Cycle indices the data corresponds to.

  • metadata (dict of str : value) – Metadata for the trajectory.

Returns

traj_grp

Return type

h5py.Group

extend_traj(run_idx, traj_idx, data, weights=None)[source]

Extend a trajectory with data for all fields.

Parameters
  • run_idx (int) –

  • traj_idx (int) –

  • data (dict of str : arraylike) – The data to add for each field of the trajectory. Must all have the same first dimension.

  • weights (arraylike) – Weights for the frames of the trajectory. If None defaults all frames to 1.0.
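
The two constraints described above (every field shares the same first dimension, and weights default to 1.0) can be sketched as a pre-check. The helper `check_extension` and the sample data are hypothetical; whether and how wepy itself validates its inputs is not stated here.

```python
def check_extension(data, weights=None):
    """Validate per-field frame counts and fill in default weights."""
    frame_counts = {len(values) for values in data.values()}
    if len(frame_counts) != 1:
        raise ValueError("all fields must have the same number of frames")
    n_frames = frame_counts.pop()
    if weights is None:
        # Default every new frame to a weight of 1.0.
        weights = [1.0] * n_frames
    elif len(weights) != n_frames:
        raise ValueError("weights must have one entry per frame")
    return n_frames, weights


# Hypothetical two-frame extension with two fields.
data = {
    "positions": [[[0.0, 0.0, 0.0]], [[0.1, 0.0, 0.0]]],
    "box_vectors": [[[1, 0, 0], [0, 1, 0], [0, 0, 1]]] * 2,
}
n_frames, weights = check_extension(data)
```

Running a check like this before writing avoids leaving the fields of a trajectory with different lengths.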

extend_cycle_warping_records(run_idx, cycle_idx, warping_data)[source]

Add records for each field for this record group.

Parameters
  • run_idx (int) –

  • cycle_idx (int) – The cycle index these records correspond to.

  • warping_data (dict of str : arraylike) – Mapping of the record group fields to a collection of values for each field.

extend_cycle_bc_records(run_idx, cycle_idx, bc_data)[source]

Add records for each field for this record group.

Parameters
  • run_idx (int) –

  • cycle_idx (int) – The cycle index these records correspond to.

  • bc_data (dict of str : arraylike) – Mapping of the record group fields to a collection of values for each field.

extend_cycle_progress_records(run_idx, cycle_idx, progress_data)[source]

Add records for each field for this record group.

Parameters
  • run_idx (int) –

  • cycle_idx (int) – The cycle index these records correspond to.

  • progress_data (dict of str : arraylike) – Mapping of the record group fields to a collection of values for each field.

extend_cycle_resampling_records(run_idx, cycle_idx, resampling_data)[source]

Add records for each field for this record group.

Parameters
  • run_idx (int) –

  • cycle_idx (int) – The cycle index these records correspond to.

  • resampling_data (dict of str : arraylike) – Mapping of the record group fields to a collection of values for each field.

extend_cycle_resampler_records(run_idx, cycle_idx, resampler_data)[source]

Add records for each field for this record group.

Parameters
  • run_idx (int) –

  • cycle_idx (int) – The cycle index these records correspond to.

  • resampler_data (dict of str : arraylike) – Mapping of the record group fields to a collection of values for each field.

extend_cycle_run_group_records(run_idx, run_record_key, cycle_idx, fields_data)[source]

Extend data for a whole records group.

This must have the cycle index for the data it is appending as this is done for sporadic and continual datasets.

Parameters
  • run_idx (int) –

  • run_record_key (str) – Name of the record group.

  • cycle_idx (int) – The cycle index these records correspond to.

  • fields_data (dict of str : arraylike) – Mapping of the field name to the values for the records being added.

run_records(run_idx, run_record_key)[source]

Get the records for a record group for a single run.

Parameters
  • run_idx (int) –

  • run_record_key (str) – The name of the record group.

Returns

records – The list of records for the run’s record group.

Return type

list of namedtuple objects

run_contig_records(run_idxs, run_record_key)[source]

Get the records for a record group for the contig that is formed by the run indices.

This alters the cycle indices for the records so that they appear to have come from a single run. That is they are the cycle indices of the contig.

Parameters
  • run_idxs (list of int) – The run indices that form a contig. (i.e. element 1 continues element 0)

  • run_record_key (str) – Name of the record group.

Returns

records – The list of records for the contig’s record group.

Return type

list of namedtuple objects

run_records_dataframe(run_idx, run_record_key)[source]

Get the records for a record group for a single run in the form of a pandas DataFrame.

Parameters
  • run_idx (int) –

  • run_record_key (str) – Name of record group.

Returns

record_df

Return type

pandas.DataFrame

run_contig_records_dataframe(run_idxs, run_record_key)[source]

Get the records for a record group for a contig of runs in the form of a pandas DataFrame.

Parameters
  • run_idxs (list of int) – The run indices that form a contig. (i.e. element 1 continues element 0)

  • run_record_key (str) – The name of the record group.

Returns

records_df

Return type

pandas.DataFrame

resampling_records(run_idxs)[source]

Get the records for this record group for the contig that is formed by the run indices.

This alters the cycle indices for the records so that they appear to have come from a single run. That is they are the cycle indices of the contig.

Parameters

run_idxs (list of int) – The run indices that form a contig. (i.e. element 1 continues element 0)

Returns

records – The list of records for the contig’s record group.

Return type

list of namedtuple objects

resampling_records_dataframe(run_idxs)[source]

Get the records for this record group for a contig of runs in the form of a pandas DataFrame.

Parameters

run_idxs (list of int) – The run indices that form a contig. (i.e. element 1 continues element 0)

Returns

records_df

Return type

pandas.DataFrame

resampler_records(run_idxs)[source]

Get the records for this record group for the contig that is formed by the run indices.

This alters the cycle indices for the records so that they appear to have come from a single run. That is they are the cycle indices of the contig.

Parameters

run_idxs (list of int) – The run indices that form a contig. (i.e. element 1 continues element 0)

Returns

records – The list of records for the contig’s record group.

Return type

list of namedtuple objects

resampler_records_dataframe(run_idxs)[source]

Get the records for this record group for a contig of runs in the form of a pandas DataFrame.

Parameters

run_idxs (list of int) – The run indices that form a contig. (i.e. element 1 continues element 0)

Returns

records_df

Return type

pandas.DataFrame

warping_records(run_idxs)[source]

Get the records for this record group for the contig that is formed by the run indices.

This alters the cycle indices for the records so that they appear to have come from a single run. That is they are the cycle indices of the contig.

Parameters

run_idxs (list of int) – The run indices that form a contig. (i.e. element 1 continues element 0)

Returns

records – The list of records for the contig’s record group.

Return type

list of namedtuple objects

warping_records_dataframe(run_idxs)[source]

Get the records for this record group for a contig of runs in the form of a pandas DataFrame.

Parameters

run_idxs (list of int) – The run indices that form a contig. (i.e. element 1 continues element 0)

Returns

records_df

Return type

pandas.DataFrame

bc_records(run_idxs)[source]

Get the records for this record group for the contig that is formed by the run indices.

This alters the cycle indices for the records so that they appear to have come from a single run. That is they are the cycle indices of the contig.

Parameters

run_idxs (list of int) – The run indices that form a contig. (i.e. element 1 continues element 0)

Returns

records – The list of records for the contig’s record group.

Return type

list of namedtuple objects

bc_records_dataframe(run_idxs)[source]

Get the records for this record group for a contig of runs in the form of a pandas DataFrame.

Parameters

run_idxs (list of int) – The run indices that form a contig. (i.e. element 1 continues element 0)

Returns

records_df

Return type

pandas.DataFrame

progress_records(run_idxs)[source]

Get the records for this record group for the contig that is formed by the run indices.

This alters the cycle indices for the records so that they appear to have come from a single run; that is, they are the cycle indices of the contig.

Parameters

run_idxs (list of int) – The run indices that form a contig. (i.e. element 1 continues element 0)

Returns

records – The list of records for the contig’s record group.

Return type

list of namedtuple objects

progress_records_dataframe(run_idxs)[source]

Get the records for this record group for a contig of runs in the form of a pandas DataFrame.

Parameters

run_idxs (list of int) – The run indices that form a contig. (i.e. element 1 continues element 0)

Returns

records_df

Return type

pandas.DataFrame

run_resampling_panel(run_idx)[source]

Generate a resampling panel from the resampling records of a run.

Parameters

run_idx (int) –

Returns

resampling_panel – The panel (list of tables) of resampling records in order (cycle, step, walker)

Return type

list of list of list of namedtuple records
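As a rough illustration of the (cycle, step, walker) nesting, the following standalone sketch builds such a panel from a flat list of records. The `ResamplingRecord` layout, its field names, and `build_panel` are assumptions made for this example, not the library's own definitions.

```python
from collections import namedtuple

# Hypothetical record layout; the field names are assumptions for illustration.
ResamplingRecord = namedtuple(
    "ResamplingRecord", ["cycle_idx", "step_idx", "walker_idx", "decision"]
)

def build_panel(records):
    """Nest flat records as panel[cycle][step][walker]."""
    panel = []
    ordered = sorted(records, key=lambda r: (r.cycle_idx, r.step_idx, r.walker_idx))
    for rec in ordered:
        # grow the list of cycle tables as needed
        while len(panel) <= rec.cycle_idx:
            panel.append([])
        table = panel[rec.cycle_idx]
        # grow the list of steps within this cycle as needed
        while len(table) <= rec.step_idx:
            table.append([])
        table[rec.step_idx].append(rec)
    return panel

flat = [
    ResamplingRecord(1, 0, 1, "NOTHING"),
    ResamplingRecord(0, 0, 0, "CLONE"),
    ResamplingRecord(0, 0, 1, "SQUASH"),
    ResamplingRecord(1, 0, 0, "NOTHING"),
]
panel = build_panel(flat)
```

Indexing then reads naturally: `panel[0][0][1]` is the record for walker 1 in step 0 of cycle 0.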

run_contig_resampling_panel(run_idxs)[source]

Generate a resampling panel from the resampling records of a contig, which is a series of runs.

Parameters

run_idxs (list of int) – The run indices that form a contig. (i.e. element 1 continues element 0)

Returns

resampling_panel – The panel (list of tables) of resampling records in order (cycle, step, walker)

Return type

list of list of list of namedtuple records

add_run_observable(run_idx, observable_name, data, sparse_idxs=None)[source]

Add a trajectory sub-field in the compound field “observables” for a single run.

Parameters
  • run_idx (int) –

  • observable_name (str) – What to name the observable subfield.

  • data (arraylike of shape (n_trajs, feature_vector_shape[0], ..)) – The data for all of the trajectories that will be set to this observable field.

  • sparse_idxs (list of int, optional) – If not None, specifies the cycle indices this data corresponds to.

add_traj_observable(observable_name, data, sparse_idxs=None)[source]

Add a trajectory sub-field in the compound field “observables” for an entire file, on a trajectory basis.

Parameters
  • observable_name (str) – What to name the observable subfield.

  • data (list of arraylike) – The data for each run are the elements of this argument. Each element is an arraylike of shape (n_traj_frames, feature_vector_shape[0],…) where n_traj_frames is the number of frames in the trajectory.

  • sparse_idxs (list of list of int, optional) – If not None, specifies the cycle indices this data corresponds to. First by run, then by trajectory.

add_observable(observable_name, data, sparse_idxs=None)[source]

Add a trajectory sub-field in the compound field “observables” for an entire file, on a compound run and trajectory basis.

Parameters
  • observable_name (str) – What to name the observable subfield.

  • data (list of list of arraylike) – The data for each run are the elements of this argument. Each element is a list of the trajectory observable arraylikes of shape (n_traj_frames, feature_vector_shape[0],…).

  • sparse_idxs (list of list of int, optional) – If not None, specifies the cycle indices this data corresponds to. First by run, then by trajectory.
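The nesting expected for the data argument (first by run, then by trajectory) may be easier to see in a concrete value. The sketch below uses plain nested lists in place of arrays, and `flatten_observable` is a hypothetical helper that is not part of the API.

```python
def flatten_observable(data):
    """Map nested per-run, per-trajectory data to {(run_idx, traj_idx): values}."""
    flat = {}
    for run_idx, run_data in enumerate(data):
        for traj_idx, traj_data in enumerate(run_data):
            flat[(run_idx, traj_idx)] = traj_data
    return flat

# Two runs: run 0 has two trajectories of three frames each,
# run 1 has one trajectory of two frames.
# Every frame carries a feature vector of length 1.
data = [
    [[[0.1], [0.2], [0.3]], [[1.1], [1.2], [1.3]]],
    [[[2.1], [2.2]]],
]
by_traj = flatten_observable(data)
```

Each innermost list of frames plays the role of one arraylike of shape (n_traj_frames, 1).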

compute_observable(func, fields, args, map_func=<class 'map'>, traj_sel=None, save_to_hdf5=None, idxs=False, return_results=True)[source]

Compute an observable on the trajectory data according to a function. Optionally save that data in the observables data group for the trajectory.

Parameters
  • func (callable) – The function to apply to the trajectory fields (by cycle). Must accept a dictionary mapping string trajectory field names to a feature vector for that cycle and return an arraylike. May accept other positional arguments as well.

  • fields (list of str) – A list of trajectory field names to pass to the mapped function.

  • args (tuple) – A single tuple of arguments which will be expanded and passed to the mapped function for every evaluation.

  • map_func (callable) – The mapping function. The implementation of how to map the computation function over the data. Default is the python builtin map function. Can be a parallel implementation for example.

  • traj_sel (list of tuple, optional) – If not None, a list of trajectory identifier tuples (run_idx, traj_idx) to restrict the computation to.

  • save_to_hdf5 (None or string, optional) – If not None, a string that specifies the name of the observables sub-field that the computed values will be saved to.

  • idxs (bool) – If True will return the trajectory identifier tuple (run_idx, traj_idx) along with other return values.

  • return_results (bool) – If True will return the results of the mapping. If not using the ‘save_to_hdf5’ option, be sure to use this or results will be lost.

Returns

  • traj_id_tuples (list of tuple of int, if ‘idxs’ option is True) – A list of the tuple identifiers for each trajectory result.

  • results (list of arraylike, if ‘return_results’ option is True) – A list of arraylike feature vectors for each trajectory.
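To make the roles of func, fields and args concrete, here is a minimal standalone sketch of the mapping pattern. It assumes the function is called once per trajectory with a dictionary of the selected fields, and that each element of args is repeated for every call. `scaled_mean` and `map_over_trajs` are hypothetical names, and plain lists stand in for the trajectory data in a file.

```python
from itertools import repeat

def scaled_mean(fields_d, scale):
    """Toy observable: per-frame mean of the 'positions' values, times scale."""
    return [scale * sum(frame) / len(frame) for frame in fields_d["positions"]]

# Stand-in for the per-trajectory field dictionaries a file would provide.
traj_fields = [
    {"positions": [[0.0, 2.0], [2.0, 4.0]]},
    {"positions": [[1.0, 1.0], [3.0, 5.0]]},
]

def map_over_trajs(func, fields_dicts, args, map_func=map):
    """Apply func to each fields dict, passing every element of args to each call."""
    repeated_args = [repeat(arg) for arg in args]
    return list(map_func(func, fields_dicts, *repeated_args))

results = map_over_trajs(scaled_mean, traj_fields, args=(10.0,))
```

The result is one list of per-frame values for each trajectory, which is the kind of data that would be stored under an observables sub-field.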

get_traj_field(run_idx, traj_idx, field_path, frames=None, masked=True)[source]

Returns a numpy array for the given trajectory field.

You can control how sparse fields are returned using the masked option. When True (the default), a masked numpy array is returned so that you can tell which cycles the values come from. When False, an unmasked array of the compacted data is returned, which carries no cycle information.

Parameters
  • run_idx (int) –

  • traj_idx (int) –

  • field_path (str) – Name of the trajectory field to get

  • frames (None or list of int) – If not None, a list of the frame indices of the trajectory to return values for.

  • masked (bool) – If true will return sparse field values as masked arrays, otherwise just returns the compacted data.

Returns

field_data – The data for the trajectory field.

Return type

arraylike
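The difference between the compacted and the masked view of a sparse field can be sketched without numpy. In this illustration None stands in for a masked entry. The real method returns numpy arrays, and `expand_sparse` is a hypothetical helper.

```python
def expand_sparse(values, sparse_idxs, n_frames, fill=None):
    """Place compact sparse values at their cycle indices and fill the rest."""
    full = [fill] * n_frames
    for idx, value in zip(sparse_idxs, values):
        full[idx] = value
    return full

compact = [[0.5], [0.7]]   # compacted data, with no cycle information
sparse_idxs = [0, 3]       # cycles at which the field was actually saved
expanded = expand_sparse(compact, sparse_idxs, n_frames=5)
```

The expanded form has one slot per frame of the trajectory, so the position of each value tells you its cycle.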

get_trace_fields(frame_tups, fields, same_order=True)[source]

Get trajectory field data for the frames specified by the trace.

Parameters
  • frame_tups (list of tuple of int) – The trace values. Each tuple is of the form (run_idx, traj_idx, frame_idx).

  • fields (list of str) – The names of the fields to get for each frame.

  • same_order (bool) – (Default = True) If True, ensures that the results are sorted in exactly the same order as frame_tups. If False, returns them in an arbitrary, implementation-determined order that should be more efficient.

Returns

trace_fields – Mapping of the field names to the array of feature vectors for the trace.

Return type

dict of str : arraylike
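One plausible reading of the same_order option is sketched below: frames are grouped by trajectory so that each trajectory is read once, and the original trace order is restored only on request. This is an illustration under that assumption rather than the library's implementation. `store` and `trace_values` are hypothetical.

```python
from collections import defaultdict

# Stand-in store: (run_idx, traj_idx) -> list of per-frame values.
store = {
    (0, 0): ["a0", "a1", "a2"],
    (0, 1): ["b0", "b1", "b2"],
}

def trace_values(frame_tups, same_order=True):
    """Read frames grouped by trajectory, optionally restoring trace order."""
    grouped = defaultdict(list)
    for pos, (run_idx, traj_idx, frame_idx) in enumerate(frame_tups):
        grouped[(run_idx, traj_idx)].append((pos, frame_idx))
    fetched = []
    for traj_id, items in grouped.items():
        traj = store[traj_id]  # one lookup per trajectory
        fetched.extend((pos, traj[frame_idx]) for pos, frame_idx in items)
    if same_order:
        # sort by the position each frame had in the input trace
        fetched.sort(key=lambda item: item[0])
    return [value for _, value in fetched]

trace = [(0, 1, 2), (0, 0, 0), (0, 1, 0)]
ordered = trace_values(trace)
unordered = trace_values(trace, same_order=False)
```

Both calls return the same set of frames; only the ordering guarantee differs.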

get_run_trace_fields(run_idx, frame_tups, fields)[source]

Get trajectory field data for the frames specified by the trace within a single run.

Parameters
  • run_idx (int) –

  • frame_tups (list of tuple of int) – The trace values. Each tuple is of the form (traj_idx, frame_idx).

  • fields (list of str) – The names of the fields to get for each frame.

Returns

trace_fields – Mapping of the field names to the array of feature vectors for the trace.

Return type

dict of str : arraylike

get_contig_trace_fields(contig_trace, fields)[source]

Get field data for all trajectories of a contig for the frames specified by the contig trace.

Parameters
  • contig_trace (list of tuple of int) – The trace values. Each tuple is of the form (run_idx, frame_idx).

  • fields (list of str) – The names of the fields to get for each cycle.

Returns

contig_fields – Mapping of the field names to the array of feature vectors for the contig trace. Each array is of shape (n_cycles, n_trajs, field_feature_shape[0],…).

Return type

dict of str : arraylike
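A contig trace selects whole cycles rather than single walkers, which is why the result gains an n_trajs axis. The following standalone sketch shows the idea with nested lists. `runs` and `contig_trace_values` are hypothetical stand-ins for the file and the method.

```python
# Stand-in data: run_idx -> list over trajectories -> list over frames.
runs = {
    0: [["r0t0f0", "r0t0f1"], ["r0t1f0", "r0t1f1"]],
    1: [["r1t0f0", "r1t0f1"], ["r1t1f0", "r1t1f1"]],
}

def contig_trace_values(contig_trace):
    """For each (run_idx, frame_idx), collect that frame from every trajectory."""
    return [
        [traj[frame_idx] for traj in runs[run_idx]]
        for run_idx, frame_idx in contig_trace
    ]

contig_trace = [(0, 0), (0, 1), (1, 0)]
values = contig_trace_values(contig_trace)
```

The outer list has one entry per element of the contig trace, and each inner list has one entry per trajectory.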

iter_trajs_fields(fields, idxs=False, traj_sel=None)[source]

Generator for iterating over the fields of trajectories in a file.

Parameters
  • fields (list of str) – Names of the trajectory fields you want to yield.

  • idxs (bool) – If True will also return the tuple identifier of the trajectory the field data is from.

  • traj_sel (list of tuple of int) – If not None, a list of trajectory identifiers to restrict iteration over.

Yields
  • traj_identifier (tuple of int if ‘idxs’ option is True) – Tuple identifying the trajectory the data belongs to (run_idx, traj_idx).

  • fields_data (dict of str : arraylike) – Mapping of the field name to the array of feature vectors of that field for this trajectory.
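The yielding behaviour, including the effect of idxs and traj_sel, can be sketched with a plain dictionary in place of the file. `iter_fields` and `store` are hypothetical, and the iteration order over unselected trajectories is an assumption.

```python
def iter_fields(store, fields, idxs=False, traj_sel=None):
    """Yield a {field: values} dict per trajectory, optionally with its id."""
    traj_ids = traj_sel if traj_sel is not None else sorted(store)
    for traj_id in traj_ids:
        fields_data = {name: store[traj_id][name] for name in fields}
        if idxs:
            yield traj_id, fields_data
        else:
            yield fields_data

# Stand-in store: (run_idx, traj_idx) -> {field name: per-frame values}.
store = {
    (0, 0): {"weights": [0.5, 0.5], "positions": [[0.0], [0.1]]},
    (0, 1): {"weights": [0.5, 0.25], "positions": [[1.0], [1.1]]},
}
pairs = list(iter_fields(store, ["weights"], idxs=True, traj_sel=[(0, 1)]))
all_weights = list(iter_fields(store, ["weights"]))
```

With idxs=True each item is a (traj_identifier, fields_data) pair; without it only the fields dictionary is yielded.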

traj_fields_map(func, fields, args, map_func=<class 'map'>, idxs=False, traj_sel=None)[source]

Function for mapping work onto the fields of trajectories.

Parameters
  • func (callable) – The function to apply to the trajectory fields (by cycle). Must accept a dictionary mapping string trajectory field names to a feature vector for that cycle and return an arraylike. May accept other positional arguments as well.

  • fields (list of str) – A list of trajectory field names to pass to the mapped function.

  • args (None or tuple) – A single tuple of arguments which will be passed to the mapped function for every evaluation.

  • map_func (callable) – The mapping function. The implementation of how to map the computation function over the data. Default is the python builtin map function. Can be a parallel implementation for example.

  • traj_sel (list of tuple, optional) – If not None, a list of trajectory identifier tuples (run_idx, traj_idx) to restrict the computation to.

  • idxs (bool) – If True will return the trajectory identifier tuple (run_idx, traj_idx) along with other return values.

Returns

  • traj_id_tuples (list of tuple of int, if ‘idxs’ option is True) – A list of the tuple identifiers for each trajectory result.

  • results (list of arraylike) – A list of arraylike feature vectors for each trajectory.
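Because map_func only needs to behave like the builtin map, a parallel mapper can be swapped in. The sketch below uses the standard library's `ThreadPoolExecutor.map` as one such drop-in and runs on stand-in data rather than a real file. `n_frames_above` and `traj_fields` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat

def n_frames_above(fields_d, cutoff):
    """Toy function: count frames whose weight exceeds cutoff."""
    return sum(1 for w in fields_d["weights"] if w > cutoff)

# Stand-in for the per-trajectory field dictionaries.
traj_fields = [
    {"weights": [0.1, 0.4, 0.6]},
    {"weights": [0.7, 0.8, 0.2]},
]

with ThreadPoolExecutor(max_workers=2) as pool:
    # pool.map has the same call shape as the builtin map and keeps input order
    counts = list(pool.map(n_frames_above, traj_fields, repeat(0.5)))
```

Whether threads or processes are the better choice depends on the function being mapped, so treat this only as an example of the calling convention.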

to_mdtraj(run_idx, traj_idx, frames=None, alt_rep=None)[source]

Convert a trajectory to an mdtraj Trajectory object.

Works if the right trajectory fields are defined. At minimum this requires a positions representation: either the ‘positions’ field or an ‘alt_reps’ subfield.

Will also set the unit cell lengths and angles if the ‘box_vectors’ field is present.

Will also set the time for the frames if the ‘time’ field is present, although this is likely not useful since walker segments have the time reset.

Parameters
  • run_idx (int) –

  • traj_idx (int) –

  • frames (None or list of int) – If not None, a list of the frames to include.

  • alt_rep (str) – If not None, an ‘alt_reps’ subfield name to use for positions instead of the ‘positions’ field.

Returns

traj

Return type

mdtraj.Trajectory

trace_to_mdtraj(trace, alt_rep=None)[source]

Generate an mdtraj Trajectory from a trace of frames from the runs.

Uses the default fields for positions (unless an alternate representation is specified) and box vectors which are assumed to be present in the trajectory fields.

The time value for the mdtraj trajectory is set to the cycle indices for each trace frame.

This is useful for converting WepyHDF5 data to common molecular dynamics data formats accessible through the mdtraj library.

Parameters
  • trace (list of tuple of int) – The trace values. Each tuple is of the form (run_idx, traj_idx, frame_idx).

  • alt_rep (None or str) – If None uses default ‘positions’ representation otherwise chooses the representation from the ‘alt_reps’ compound field.

Returns

traj

Return type

mdtraj.Trajectory

run_trace_to_mdtraj(run_idx, trace, alt_rep=None)[source]

Generate an mdtraj Trajectory from a trace of frames within a single run.

Uses the default fields for positions (unless an alternate representation is specified) and box vectors which are assumed to be present in the trajectory fields.

The time value for the mdtraj trajectory is set to the cycle indices for each trace frame.

This is useful for converting WepyHDF5 data to common molecular dynamics data formats accessible through the mdtraj library.

Parameters
  • run_idx (int) – The run the trace is over.

  • trace (list of tuple of int) – The trace values. Each tuple is of the form (traj_idx, frame_idx).

  • alt_rep (None or str) – If None uses default ‘positions’ representation otherwise chooses the representation from the ‘alt_reps’ compound field.

Returns

traj

Return type

mdtraj.Trajectory

_choose_rep_path(alt_rep)[source]

Given a positions specification string, gets the field name/path for it.

Parameters

alt_rep (str) – The short name (non relative path) for a representation of the positions.

Returns

rep_path – The relative field path to that representation. E.g. if you give it ‘positions’ or None it will simply return ‘positions’, however if you ask for ‘all_atoms’ it will return ‘alt_reps/all_atoms’.

Return type

str
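The mapping from a representation name to its field path, as described for this method, amounts to the following minimal sketch. `choose_rep_path` is a standalone stand-in written for illustration, not the method's actual source.

```python
def choose_rep_path(alt_rep=None):
    """Return the relative field path for a positions representation name."""
    if alt_rep is None or alt_rep == "positions":
        return "positions"
    # alternate representations live under the 'alt_reps' compound field
    return "alt_reps/{}".format(alt_rep)

default_path = choose_rep_path(None)
alt_path = choose_rep_path("all_atoms")
```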

traj_fields_to_mdtraj(traj_fields, alt_rep='positions')[source]

Create an mdtraj.Trajectory from a traj_fields dictionary.

Parameters
  • traj_fields (dict of str : arraylike) – Dictionary of the traj fields to their values

  • alt_rep (str) – The base alt rep name for the positions representation to use for the topology. The traj_fields should contain the corresponding alt_rep field.

Returns

traj

Return type

mdtraj.Trajectory

This is mainly a convenience function to retrieve the correct topology for the positions, which will be passed to the generic traj_fields_to_mdtraj function.

copy_run_slice(run_idx, target_file_path, target_grp_path, run_slice=None, mode='x')[source]

Copy this run to another HDF5 file (target_file_path) at the group (target_grp_path)