wepy.hdf5 module¶

Primary wepy simulation database driver and access API using the HDF5 format.

The HDF5 Format Specification¶

As part of the wepy framework this module provides a fully-featured API for creating and accessing data generated in weighted ensemble simulations run with wepy.

The need for a special purpose format is many-fold but primarily it is the nonlinear branching structure of walker trajectories coupled with weights.

That is, in standard simulations data is organized as independent linear trajectories of frames, each related linearly to the frame before it and the frame after it.

In weighted ensemble due to the resampling (i.e. cloning and merging) of walkers, a single frame may have multiple ‘child’ frames.

This is the primary motivation for this format.

However, in practice it solves several other issues and itself is a more general and flexible format than for just weighted ensemble simulations.

Concretely the WepyHDF5 format is simply an informally described schema that is commensurable with the HDF5 constructs of hierarchical groups (similar to unix filesystem directories) arranged as a tree with datasets as the leaves.

The hierarchy is fairly deep and so we will progress downwards from the top and describe each broad section in turn breaking it down when necessary.

Header¶

The items right under the root of the tree are:

runs
topology
_settings

The first item ‘runs’ is itself a group that contains all of the primary data from simulations. In WepyHDF5 the run is the unit dataset. All data internal to a run is self-contained. That is, multiple dependent trajectories (e.g. from cloning and merging) all exist within a single run.

This excludes metadata-like things that may be needed for interpreting this data, such as the molecular topology that imposes structure over a frame of atom positions. This information is placed in the ‘topology’ item.

The topology field has no specified internal structure at this time. However, with the current implementation of the WepyHDF5Reporter (which is the principal implementation for generating a WepyHDF5 object/file from simulations) this is simply a string dataset. This string dataset should be a JSON-compliant string. Its format is specified elsewhere and was borrowed from the mdtraj library.

Warning! This format and specification for the topology are subject to change in the future and will likely be kept unspecified indefinitely.
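As a rough illustration of what a JSON topology string dataset might hold, here is a minimal sketch. The key names (‘chains’, ‘residues’, ‘atoms’, ‘bonds’) are an assumption modeled loosely on mdtraj-style topologies and are not confirmed by this specification; only the requirement that the value be a JSON-compliant string comes from the text above.

```python
import json

# Hypothetical, heavily simplified topology for a single water molecule.
# The key names are assumptions for illustration, not the specified format.
topology = {
    "chains": [
        {"index": 0,
         "residues": [
             {"index": 0, "name": "HOH",
              "atoms": [
                  {"index": 0, "name": "O", "element": "O"},
                  {"index": 1, "name": "H1", "element": "H"},
                  {"index": 2, "name": "H2", "element": "H"},
              ]},
         ]},
    ],
    "bonds": [[0, 1], [0, 2]],
}

# The value stored in the 'topology' item would be the serialized string
topology_json = json.dumps(topology)

# Reading it back: parse the string and count the atoms
restored = json.loads(topology_json)
n_atoms = sum(len(residue["atoms"])
              for chain in restored["chains"]
              for residue in chain["residues"])
```

Because the topology is an opaque string as far as the format is concerned, any consumer has to parse it and should not rely on these particular keys.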

For most intents and purposes (which we assume to be for molecular or molecular-like simulations) the ‘topology’ item (and perhaps any other item at the top level other than those preceded by an underscore, such as the ‘_settings’ item) is merely useful metadata that applies to ALL runs and is not dynamical.

In the language of the orchestration module all data in ‘runs’ uses the same ‘apparatus’ which is the function that takes in the initial conditions for walkers and produces new walkers. The apparatus may differ in the specific values of parameters but not in kind. This is to facilitate runs that are continuations of other runs. For these kinds of simulations the state of the resampler, boundary conditions, etc. will not be as they were initially but are the same in kind or type.

All of the necessary type information of data in runs is kept in the ‘_settings’ group. This is used to serialize information about the data types, shapes, run to run continuations etc. This allows for the initialization of an empty (no runs) WepyHDF5 database at one time and filling of data at another time. Otherwise types of datasets would have to be inferred from the data itself, which may not exist yet.

As a convention, items which are preceded by an underscore (following the python convention) are to be considered hidden and mechanical to the proper functioning of various WepyHDF5 API features, such as sparse trajectory fields.

The ‘_settings’ is specified as a simple key-value structure, however values may be arbitrarily complex.

Runs¶

The meat of the format is contained within the runs group:

runs
- 0
- 1
- 2
- …

Under the runs group are a series of groups for each run. Runs are named according to the order in which they were added to the database.

Within a run (say ‘0’ from above) we have a number of items:

0
- init_walkers
- trajectories
- decision
- resampling
- resampler
- warping
- progress
- boundary_conditions

Trajectories¶

The ‘trajectories’ group is where the data for the frames of the walker trajectories is stored.

Even though the tree-like trajectories of weighted ensemble data may be well suited to having a tree-like storage topology we have opted to use something more familiar to the field, and have used a collection of linear “trajectories”.

This way of breaking up the trajectory data coupled with proper records of resampling (see below) allows for the imposition of a tree structure without committing to that as the data storage topology.

This allows the WepyHDF5 format to be easily used as a container format for collections of linear trajectories. While this is not supported in any real capacity it is one small step to convergence. We feel that a format that contains multiple trajectories is important for situations like weighted ensemble where trajectories are interdependent. The transition to a storage format like HDF5 however opens up many possibilities for new features for trajectories that have not occurred despite several attempts to forge new formats based on HDF5 (TODO: get references right; see work in mdtraj and MDHDF5).

Perhaps these formats have not caught on because the existing formats (e.g. XTC, DCD) for simple linear trajectories are good enough and there is little motivation to migrate.

However, the WepyHDF5 format (and the related sub-formats to be described, e.g. record groups and the trajectory format) covers both a new use case which can’t be achieved with the old formats and the old use cases with ease.

Once users see the power of using a format like HDF5 from using wepy they may continue to use it for simpler simulations.

In any case the ‘trajectories’ in the group for weighted ensemble simulations should be thought of only as containers and not literally as trajectories. That is frame 4 does not necessarily follow from frame 3. So one may think of them more as “lanes” or “slots” for trajectory data that needs to be stitched together with the appropriate resampling records.

The routines and methods for generating contiguous trajectories from the data in WepyHDF5 are given through the ‘analysis’ module, which generates “traces” through the dataset.
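To make the idea of a “trace” concrete, the following is a minimal sketch of stitching a contiguous lineage out of slots. The `parents` table and the `trace_lineage` function are hypothetical names used only for illustration; they are not the ‘analysis’ module API. We assume that `parents[c][s]` gives the slot at cycle `c - 1` from which the walker in slot `s` at cycle `c` descends.

```python
def trace_lineage(parents, last_cycle, slot):
    """Walk backwards from (slot, last_cycle) to cycle 0.

    Returns a list of (traj_idx, cycle_idx) pairs in chronological order.
    """
    trace = []
    for cycle_idx in range(last_cycle, -1, -1):
        trace.append((slot, cycle_idx))
        if cycle_idx > 0:
            # hop to the slot this walker descended from
            slot = parents[cycle_idx][slot]
    trace.reverse()
    return trace

# Hypothetical parent table for 3 slots over 3 cycles
parents = [
    None,        # cycle 0 has no parents
    [0, 0, 2],   # cycle 1: slot 1 holds a clone of slot 0
    [0, 1, 1],   # cycle 2: slot 2 holds a clone of slot 1
]

trace = trace_lineage(parents, last_cycle=2, slot=2)
```

Here the resulting trace hops across slots 0, 1 and 2, which is exactly why the frames within one ‘trajectory’ group cannot be assumed to follow one another.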

With this in mind we will describe the sub-format of a trajectory now.

The ‘trajectories’ group is similar to the ‘runs’ group in that it has sub-groups whose names are numbers. These numbers, however, are not the order in which the trajectories were created but an index for each trajectory; the trajectories are typically laid out all at once.

For a wepy simulation with a constant number of walkers you will only ever need as many trajectories/slots as there are walkers. So if you have 8 walkers then you will have trajectories 0 through 7. Concretely:

runs
- 0
  - trajectories
    - 0
    - 1
    - 2
    - 3
    - 4
    - 5
    - 6
    - 7
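The layout above maps directly onto slash-separated HDF5 paths. The helper below is a small sketch (the function name `traj_group_path` is ours, not part of wepy) that builds the group path for each slot of a run with 8 walkers.

```python
def traj_group_path(run_idx, traj_idx):
    """Build the path of a trajectory (slot) group inside a run."""
    return "runs/{}/trajectories/{}".format(run_idx, traj_idx)

n_walkers = 8

# One slot per walker: trajectories 0 through 7 of run 0
slot_paths = [traj_group_path(0, traj_idx) for traj_idx in range(n_walkers)]
```

Paths of this form can be used to index into an open h5py file object, since h5py accepts slash-separated paths for nested groups.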

If we look at trajectory 0 we might see the following groups within:

positions
box_vectors
velocities
weights

This is what you would expect for a constant pressure molecular dynamics simulation, where you have the positions of the atoms, the box size, and the velocities of the atoms.

The particulars for what “fields” a trajectory in general has are not important but this important use-case is directly supported in the WepyHDF5 format.

In any such simulation, however, the ‘weights’ field will appear since this is the weight of the walker of this frame and is a value important to weighted ensemble and not the underlying dynamics.

The naive approach to these fields is that each is a dataset of dimension (n_frames, feature_vector_shape[0], …) where the first dimension is the cycle_idx and the rest of the dimensions are determined by the atomic feature vector for each field for a single frame.

For example, the positions for a molecular simulation with 100 atoms with x, y, and z coordinates that ran for 1000 cycles would be a dataset of the shape (1000, 100, 3). Similarly the box vectors would be (1000, 3, 3) and the weights would be (1000, 1).
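The shape rule can be written down in one line. This is only a sketch of the arithmetic; `field_dataset_shape` is a hypothetical helper name.

```python
def field_dataset_shape(n_frames, feature_shape):
    """Prepend the frame (cycle) axis to the per-frame feature shape."""
    return (n_frames,) + tuple(feature_shape)

# 1000 cycles of a system with 100 atoms in 3 spatial dimensions
positions_shape = field_dataset_shape(1000, (100, 3))
box_vectors_shape = field_dataset_shape(1000, (3, 3))
weights_shape = field_dataset_shape(1000, (1,))
```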

This uniformity vastly simplifies accessing and adding new variables and requires that individual state values in walkers always be arrays with shapes, even when they are single values (e.g. energy). The exception is the weight, which is handled separately.

However, this situation is actually more complex to allow for special features.

First of all is the presence of compound fields which allow nesting of multiple groups.

The above “trajectory fields” would have identifiers such as the literal strings ‘positions’ and ‘box_vectors’, while a compound field would have an identifier ‘observables/rmsd’ or ‘alt_reps/binding_site’.

Use of trajectory field names using the ‘/’ path separator will automatically make a field a group and the last element of the field name the dataset. So for the observables example we might have:

0
- observables
  - rmsd
  - sasa

Where the rmsd would be accessed as a trajectory field of trajectory 0 as ‘observables/rmsd’ and the solvent accessible surface area as ‘observables/sasa’.
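A compound field name therefore decomposes into zero or more group names followed by a dataset name. A minimal sketch of that decomposition follows; `split_field_path` is an illustrative name and not a wepy function.

```python
def split_field_path(field_path):
    """Split a trajectory field name into (group names, dataset name)."""
    *group_names, dataset_name = field_path.split("/")
    return group_names, dataset_name

# A simple field has no enclosing group
simple = split_field_path("positions")

# A compound field names its group first and the dataset last
compound = split_field_path("observables/rmsd")
```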

This example introduces how the WepyHDF5 format is not only useful for storing data produced by simulation but also in the analysis of that data and computation of by-frame quantities.

The ‘observables’ compound group key prefix is special and will be used in the ‘compute_observables’ method.

The other special compound group key prefix is ‘alt_reps’ which is used for particle simulations to store “alternate representation” of the positions. This is useful in cooperation with the next feature of wepy trajectory fields to allow for more economical storage of data.

The next feature (and complication of the format) is the allowance for sparse fields. As the fields were introduced we said that they should have as many feature vectors as there are frames for the simulation. In the example however, you will notice that storing both the full atomic positions and velocities for a long simulation requires a heavy storage burden.

So perhaps you only want to store the velocities (or forces) every 100 frames so that you are able to restart a simulation from midway through the simulation. This is achieved through sparse fields.

A sparse field is no longer a dataset but a group with two items:

_sparse_idxs
data

The ‘_sparse_idxs’ are simply a dataset of integers that assign each element of the ‘data’ dataset to a frame index. Using the above example, if we run a simulation for 1000 frames with 100 atoms and save the velocities every 100 frames, we would have a ‘velocities/data’ dataset of shape (10, 100, 3), which is 100 times less data than if it were saved every frame.

While this complicates the storage format, use of the proper API methods should be transparent whether you are returning a sparse field or not.
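The relationship between ‘_sparse_idxs’ and ‘data’ can be sketched with plain lists standing in for the two datasets. The lookup function `get_sparse_frame` is hypothetical and only illustrates the indexing idea; it is not the wepy accessor. Saving every 100th frame of a 1000-frame run yields 10 stored entries.

```python
# Frame indices at which velocities were saved: 0, 100, ..., 900
sparse_idxs = list(range(0, 1000, 100))

# Stand-in for the 'data' dataset: one entry per saved frame
data = ["velocities@{}".format(frame_idx) for frame_idx in sparse_idxs]

def get_sparse_frame(sparse_idxs, data, frame_idx):
    """Return the stored value for frame_idx, or None if it was not saved."""
    try:
        position = sparse_idxs.index(frame_idx)
    except ValueError:
        return None
    return data[position]

saved = get_sparse_frame(sparse_idxs, data, 300)
missing = get_sparse_frame(sparse_idxs, data, 301)
```

Returning a placeholder for frames that were never saved is one possible design; an accessor could equally raise an error or return a masked value.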

As alluded to above, sparse fields can be used for more than just accessory fields. In many simulations, such as full atomistic simulations of proteins in solvent, we often don’t care about the dynamics of most of the atoms in the simulation and so would like to not have to save them.

The ‘alt_reps’ compound field is meant to solve this. For example, the WepyHDF5Reporter supports a special option to save only a subset of the atoms in the main ‘positions’ field but also to save the full atomic system as an alternate representation, which is the field name ‘alt_reps/all_atoms’. That way you can still save the full system every once in a while but be economical in what positions you save every single frame.
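The combination of a reduced main representation and a sparse full representation might look like the following sketch. Plain Python lists and dicts stand in for HDF5 datasets and groups; the stride value, the toy coordinates, and the helper `select_atoms` are assumptions for illustration and do not reflect the reporter’s actual options.

```python
def select_atoms(frame_positions, atom_idxs):
    """Keep only the coordinates of the chosen atoms."""
    return [frame_positions[i] for i in atom_idxs]

# Toy system: 4 frames of 5 atoms, each atom an (x, y, z) tuple
full_frames = [[(float(f), float(a), 0.0) for a in range(5)]
               for f in range(4)]

main_rep_idxs = [0, 1]   # atoms kept in the main 'positions' field
all_atoms_stride = 2     # hypothetical: save the full system every 2 frames

positions = []                                # dense: one entry per frame
all_atoms = {"_sparse_idxs": [], "data": []}  # sparse 'alt_reps/all_atoms'

for frame_idx, frame in enumerate(full_frames):
    positions.append(select_atoms(frame, main_rep_idxs))
    if frame_idx % all_atoms_stride == 0:
        all_atoms["_sparse_idxs"].append(frame_idx)
        all_atoms["data"].append(frame)
```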

Note that there really isn’t a way to achieve this with other formats. You either make a completely new trajectory with only the atoms of interest and now you are duplicating those in two places, or you duplicate and then filter your full systems trajectory file and rely on some sort of index to always live with it in the filesystem, which is a very precarious scenario. The situation is particularly hopeless for weighted ensemble trajectories.

Init Walkers¶

The data stored in the ‘trajectories’ section is the data that is returned after running dynamics in a cycle. Since we view the WepyHDF5 as a completely self-contained format for simulations it seems negligent to rely on outside sources (such as the filesystem) for the initial structures that seeded the simulations. These states (and weights) can be stored in this group.

The format of this group is identical to the one for trajectories except that there is only one frame for each slot and so the shape of the datasets for each field is just the shape of the feature vector.

Record Groups¶

TODO: add reference to reference groups

The last five items are what are called ‘record groups’ and all follow the same format.

Each record group itself contains a number of datasets, where the names of the datasets correspond to the ‘field names’ from the record group specification. So each record group is simply a key-value store where the values must be datasets.

For instance the fields in the ‘resampling’ (which is particularly important as it encodes the branching structure) record group for a WExplore resampler simulation are:

step_idx
walker_idx
decision_id
target_idxs
region_assignment

Where the ‘step_idx’ is an integer specifying which step of resampling within the cycle the resampling action took place (the cycle index is metadata for the group). The ‘walker_idx’ is the index of the walker that this action was assigned to. The ‘decision_id’ is an integer that is related to an enumeration of decision types that encodes which discrete action is to be taken for this resampling event (the enumeration is in the ‘decision’ item of the run groups). The ‘target_idxs’ is a variable length 1-D array of integers which assigns the results of the action to specific target ‘slots’ (which was discussed for the ‘trajectories’ run group). And the ‘region_assignment’ is specific to WExplore which reports on which region the walker was in at that time, and is a variable length 1-D array of integers.
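As a sketch of how such records could encode branching, the example below reduces a handful of resampling records to a mapping from target slot to source walker. The `Decision` enumeration, its values, and the rule that a squashed walker contributes no descendant are assumptions made for illustration; the real enumeration lives in the ‘decision’ item of the run and may differ. The WExplore-specific ‘region_assignment’ field is omitted for brevity.

```python
from enum import Enum

class Decision(Enum):
    # Hypothetical decision enumeration, for illustration only
    NOTHING = 1
    CLONE = 2
    SQUASH = 3
    KEEP_MERGE = 4

# Hypothetical records for one resampling step of one cycle
records = [
    {"step_idx": 0, "walker_idx": 0,
     "decision_id": Decision.CLONE.value, "target_idxs": (0, 1)},
    {"step_idx": 0, "walker_idx": 1,
     "decision_id": Decision.SQUASH.value, "target_idxs": (2,)},
    {"step_idx": 0, "walker_idx": 2,
     "decision_id": Decision.KEEP_MERGE.value, "target_idxs": (2,)},
]

def slot_parents(records):
    """Map each target slot to the walker whose state it receives."""
    parents = {}
    for record in records:
        if Decision(record["decision_id"]) is Decision.SQUASH:
            # assumed: a squashed walker's state is discarded
            continue
        for target_idx in record["target_idxs"]:
            parents[target_idx] = record["walker_idx"]
    return parents

parents = slot_parents(records)
```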

Additionally, record groups are broken into two types:

continual
sporadic

Continual records occur once per cycle and so there is no extra indexing necessary.

Sporadic records can happen multiple or zero times per cycle and so require a special index for them which is contained in the extra dataset ‘_cycle_idxs’.
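The role of ‘_cycle_idxs’ can be sketched as a parallel array that tags each sporadic record with the cycle it belongs to. The record values and the helper `records_by_cycle` below are made up for illustration.

```python
from collections import defaultdict

# Stand-ins for the '_cycle_idxs' dataset and one record field dataset.
# Cycle 0 has two records, cycles 1 and 2 have none, and so on.
cycle_idxs = [0, 0, 3, 7]
warp_records = ["warp-a", "warp-b", "warp-c", "warp-d"]

def records_by_cycle(cycle_idxs, records):
    """Group sporadic records by the cycle in which they occurred."""
    grouped = defaultdict(list)
    for cycle_idx, record in zip(cycle_idxs, records):
        grouped[cycle_idx].append(record)
    return dict(grouped)

grouped = records_by_cycle(cycle_idxs, warp_records)
```

For a continual record group no such index is needed, because the position of a record in the dataset is already its cycle index.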

It is worth noting that the underlying methods for each record group are general. So while these are the official wepy record groups that are supported, if there is a use-case that demands a new record group it is a fairly straightforward task from a developer’s perspective.

wepy.hdf5.TOPOLOGY = 'topology'¶: Default header apparatus dataset. The molecular topology dataset.

wepy.hdf5.SETTINGS = '_settings'¶: Name of the settings group in the header group.

wepy.hdf5.RUNS = 'runs'¶: The group name for runs.

wepy.hdf5.RUN_IDX = 'run_idx'¶: Metadata field for run groups for the run index within this file.

wepy.hdf5.RUN_START_SNAPSHOT_HASH = 'start_snapshot_hash'¶: Metadata field for a run that corresponds to the hash of the starting simulation snapshot in orchestration.

wepy.hdf5.RUN_END_SNAPSHOT_HASH = 'end_snapshot_hash'¶: Metadata field for a run that corresponds to the hash of the ending simulation snapshot in orchestration.

wepy.hdf5.TRAJ_IDX = 'traj_idx'¶: Metadata field for trajectory groups for the trajectory index in that run.

wepy.hdf5.CYCLE_IDX = 'cycle_idx'¶: String for setting the names of cycle indices in records and miscellaneous situations.

wepy.hdf5.SPARSE_FIELDS = 'sparse_fields'¶: Settings field name for sparse field trajectory field flags.

wepy.hdf5.N_ATOMS = 'n_atoms'¶: Settings field name group for the number of atoms in the default positions field.

wepy.hdf5.N_DIMS_STR = 'n_dims'¶: Settings field name for positions field spatial dimensions.

wepy.hdf5.MAIN_REP_IDXS = 'main_rep_idxs'¶: Settings field name for the indices of the full apparatus topology in the default positions trajectory field.

wepy.hdf5.ALT_REPS_IDXS = 'alt_reps_idxs'¶: Settings field name for the different ‘alt_reps’. The indices of the atoms from the full apparatus topology for each.

wepy.hdf5.FIELD_FEATURE_SHAPES_STR = 'field_feature_shapes'¶: Settings field name for the trajectory field shapes.

wepy.hdf5.FIELD_FEATURE_DTYPES_STR = 'field_feature_dtypes'¶: Settings field name for the trajectory field data types.

wepy.hdf5.UNITS = 'units'¶: Settings field name for the units of the trajectory fields.

wepy.hdf5.RECORD_FIELDS = 'record_fields'¶: Settings field name for the record fields that are to be included in the truncated listing of record group fields.

wepy.hdf5.CONTINUATIONS = 'continuations'¶: Settings field name for the continuations relationships between runs.

wepy.hdf5.TRAJECTORIES = 'trajectories'¶: Run field name for the trajectories group.

wepy.hdf5.INIT_WALKERS = 'init_walkers'¶: Run field name for the initial walkers group.

wepy.hdf5.DECISION = 'decision'¶: Run field name for the decision enumeration group.

wepy.hdf5.RESAMPLING = 'resampling'¶: Record group run field name for the resampling records

wepy.hdf5.RESAMPLER = 'resampler'¶: Record group run field name for the resampler records

wepy.hdf5.WARPING = 'warping'¶: Record group run field name for the warping records

wepy.hdf5.PROGRESS = 'progress'¶: Record group run field name for the progress records

wepy.hdf5.BC = 'boundary_conditions'¶: Record group run field name for the boundary conditions records

wepy.hdf5.NONE_STR = 'None'¶: String signifying a field of unspecified shape. Used for serializing the None python object.

wepy.hdf5.CYCLE_IDXS = '_cycle_idxs'¶: Group name for the cycle indices of sporadic records.

wepy.hdf5.SPORADIC_RECORDS = ('resampler', 'warping', 'resampling', 'boundary_conditions')¶: Enumeration of the record groups that are sporadic.

wepy.hdf5.N_DIMS = 3¶: Number of dimensions for the default positions.

wepy.hdf5.WEIGHTS = 'weights'¶: The field name for the frame weights.

wepy.hdf5.POSITIONS = 'positions'¶: The field name for the default positions.

wepy.hdf5.BOX_VECTORS = 'box_vectors'¶: The field name for the default box vectors.

wepy.hdf5.VELOCITIES = 'velocities'¶: The field name for the default velocities.

wepy.hdf5.FORCES = 'forces'¶: The field name for the default forces.

wepy.hdf5.TIME = 'time'¶: The field name for the default time.

wepy.hdf5.KINETIC_ENERGY = 'kinetic_energy'¶: The field name for the default kinetic energy.

wepy.hdf5.POTENTIAL_ENERGY = 'potential_energy'¶: The field name for the default potential energy.

wepy.hdf5.BOX_VOLUME = 'box_volume'¶: The field name for the default box volume.

wepy.hdf5.PARAMETERS = 'parameters'¶: The field name for the default parameters.

wepy.hdf5.PARAMETER_DERIVATIVES = 'parameter_derivatives'¶: The field name for the default parameter derivatives.

wepy.hdf5.ALT_REPS = 'alt_reps'¶: The field name for the default compound field alternate representations.

wepy.hdf5.OBSERVABLES = 'observables'¶: The field name for the default compound field observables.

wepy.hdf5.WEIGHT_SHAPE = (1,)¶: Weights feature vector shape.

wepy.hdf5.WEIGHT_DTYPE¶: Weights feature vector data type.

wepy.hdf5.FIELD_FEATURE_SHAPES = (('time', (1,)), ('box_vectors', (3, 3)), ('box_volume', (1,)), ('kinetic_energy', (1,)), ('potential_energy', (1,)))¶: Default shapes for the default fields.

wepy.hdf5.FIELD_FEATURE_DTYPES = (('positions', <class 'float'>), ('velocities', <class 'float'>), ('forces', <class 'float'>), ('time', <class 'float'>), ('box_vectors', <class 'float'>), ('box_volume', <class 'float'>), ('kinetic_energy', <class 'float'>), ('potential_energy', <class 'float'>))¶: Default data types for the default fields.

wepy.hdf5.POSITIONS_LIKE_FIELDS = ('velocities', 'forces')¶: Default trajectory fields which are the same shape as the main positions field.
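The constants above suggest how a complete set of default feature shapes could be assembled once the number of atoms is known. The following is a sketch under that assumption; the constants are copied from the values listed here, but the function `default_feature_shapes` is hypothetical and the module’s actual logic may differ.

```python
# Values copied from the module constants listed above
FIELD_FEATURE_SHAPES = (('time', (1,)),
                        ('box_vectors', (3, 3)),
                        ('box_volume', (1,)),
                        ('kinetic_energy', (1,)),
                        ('potential_energy', (1,)))
POSITIONS_LIKE_FIELDS = ('velocities', 'forces')
N_DIMS = 3

def default_feature_shapes(n_atoms, n_dims=N_DIMS):
    """Sketch: combine fixed default shapes with positions-like shapes."""
    shapes = dict(FIELD_FEATURE_SHAPES)
    positions_shape = (n_atoms, n_dims)
    shapes['positions'] = positions_shape
    # velocities and forces share the shape of the positions field
    for field_name in POSITIONS_LIKE_FIELDS:
        shapes[field_name] = positions_shape
    return shapes

shapes = default_feature_shapes(100)
```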

wepy.hdf5.DATA = 'data'¶: Name of the dataset in sparse trajectory fields.

wepy.hdf5.SPARSE_IDXS = '_sparse_idxs'¶: Name of the dataset that indexes sparse trajectory fields.

wepy.hdf5._iter_field_paths(grp)[source]¶

Return all subgroup field name paths from a group.

Useful for compound fields. For example if you have the group observables with multiple subfields:

observables
- rmsd
- sasa

Passing the h5py group ‘observables’ will return the full field names for each subfield:

‘observables/rmsd’
‘observables/sasa’

Parameters:: grp (h5py.Group) – The group to enumerate subfield names for.
Returns:: subfield_names – The full names for the subfields of the group.
Return type:: list of str
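The traversal can be sketched without h5py by letting nested dicts stand in for groups and any other value stand in for a dataset. This is an illustration of the idea only, not the function’s source; unlike an h5py group, a dict does not know its own name, so the group name is passed in as a prefix.

```python
def iter_field_paths(group, prefix=""):
    """Collect the full '/'-separated names of all datasets under a group."""
    paths = []
    for name, item in group.items():
        path = prefix + name
        if isinstance(item, dict):
            # a sub-group: recurse with its name added to the prefix
            paths.extend(iter_field_paths(item, prefix=path + "/"))
        else:
            paths.append(path)
    return paths

# Stand-in for the 'observables' group with two subfields
observables = {"rmsd": [0.1, 0.2], "sasa": [12.0, 11.5]}

paths = iter_field_paths(observables, prefix="observables/")
```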

class wepy.hdf5.WepyHDF5(filename, mode='x', topology=None, units=None, sparse_fields=None, feature_shapes=None, feature_dtypes=None, n_dims=None, alt_reps=None, main_rep_idxs=None, swmr_mode=False, expert_mode=False)[source]¶

Bases: object

Wrapper for h5py interface to an HDF5 file object for creation and access of WepyHDF5 data.

This is the primary implementation of the API for creating, accessing, and modifying data in an HDF5 file that conforms to the WepyHDF5 specification.

Constructor for the WepyHDF5 class.

Initialize a new Wepy HDF5 file. This will create an h5py.File object.

The File will be closed after construction by default.

mode:
r – Read-only, file must exist
r+ – Read/write, file must exist
w – Create file, truncate if exists
x or w- – Create file, fail if exists
a – Read/write if exists, create otherwise

Parameters:

filename (str) – File path
mode (str) – Mode specification for opening the HDF5 file.
topology (str) – JSON string representing topology of system being simulated.
units (dict of str : str, optional) – Mapping of trajectory field names to string specs for units.
sparse_fields (list of str, optional) – List of trajectory fields that should be initialized as sparse.
feature_shapes (dict of str : shape_spec, optional) – Mapping of trajectory fields to their shape spec for initialization.
feature_dtypes (dict of str : dtype_spec, optional) – Mapping of trajectory fields to their dtype spec for initialization.
n_dims (int, default: 3) – Set the number of spatial dimensions for the default positions trajectory field.
alt_reps (dict of str : list of int, optional) – Specifies that there will be ‘alt_reps’ of positions each named by the keys of this mapping and containing the indices in each value list.
main_rep_idxs (list of int, optional) – The indices of atom positions to save as the main ‘positions’ trajectory field. Defaults to all atoms.
expert_mode (bool) – If True no initialization is performed other than the setting of the filename. Useful mainly for debugging.

Raises:

AssertionError – If the mode is not one of the supported mode specs.
AssertionError – If a topology is not given for a creation mode.

Warns:

If initialization data was given but the file was opened in a read mode.

MODES = ('r', 'r+', 'w', 'w-', 'x', 'a')¶: The recognized modes for opening the WepyHDF5 file.

WRITE_MODES = ('r+', 'w', 'w-', 'x', 'a')¶
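A minimal sketch of how these two tuples could be used to validate a mode and decide whether writing is allowed. The tuples are copied from the class attributes above, while `check_mode` is a hypothetical helper and not a method of WepyHDF5.

```python
MODES = ('r', 'r+', 'w', 'w-', 'x', 'a')
WRITE_MODES = ('r+', 'w', 'w-', 'x', 'a')

def check_mode(mode):
    """Validate a mode spec and report whether it permits writing."""
    assert mode in MODES, \
        "mode must be one of {}, got {!r}".format(MODES, mode)
    return mode in WRITE_MODES

read_only_can_write = check_mode('r')
exclusive_create_can_write = check_mode('x')
```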

property swmr_mode¶

_create_init()[source]¶

Creation mode constructor.

Completely overwrite the data in the file. Reinitialize the values and set with the new ones if given.

_read_write_init()[source]¶: Read-write mode constructor.

_add_init()[source]¶

The addition mode constructor.

Create the dataset if it doesn’t exist and put it in r+ mode, otherwise, just open in r+ mode.

_read_init()[source]¶: Read mode constructor.

_set_default_init_field_attributes(n_dims=None)[source]¶

Sets the feature_shapes and feature_dtypes to be the default for this module. These will be used to initialize field datasets when none are given during construction (i.e. for sparse values).

Parameters:: n_dims (int)

_get_field_path_grp(run_idx, traj_idx, field_path)[source]¶

Given a field path for the trajectory returns the group the field’s dataset goes in and the key for the field name in that group.

The field path for a simple field is just the name of the field and for a compound field it is the compound field group name with the subfield separated by a ‘/’ like ‘observables/observable1’ where ‘observables’ is the compound field group and ‘observable1’ is the subfield name.

Parameters:

run_idx (int)
traj_idx (int)
field_path (str)

Returns:

group (h5py.Group)
field_name (str)

_init_continuations()[source]¶

This will either create a dataset in the settings for the continuations or if continuations already exist it will reinitialize them and delete the data that exists there.

Returns:: continuation_dset
Return type:: h5py.Dataset

_add_run_init(run_idx, continue_run=None)[source]¶

Routines for creating a run, which include updating and setting object global variables and increasing the counter for the number of runs.

Parameters:

run_idx (int)
continue_run (int) – Index of the run to continue.

_add_init_walkers(init_walkers_grp, init_walkers)[source]¶

Adds the run field group for the initial walkers.

Parameters:

init_walkers_grp (h5py.Group) – The group to add the walker data to.
init_walkers (list of objects implementing the Walker interface) – The walkers to save in the group

_init_run_sporadic_record_grp(run_idx, run_record_key, fields)[source]¶

Initialize a sporadic record group for a run.

Parameters:

run_idx (int)
run_record_key (str) – The record group name.
fields (list of field specs) – Each field spec is a 3-tuple of (field_name : str, field_shape : shape_spec, field_dtype : dtype_spec)

Returns:

record_group

Return type:

h5py.Group

_init_run_continual_record_grp(run_idx, run_record_key, fields)[source]¶

Initialize a continual record group for a run.

Parameters:

run_idx (int)
run_record_key (str) – The record group name.
fields (list of field specs) – Each field spec is a 3-tuple of (field_name : str, field_shape : shape_spec, field_dtype : dtype_spec)

Returns:

record_group

Return type:

h5py.Group

_init_run_records_field(run_idx, run_record_key, field_name, field_shape, field_dtype)[source]¶

Initialize a single field for a run record group.

Parameters:

run_idx (int)
run_record_key (str) – The name of the record group.
field_name (str) – The name of the field in the record group.
field_shape (tuple of int) – The shape of the dataset for the field.
field_dtype (dtype_spec) – An h5py recognized data type.

Returns:

dataset

Return type:

h5py.Dataset

_is_sporadic_records(run_record_key)[source]¶

Tests whether a record group is sporadic or not.

Parameters:: run_record_key (str) – Record group name.
Returns:: is_sporadic – True if the record group is sporadic False if not.
Return type:: bool

_init_traj_field(run_idx, traj_idx, field_path, feature_shape, dtype)[source]¶

Initialize a trajectory field.

Initialize a data field in the trajectory to be empty but resizeable.

Parameters:

run_idx (int)
traj_idx (int)
field_path (str) – Field name specification.
feature_shape (shape_spec) – Specification of shape of a feature vector of the field.
dtype (dtype_spec) – Specification of the feature vector datatype.

_init_contiguous_traj_field(run_idx, traj_idx, field_path, shape, dtype)[source]¶

Initialize a contiguous (non-sparse) trajectory field.

Parameters:

run_idx (int)
traj_idx (int)
field_path (str) – Field name specification.
feature_shape (tuple of int) – Shape of the feature vector of the field.
dtype (dtype_spec) – H5py recognized datatype

_init_sparse_traj_field(run_idx, traj_idx, field_path, shape, dtype)[source]¶

Parameters:

run_idx (int)
traj_idx (int)
field_path (str) – Field name specification.
feature_shape (shape_spec) – Specification for the shape of the feature.
dtype (dtype_spec) – Specification for the dtype of the feature.

_init_traj_fields(run_idx, traj_idx, field_paths, field_feature_shapes, field_feature_dtypes)[source]¶

Initialize a number of fields for a trajectory.

Parameters:

run_idx (int)
traj_idx (int)
field_paths (list of str) – List of field names.
field_feature_shapes (list of shape_specs)
field_feature_dtypes (list of dtype_specs)

_add_traj_field_data(run_idx, traj_idx, field_path, field_data, sparse_idxs=None)[source]¶

Add a trajectory field to a trajectory.

If the sparse indices are given the field will be created as a sparse field otherwise a normal one.

Parameters:

run_idx (int)
traj_idx (int)
field_path (str) – Field name.
field_data (numpy.array) – The data array to set for the field.
sparse_idxs (arraylike of int of shape (1,)) – List of cycle indices that the data corresponds to.

_extend_contiguous_traj_field(run_idx, traj_idx, field_path, field_data)[source]¶

Add multiple new frames worth of data to the end of an existing contiguous (non-sparse) trajectory field.

Parameters:

run_idx (int)
traj_idx (int)
field_path (str) – Field name
field_data (numpy.array) – The frames of data to add.

_extend_sparse_traj_field(run_idx, traj_idx, field_path, values, sparse_idxs)[source]¶

Add multiple new frames worth of data to the end of an existing sparse trajectory field.

Parameters:

run_idx (int)
traj_idx (int)
field_path (str) – Field name
values (numpy.array) – The frames of data to add.
sparse_idxs (list of int) – The cycle indices the values correspond to.

_add_sparse_field_flag(field_path)[source]¶

Register a trajectory field as sparse in the header settings.

Parameters:: field_path (str) – Name of the trajectory field you want to flag as sparse

_add_field_feature_shape(field_path, field_feature_shape)[source]¶

Add the shape to the header settings for a trajectory field.

Parameters:

field_path (str) – The name of the trajectory field you want to set for.
field_feature_shape (shape_spec) – The shape spec to serialize as a dataset.

_add_field_feature_dtype(field_path, field_feature_dtype)[source]¶

Add the data type to the header settings for a trajectory field.

Parameters:

field_path (str) – The name of the trajectory field you want to set for.
field_feature_dtype (dtype_spec) – The dtype spec to serialize as a dataset.

_set_field_feature_shape(field_path, field_feature_shape)[source]¶

Add the trajectory field shape to header settings or set the value.

Parameters:

field_path (str) – The name of the trajectory field you want to set for.
field_feature_shape (shape_spec) – The shape spec to serialize as a dataset.

_set_field_feature_dtype(field_path, field_feature_dtype)[source]¶

Add the trajectory field dtype to header settings or set the value.

Parameters:

field_path (str) – The name of the trajectory field you want to set for.
field_feature_dtype (dtype_spec) – The dtype spec to serialize as a dataset.

_extend_run_record_data_field(run_idx, run_record_key, field_name, field_data)[source]¶

Primitive record append method.

Adds data for a single field dataset in a run records group. This is done without paying attention to whether it is sporadic or continual and is supposed to be only the data write method.

Parameters:

run_idx (int)
run_record_key (str) – Name of the record group.
field_name (str) – Name of the field in the record group to add to.
field_data (arraylike) – The data to add to the field.

_run_record_namedtuple(run_record_key)[source]¶

Generate a namedtuple record type for a record group.

The class name will be formatted like ‘{}_Record’ where the {} will be replaced with the name of the record group.

Parameters:: run_record_key (str) – Name of the record group
Returns:: RecordType – The record type to generate records for this record group.
Return type:: namedtuple
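The naming scheme can be sketched with the standard library’s namedtuple factory. The helper name `make_record_type` and the particular field list are assumptions for illustration; in wepy the fields come from the record group settings.

```python
from collections import namedtuple

def make_record_type(run_record_key, field_names):
    """Create a record type named '<record group>_Record'."""
    return namedtuple('{}_Record'.format(run_record_key), field_names)

# Hypothetical field list for the 'resampling' record group
ResamplingRecord = make_record_type(
    'resampling',
    ['cycle_idx', 'step_idx', 'walker_idx', 'decision_id', 'target_idxs'])

record = ResamplingRecord(cycle_idx=0, step_idx=0, walker_idx=3,
                          decision_id=2, target_idxs=(3, 4))
```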

_convert_record_field_to_table_column(run_idx, run_record_key, record_field)[source]¶

Converts a dataset of feature vectors to more palatable values for use in external datasets.

For single value feature vectors it unwraps them into single values.

For 1-D feature vectors it casts them as tuples.

Anything of higher rank will raise an error.

Parameters:

run_idx (int)
run_record_key (str) – Name of the record group
record_field (str) – Name of the field of the record group

Returns:

record_dset – Table-ified values

Return type:

list

Raises:

TypeError – If the field feature vector shape rank is greater than 1.
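
The three conversion rules above can be sketched in plain Python. This is only an approximation under stated assumptions: nested lists stand in for the HDF5 dataset, and the function names feature_rank and to_table_column are hypothetical, not wepy's.

```python
def feature_rank(feature):
    """Count the nesting depth of a feature given as nested lists.

    A bare scalar has rank 0, a flat list has rank 1, and so on.
    """
    rank = 0
    while isinstance(feature, (list, tuple)):
        rank += 1
        if not feature:
            break
        feature = feature[0]
    return rank


def to_table_column(features):
    """Hypothetical stand-in for the conversion rules described above."""
    column = []
    for feature in features:
        rank = feature_rank(feature)
        if rank > 1:
            # anything of higher rank than 1 cannot be put in a table cell
            raise TypeError(
                "cannot table-ify a feature of rank {}".format(rank))
        if rank == 1 and len(feature) == 1:
            # single-value feature vectors are unwrapped
            column.append(feature[0])
        elif rank == 1:
            # 1-D feature vectors become tuples
            column.append(tuple(feature))
        else:
            # already a scalar, keep as is
            column.append(feature)
    return column
```

For example, to_table_column([[1], [2]]) unwraps each single-value vector, while a feature such as [[1, 2], [3, 4]] has rank 2 and raises TypeError.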

_convert_record_fields_to_table_columns(run_idx, run_record_key)[source]¶

Convert record group data to truncated namedtuple records.

This uses the specified record fields from the header settings to choose which record group fields to apply this to.

Does no checking to make sure the fields are “table-ifiable”. If a field is not it will raise a TypeError.

Parameters:

run_idx (int)
run_record_key (str) – The name of the record group

Returns:

table_fields – Mapping of the record group field to the table-ified values.

Return type:

dict of str : list

_make_records(run_record_key, cycle_idxs, fields)[source]¶

Generate a list of proper (namedtuple) records for a record group.

Parameters:

run_record_key (str) – Name of the record group
cycle_idxs (list of int) – The cycle indices you want to get records for.
fields (list of str) – The fields to make record entries for.

Returns:

records

Return type:

list of namedtuple objects

_run_records_sporadic(run_idxs, run_record_key)[source]¶

Generate records for a sporadic record group for a multi-run contig.

If multiple run indices are given assumes that these are a contig (e.g. the second run index is a continuation of the first and so on). This method is considered low-level and does no checking to make sure this is true.

The cycle indices of records from “continuation” runs will be modified so that the records are indexed as if they came from a single run.

Uses the record fields settings to decide which fields to use.

Parameters:

run_idxs (list of int) – The indices of the runs in the order they are in the contig
run_record_key (str) – Name of the record group

Returns:

records

Return type:

list of namedtuple objects

_run_records_continual(run_idxs, run_record_key)[source]¶

Generate records for a continual record group for a multi-run contig.

If multiple run indices are given assumes that these are a contig (e.g. the second run index is a continuation of the first and so on). This method is considered low-level and does no checking to make sure this is true.

The cycle indices of records from “continuation” runs will be modified so that the records are indexed as if they came from a single run.

Uses the record fields settings to decide which fields to use.

Parameters:

run_idxs (list of int) – The indices of the runs in the order they are in the contig
run_record_key (str) – Name of the record group

Returns:

records

Return type:

list of namedtuple objects
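
The re-indexing of cycle indices across a contig can be sketched as follows. This is a minimal illustration, assuming each continuation run's cycle indices are shifted by the total number of cycles in the runs before it. The function name and its inputs are hypothetical and do not reflect how wepy stores or reads this information.

```python
def reindex_contig_cycles(run_cycle_idxs, run_n_cycles):
    """Hypothetical sketch of contig cycle re-indexing.

    run_cycle_idxs: one list of record cycle indices per run, in contig order
    run_n_cycles: the number of cycles in each run, in the same order
    """
    contig_idxs = []
    offset = 0
    for cycle_idxs, n_cycles in zip(run_cycle_idxs, run_n_cycles):
        # shift this run's indices past all cycles of the preceding runs
        contig_idxs.extend(idx + offset for idx in cycle_idxs)
        offset += n_cycles
    return contig_idxs


# Two runs: the first has 3 cycles, the second has 4 cycles
contig_cycle_idxs = reindex_contig_cycles([[0, 2], [1]], [3, 4])
```

Here the record at cycle 1 of the second run ends up at contig cycle 4, because the first run contributed 3 cycles.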

_get_contiguous_traj_field(run_idx, traj_idx, field_path, frames=None)[source]¶

Access actual data for a trajectory field.

Parameters:

run_idx (int)
traj_idx (int)
field_path (str) – Trajectory field name to access
frames (list of int, optional) – The indices of the frames to return if you don’t want all of them.

Returns:

field_data – The data requested for the field.

Return type:

arraylike

_get_sparse_traj_field(run_idx, traj_idx, field_path, frames=None, masked=True)[source]¶

Access actual data for a trajectory field.

Parameters:

run_idx (int)
traj_idx (int)
field_path (str) – Trajectory field name to access
frames (list of int, optional) – The indices of the frames to return if you don’t want all of them.
masked (bool) – If True, returns the data as a numpy masked array covering all frames; if False, returns only the values that are available.

Returns:

field_data – The data requested for the field.

Return type:

arraylike
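
To illustrate what the masked option means for a sparse field, the sketch below builds a numpy masked array in which only the frames that actually have data are unmasked. The function name expand_sparse and its arguments are assumptions for illustration, and wepy's own implementation may differ.

```python
import numpy as np


def expand_sparse(data, sparse_idxs, n_frames, masked=True):
    """Hypothetical sketch: place sparse data into a full-length masked array.

    data: values for the frames that have data, shape (n_sparse, *feature_shape)
    sparse_idxs: the frame index of each row in data
    n_frames: the total number of frames in the trajectory
    """
    data = np.asarray(data)
    if not masked:
        # only the available values, with no placeholders for missing frames
        return data

    full_shape = (n_frames,) + data.shape[1:]
    full_data = np.zeros(full_shape, dtype=data.dtype)
    mask = np.ones(full_shape, dtype=bool)

    # fill in and unmask only the frames that have data
    full_data[sparse_idxs] = data
    mask[sparse_idxs] = False

    return np.ma.masked_array(full_data, mask=mask)


# Data exists only for frames 0 and 3 of a 5-frame trajectory
field = expand_sparse([[1.0], [2.0]], [0, 3], 5)
```

With masked=True the result keeps one row per frame, so it lines up with other fields of the same trajectory. With masked=False only the two stored rows come back.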

_add_run_field(run_idx, field_path, data, sparse_idxs=None, force=False)[source]¶

Add a trajectory field to all trajectories in a run.

By enforcing adding it to all trajectories at one time we promote in-run consistency.

Parameters:

run_idx (int)
field_path (str) – Name to set the trajectory field as. Can be compound.
data (arraylike of shape (n_trajectories, n_cycles, feature_vector_shape[0],...)) – The data for all trajectories to be added.
sparse_idxs (list of int) – If the data you are adding is sparse, specify which cycles it applies to.

force (bool) – If True, no checking for constraints will be done.

_add_field(field_path, data, sparse_idxs=None, force=False)[source]¶

Add a trajectory field to all runs in a file.

Parameters:

field_path (str) – Name of trajectory field
data (list of arraylike) – Each element of this list corresponds to a single run and is an arraylike of shape (n_trajectories, n_cycles, feature_vector_shape[0],…).
sparse_idxs (list of list of int) – The list of cycle indices to set for the sparse fields. If None, no trajectories are set as sparse.

property filename¶: The path to the underlying HDF5 file.

open(mode=None)[source]¶

Open the underlying HDF5 file for access.

Parameters:: mode (str) – Valid mode spec. Opens the HDF5 file in this mode if given; otherwise uses the existing mode.

close()[source]¶: Close the underlying HDF5 file.

property mode¶: The WepyHDF5 mode this object was created with.

set_mode(mode)[source]¶: Set the mode for opening the file with.

property h5_mode¶: The h5py.File mode the HDF5 file currently has.

_set_h5_mode(h5_mode)[source]¶

Set the mode to open the HDF5 file with.

This really shouldn’t be set without using the main wepy mode as they need to be aligned.

property h5¶: The underlying h5py.File object.

run(run_idx)[source]¶

Get the h5py.Group for a run.

Parameters:: run_idx (int)
Returns:: run_group
Return type:: h5py.Group

traj(run_idx, traj_idx)[source]¶

Get an h5py.Group trajectory group.

Parameters:

run_idx (int)
traj_idx (int)

Returns:

traj_group

Return type:

h5py.Group

run_trajs(run_idx)[source]¶

Get the trajectories group for a run.

Parameters:: run_idx (int)
Returns:: trajectories_grp
Return type:: h5py.Group

property runs¶: The runs group.

run_grp(run_idx)[source]¶: A group for a single run.

run_start_snapshot_hash(run_idx)[source]¶: Hash identifier for the starting snapshot of a run from orchestration.

run_end_snapshot_hash(run_idx)[source]¶: Hash identifier for the ending snapshot of a run from orchestration.

set_run_start_snapshot_hash(run_idx, snaphash)[source]¶: Set the starting snapshot hash identifier for a run from orchestration.

set_run_end_snapshot_hash(run_idx, snaphash)[source]¶: Set the ending snapshot hash identifier for a run from orchestration.

property settings_grp¶: The header settings group.

decision_grp(run_idx)[source]¶

Get the decision enumeration group for a run.

Parameters:: run_idx (int)
Returns:: decision_grp
Return type:: h5py.Group

init_walkers_grp(run_idx)[source]¶

Get the group for the initial walkers for a run.

Parameters:: run_idx (int)
Returns:: init_walkers_grp
Return type:: h5py.Group

records_grp(run_idx, run_record_key)[source]¶

Get a record group h5py.Group for a run.

Parameters:

run_idx (int)
run_record_key (str) – Name of the record group

Returns:

run_record_group

Return type:

h5py.Group

resampling_grp(run_idx)[source]¶