ulamdyn package

Comprehensive API documentation with detailed information on how to use the functions and classes of the ULaMDyn package. It is generated automatically from the source code and its comments.

Submodules

ulamdyn.data_loader module

class ulamdyn.data_loader.GetCoords

Bases: object

Class used to read the Cartesian coordinates from Newton-X MD trajectories.

Note

The Cartesian XYZ coordinates can be read either from dyn.out/dyn.xyz files generated by the classical NX or from the h5 file of new NX. Repeated geometries will be skipped.

It also provides a function to calculate the root-mean-squared deviation (RMSD) between each geometry read from the MD trajectories and a reference geometry. Before calculating the RMSD, the two geometries are aligned using the Kabsch algorithm (https://en.wikipedia.org/wiki/Kabsch_algorithm).

This class does not require arguments in its constructor. All the outputs generated by the class are given in angstroms.

trajectories (list): trajectories ID (TRAJXX) available in the working directory.

labels (numpy.ndarray): stores the sequence of atom labels.

eq_xyz (numpy.ndarray): stores the XYZ matrix of the reference geometry (geom.xyz).

xyz (numpy.ndarray): stores the XYZ matrices of all geometries read from the TRAJ directories.

rmsd (numpy.ndarray): vector of RMSD values between all geometries and the reference one.

dataset (pandas.DataFrame): stores a dataframe of the flattened XYZ matrices.

property align_geoms: None

Calculate the RMSD between the current and reference geometries.

Note

Before calculating the RMSD, the method uses the Kabsch algorithm to find the optimal alignment between the loaded molecular geometry for each time step t and the reference geometry.

The calculated RMSD values will be stored in the class attribute rmsd, while the xyz attribute will be updated with the aligned geometries.
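
The alignment step described above can be illustrated with a minimal NumPy sketch of the Kabsch algorithm. The function name kabsch_rmsd is hypothetical and this is not the ULaMDyn implementation; it only shows the technique for a single geometry, assuming both inputs have shape (n_atoms, 3) and the same atom ordering.

```python
import numpy as np


def kabsch_rmsd(geom, ref):
    """Align geom onto ref (hypothetical helper, not part of ULaMDyn).

    Returns the aligned, centered coordinates and the RMSD.
    """
    # Remove the translation by centering both geometries at the origin.
    p = geom - geom.mean(axis=0)
    q = ref - ref.mean(axis=0)
    # Covariance matrix and its singular value decomposition.
    h = p.T @ q
    u, _, vt = np.linalg.svd(h)
    # Guard against an improper rotation (reflection).
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    # Rotate the centered geometry (row vectors) onto the reference.
    aligned = p @ rot.T
    rmsd = np.sqrt(np.mean(np.sum((aligned - q) ** 2, axis=1)))
    return aligned, rmsd
```

Note that the sketch returns coordinates centered at the origin; whether the package keeps the original center of the reference geometry is not stated here.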

build_dataframe() None

Create a pandas DataFrame containing the XYZ coordinates from all trajectories.

After running this function, the class attribute dataset will be updated with the loaded DataFrame object.
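
As a rough illustration of what a "flattened" coordinate dataset looks like, the sketch below reshapes a tensor of shape (n_steps, n_atoms, 3) into one row per geometry. The function name flatten_xyz and the column naming scheme are assumptions made for this example; the actual column names used by the package may differ.

```python
import numpy as np
import pandas as pd


def flatten_xyz(xyz, labels):
    """Turn (n_steps, n_atoms, 3) coordinates into a 2D table (illustrative)."""
    n_steps, n_atoms, _ = xyz.shape
    # Hypothetical column names: "C1_x", "C1_y", "C1_z", "H2_x", ...
    columns = [
        f"{lab}{i + 1}_{axis}" for i, lab in enumerate(labels) for axis in "xyz"
    ]
    # Each row holds the 3 * n_atoms coordinates of one geometry.
    return pd.DataFrame(xyz.reshape(n_steps, n_atoms * 3), columns=columns)
```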

dataset
eq_xyz
from_dyn(outfile) Tuple[numpy.ndarray, numpy.ndarray]

Get XYZ coordinates from the dyn.out file of classical Newton-X.

Parameters

outfile (str) – name of the NX output file, dyn.out

Returns

tuple containing an array of strings defining the atom labels, and a tensor of size (n_steps, n_atoms, 3) with all XYZ coordinates of a single MD trajectory.

Return type

tuple in the form (numpy.ndarray, numpy.ndarray)

static from_h5(outfile) Tuple[numpy.ndarray, numpy.ndarray]

Read the XYZ coordinates from the .h5 file generated by the new Newton-X.

Parameters

outfile (str) – name of the NX output file, usually dyn.h5

Returns

tuple containing an array of strings defining the atom labels, and a tensor of shape (n_steps, n_atoms, 3) with all XYZ coordinates of a single MD trajectory.

Return type

tuple in the form (numpy.ndarray, numpy.ndarray)

static from_xyz(xyzfile) Tuple[numpy.ndarray, numpy.ndarray]

Get XYZ coordinates from the dyn.xyz file of classical Newton-X.

This method has the same parameters and return type as in from_dyn().
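
To make the return value concrete, the following sketch parses a generic multi-frame XYZ text into an array of labels and a tensor of shape (n_steps, n_atoms, 3). It assumes the file follows the common XYZ layout (atom count, comment line, one record per atom) and does not skip repeated geometries; parse_xyz_frames is a hypothetical name, not the parser used by ULaMDyn.

```python
import numpy as np


def parse_xyz_frames(text):
    """Parse multi-frame XYZ text (illustrative, assumes standard layout)."""
    lines = text.strip().splitlines()
    labels, frames = None, []
    pos = 0
    while pos < len(lines):
        n_atoms = int(lines[pos].split()[0])
        # Line pos + 1 is the comment line; atom records start at pos + 2.
        block = [ln.split() for ln in lines[pos + 2 : pos + 2 + n_atoms]]
        if labels is None:
            labels = np.array([rec[0] for rec in block])
        frames.append([[float(v) for v in rec[1:4]] for rec in block])
        pos += n_atoms + 2
    return labels, np.array(frames)
```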

labels
read_all_trajs() None

Concatenate the XYZ coordinates read from all available MD trajectories.

After running this method, the class attributes labels and xyz will be updated.

read_eq_geom() None

Read the XYZ coordinates of a reference geometry.

Note

This method should be executed before calculating the RMSD with the align_geoms() function.

A file with name geom.xyz must be provided in the working directory (TRAJECTORIES). After reading the coordinates, the method will update the class attributes labels and eq_xyz.

rmsd
property save_csv: None

Save all loaded geometries (raw format) into a csv file.

If the RMSD has been calculated, it will be included as an extra column in the XYZ coordinates data set.

The default name of the output file is all_coordinates.csv.

trajectories
xyz
class ulamdyn.data_loader.GetCouplings

Bases: object

Class used to read the Nonadiabatic Coupling Vectors (NAC) from the MD trajectories.

This class does not require arguments in its constructor. The outputs generated by the class are given in atomic units.

trajectories (list): trajectories ID (TRAJXX) available in the working directory.

all_nacs (dict): store the NAC matrices (per state) for all MD trajectories.

dataset (dict): dictionary of dataframes with the flattened NAC matrices.

all_nacs
build_dataframe(save_csv=False) None

Generate a dataset (pandas.DataFrame object) with all NACs.

The XYZ matrices with the nonadiabatic couplings obtained for each molecular geometry are flattened into one-dimensional vectors. In this way, every row of the dataset corresponds to one step of a given MD trajectory.

Parameters

save_csv (bool, optional) – if True, save all NAC dataframes in csv format, defaults to False. The default name of each output file is all_nacs_s[n].csv, where n is the number of the state.

datasets
static from_h5(outfile) dict

Read NACs from one trajectory generated by the new Newton-X.

Parameters

outfile (str) – name of the NX output file in the .h5 format, usually dyn.h5.

Returns

stacked nonadiabatic coupling vectors (NAC) for all steps of a single MD trajectory, provided as a dictionary of tensors (numpy.ndarray, one for each state) with shape (n_steps, n_atoms, 3).

Return type

dict

static from_nxlog(outfile) dict

Read NACs from one trajectory generated by the classical Newton-X.

read_all_trajs() None

Read NACs from all MD trajectories and store into a dictionary.

trajectories
class ulamdyn.data_loader.GetGradients

Bases: object

Class used to read the QM gradients from Newton-X MD trajectories.

This class does not require arguments in its constructor. The outputs generated by the class are given in eV/angstrom.

trajectories (list): trajectories ID (TRAJXX) available in the working directory.

all_grads (dict): store the gradient matrices (per state) for all MD trajectories.

dataset (dict): dictionary of dataframes with the flattened gradient matrices.

all_grads
build_dataframe(save_csv=False) None

Generate a dataset (pandas.DataFrame object) with all gradients.

The XYZ matrices with the gradients of each molecular geometry are flattened into one-dimensional vectors, such that every row of the dataset corresponds to one step of the MD trajectories.

Parameters

save_csv (bool, optional) – if True, save all gradient dataframes in csv format, defaults to False. The default name of each output file is all_gradients_s[n].csv, where n is the number of the state.

datasets
static from_h5(outfile) dict

Read XYZ gradients from a single trajectory generated by the new Newton-X.

Parameters

outfile (str) – name of the NX output file in the .h5 format, usually dyn.h5.

Returns

stacked XYZ gradients for all steps of a single MD trajectory, returned as a dictionary of tensors (numpy.ndarray, one for each state) with shape (n_steps, n_atoms, 3).

Return type

dict

static from_nxlog(outfile) dict

Read XYZ gradients from one trajectory generated by the old Newton-X.

This method has the same parameters and return type as in from_h5().

Parameters

outfile (str) – Output file from Newton-X MD trajectory, namely nx.log.

Returns

dictionary of numpy.ndarray with shape (n_steps, n_atoms, 3), where the keys correspond to the state’s id.

Return type

dict

read_all_trajs() None

Read gradients from all MD trajectories and store into a dictionary.

trajectories
class ulamdyn.data_loader.GetProperties

Bases: object

Class used to read all properties available in the Newton-X MD trajectories.

Note

This class does not require arguments in its constructor. All the energy quantities processed by the class are transformed from Ha to eV. For the other properties, the original units used in Newton-X are kept.

trajectories (list): trajectories ID (TRAJXX) available in the working directory.

dataset (pandas.DataFrame): stores a dataframe with all available properties.

num_states (int): keep track of the number of states considered in the MD simulation.

nx_version (str): identify the Newton-X version, can be either cs (classical series) or ns (new series).

dataset
energies()

Read / process the energy information from the en.dat (classical NX) or .h5 (new NX) file.

Returns

a processed dataset with the information of all trajectories stacked, and containing the following columns “TRAJ”, “time”, “State”, “Total_Energy” plus the energy gaps between the accessible states (e.g., DE12) and binary columns to identify the hopping points (e.g., Hops_S21).

Return type

pandas.DataFrame | modin.pandas.dataframe.DataFrame
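
The derived columns mentioned above (energy gaps and binary hopping flags) can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the gap is taken as upper-state minus lower-state energy, and a hop is flagged at the first step where the active state has changed. The package may use different conventions.

```python
import numpy as np

HARTREE_TO_EV = 27.211386  # approximate conversion factor


def energy_gaps_ev(energies_ha):
    """Gaps between all pairs of states; input shape (n_steps, n_states)."""
    e_ev = np.asarray(energies_ha) * HARTREE_TO_EV
    n_states = e_ev.shape[1]
    gaps = {}
    for i in range(n_states):
        for j in range(i + 1, n_states):
            # Hypothetical naming: "DE12" is E(state 2) - E(state 1).
            gaps[f"DE{i + 1}{j + 1}"] = e_ev[:, j] - e_ev[:, i]
    return gaps


def hop_flags(states, from_state, to_state):
    """Return 1 at steps where the active state changed from_state -> to_state."""
    states = np.asarray(states)
    flags = np.zeros(len(states), dtype=int)
    flags[1:] = (states[:-1] == from_state) & (states[1:] == to_state)
    return flags
```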

mcscf_coefs()

Collect the squared coefficients of the MCSCF wavefunction.

Note

This method works only for MD trajectories generated with Columbus.

Returns

a dataset with the three highest MCSCF coefficients for each electronic state, read from the NX output; if the class variable dataset has already been updated with some properties, the MCSCF coefficient data will be merged with the existing dataset.

Return type

pandas.DataFrame | modin.pandas.dataframe.DataFrame

nac_norm()

Calculate the norm of the nonadiabatic coupling matrices for each pair of states.

After running this function, the class variable dataset will be updated with the loaded NACs norm data.

Returns

a dataset with the Frobenius norm of the nonadiabatic coupling matrices of all NX trajectories calculated for each possible transition between states; the number of columns in the NACs norm dataset is given by n_states * (n_states - 1)/2.

Return type

pandas.DataFrame | modin.pandas.dataframe.DataFrame
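
For reference, the Frobenius norm of a stack of coupling matrices and the number of state pairs quoted above can be computed as in this minimal sketch. The function names are hypothetical and the input is assumed to be a tensor of shape (n_steps, n_atoms, 3) for one pair of states.

```python
import numpy as np


def nac_frobenius_norm(nac):
    """Frobenius norm of each (n_atoms, 3) matrix in a (n_steps, n_atoms, 3) stack."""
    # With a two-axis tuple and the default ord, NumPy returns the Frobenius norm.
    return np.linalg.norm(nac, axis=(1, 2))


def n_state_pairs(n_states):
    """Number of unique pairs of states, n_states * (n_states - 1) / 2."""
    return n_states * (n_states - 1) // 2
```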

num_states
nx_version
oscillator_strength()

Collect the oscillator strength from properties file.

Note

The oscillator strength (OSS) is not always available in the output. If needed, check the Newton-X documentation for the required keywords and methods. The OSS data can be read from either classical or new series NX trajectories.

Returns

a dataset with the oscillator strength information read from all available NX trajectories with one column for each transition between states; if the class variable dataset has been already updated with some properties, the oscillator strength data will be merged with the existing properties dataset.

Return type

pandas.DataFrame | modin.pandas.dataframe.DataFrame

populations()

Read and calculate the population for each accessible state.

Note

If the MD simulations were performed with the classical Newton-X, the populations will be calculated using the wavefunction coefficients. In the new Newton-X, the populations are already calculated and available in the .h5 file.

Returns

a dataset with the populations obtained from all available NX trajectories with one column per state; if the class variable dataset has been already updated with some properties, the population data will be merged with the existing dataset.

Return type

pandas.DataFrame | modin.pandas.dataframe.DataFrame

property save_csv: None

Save the dataset with all QM properties read from the Newton-X trajectories.

The following properties are included in the dataset:

  • trajectory index;

  • simulation time;

  • total energy;

  • energy gaps between states (eV);

  • oscillator strength (if available);

  • states population;

  • norm of the nonadiabatic coupling matrices (if available);

  • three highest MCSCF coefficients per state (only for NX/Columbus).

trajectories

ulamdyn.data_writer module

class ulamdyn.data_writer.Geometries(atom_labels, add_properties=[])

Bases: object

Handle and save XYZ coordinates for selected frames of MD trajectories.

save_xyz(geoms_array, properties_data=None, out_name='selected_geoms.xyz')

Save an XYZ file for a set of selected molecular geometries.

Note

The comment line of the XYZ file will contain a list of property values for each molecule.

Parameters
  • geoms_array (numpy.ndarray) – a 3D array containing the list of XYZ matrices.

  • properties_data (pandas.DataFrame) – dataframe containing the property values of the selected geometries, defaults to None.

  • out_name (str) – name of the XYZ file containing all the selected geometries, defaults to selected_geoms.xyz.
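
The idea of storing property values in the comment line of each XYZ frame can be sketched as below. The function format_xyz and the layout of the comment string are assumptions made for illustration only; the exact formatting written by save_xyz is not documented here.

```python
def format_xyz(atom_labels, geoms, comments=None):
    """Build multi-frame XYZ text; one optional comment string per frame."""
    blocks = []
    for k, geom in enumerate(geoms):
        # The second line of every XYZ frame is a free-form comment line.
        comment = comments[k] if comments is not None else ""
        rows = [
            f"{lab:<3s}{x:15.8f}{y:15.8f}{z:15.8f}"
            for lab, (x, y, z) in zip(atom_labels, geom)
        ]
        blocks.append("\n".join([str(len(atom_labels)), comment, *rows]))
    return "\n".join(blocks) + "\n"
```

A comment such as "time=0.0 DE21=3.2" (hypothetical property names) would then travel with the geometry and remain readable by standard molecular viewers.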

ulamdyn.descriptors module

Classes and methods used to build ML descriptors from molecular geometries.

class ulamdyn.descriptors.R2

Bases: ulamdyn.data_loader.GetCoords

Class used to convert the XYZ coordinates of molecular geometries into R2-type descriptors.

The R2 descriptor is defined as the (flattened) matrix of all pairwise Euclidean distances between all atoms in the molecule. Since the matrix is symmetric with respect to the interchange of atom indices (i.e., Dij = Dji), only the lower triangular portion of the R2 matrix is written to the final data set.

This class also provides a method to compute other molecular descriptors derived from the R2 distance matrix. They are:

  • inverse R2 -> defined as \(1/R_{ij}\), similarly to the Coulomb matrix descriptor.

  • delta R2 -> difference between the R2 vector of the current geometry in time t and the equivalent R2 vector of a reference geometry (typically the ground-state geometry), \(R_{ij}(t) - R_{ij}(ref)\).

  • RE -> R2 vector normalized relative to equilibrium geometry, \(R_{ij}(eq)/R_{ij}(t)\). For more details, check the reference J. Chem. Phys. 146, 244108 (2017).
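
The three variants listed above follow directly from their formulas. The sketch below applies them to precomputed distance vectors; the function name and the dictionary keys are hypothetical and are not the values accepted by the variant argument of build_descriptor.

```python
import numpy as np


def r2_variants(r_t, r_ref):
    """Compute the R2-derived descriptors from two distance vectors."""
    r_t = np.asarray(r_t, dtype=float)
    r_ref = np.asarray(r_ref, dtype=float)
    return {
        "inverse": 1.0 / r_t,    # 1 / R_ij
        "delta": r_t - r_ref,    # R_ij(t) - R_ij(ref)
        "RE": r_ref / r_t,       # R_ij(eq) / R_ij(t)
    }
```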

build_descriptor(all_geoms: numpy.ndarray, variant=None, save_csv=False)

Generate a dataframe with R2-based descriptors for all molecular geometries.

Parameters
  • all_geoms (numpy.ndarray) – tensor of shape (nsamples, natoms, 3) containing the stacked XYZ coordinates read from all available MD trajectories.

  • variant (str, optional) – type of R2 descriptor, defaults to None.

  • save_csv (bool, optional) – if True export the data set with all calculated descriptors in a csv format with name pairwise_distances.csv, defaults to False.

Returns

a dataframe object of shape (nsamples, natoms * (natoms - 1)/2), where each row is a vector with the R2-based descriptor computed for a given molecular geometry.

Return type

pandas.DataFrame | modin.pandas.dataframe.DataFrame

r2_descriptor
r2_ref_geom
static xyz_to_distances(xyz_matrix: numpy.ndarray) numpy.ndarray

Calculate the pairwise distance matrix for a given XYZ geometry.

Parameters

xyz_matrix (numpy.ndarray) – Cartesian coordinates of a molecular structure given as a matrix of shape (n_atoms, 3).

Returns

vector of size \(n_{atoms} (n_{atoms} - 1)/2\) containing the lower triangular portion of the R2 matrix.

Return type

numpy.ndarray
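
A minimal NumPy sketch of this operation is shown below. The name pairwise_distances_lower is hypothetical, and the ordering of the returned elements (row by row over the strict lower triangle) is an assumption; the package may order the pairs differently.

```python
import numpy as np


def pairwise_distances_lower(xyz):
    """Return the strict lower triangle of the distance matrix as a vector."""
    xyz = np.asarray(xyz, dtype=float)
    # Broadcasting builds all coordinate differences, shape (n, n, 3).
    diff = xyz[:, None, :] - xyz[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # k=-1 excludes the zero diagonal, leaving n * (n - 1) / 2 elements.
    rows, cols = np.tril_indices(len(xyz), k=-1)
    return dist[rows, cols]
```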

class ulamdyn.descriptors.ZMatrix

Bases: ulamdyn.data_loader.GetCoords

Class used to generate molecular descriptors using internal coordinates (Z-Matrix).

This class does not require arguments in its constructor. All quantities related to distances are given in angstrom, while the features derived from angles are provided in degrees.

In addition to the standard Z-Matrix, the class also provides a method to compute other variants of the Z-Matrix molecular descriptors:

  • delta Z-Matrix -> difference between the Z-Matrix representation of the current geometry in time t and the Z-Matrix of a reference geometry.

  • tanh Z-Matrix -> hyperbolic tangent transformation on all features of delta Z-Matrix.

  • sig Z-Matrix -> sigmoid transformation on all features of delta Z-Matrix.
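
The variants above amount to element-wise transformations of the difference vector, as in this sketch. It assumes the difference is taken as current minus reference and ignores the periodicity of angular features; the function name and dictionary keys are hypothetical.

```python
import numpy as np


def zmatrix_variants(z_t, z_ref):
    """Element-wise delta, tanh and sigmoid transforms of a Z-Matrix vector."""
    delta = np.asarray(z_t, dtype=float) - np.asarray(z_ref, dtype=float)
    return {
        "delta": delta,
        "tanh": np.tanh(delta),
        "sig": 1.0 / (1.0 + np.exp(-delta)),
    }
```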

distancematrix (numpy.ndarray): stores the full matrix of bond distances for all geometries.

connectivity (list): indices of connected atoms based on a distance criterion of proximity.

angleconnectivity (list): indices of three neighboring atoms to compute angles.

dihedralconnectivity (list): four indices of neighboring atoms to calculate dihedrals.

zmat_ref_geom (numpy.ndarray): stores the Z-matrix calculated for the reference geometry.

angleconnectivity
build_descriptor(all_geoms: numpy.ndarray, delta=False, apply_to_delta=None, save_csv=False)

Construct the standard Z-Matrix descriptor and other variants.

Note

By default, the algorithm will calculate the three main components of the Z-Matrix: bond distances, angles and dihedrals. An augmented version of the Z-Matrix descriptor can be also obtained by calculating additional distances, angles, dihedrals and/or bending angles using the methods provided in the class.

Parameters
  • all_geoms (numpy.ndarray) – tensor of shape (nsamples, natoms, 3) containing the stacked XYZ coordinates read from all available MD trajectories.

  • delta (bool, optional) – if True, the Z-Matrix feature vector of each geometry will be subtracted from the Z-Matrix of the reference geometry, defaults to False.

  • apply_to_delta (str, optional) – select a nonlinear function to apply as a transformation (sigmoid or hyperbolic tangent) on the delta Z-Matrix, defaults to None.

  • save_csv (bool, optional) – if True, save a single csv file named all_geoms_zmatrix.csv containing the Z-Matrix descriptors computed for all geometries available in the MD trajectories, defaults to False.

Returns

a dataframe object with the (flattened) Z-Matrix descriptors stacked for all MD geometries.

Return type

pandas.DataFrame

connectivity
dihedralconnectivity
distancematrix
static get_angle(geom: numpy.ndarray, idx_atoms: list) float

Calculate the angle formed by three selected atoms.

Parameters
  • geom (numpy.ndarray) – matrix of shape (natoms, 3) storing the XYZ coordinates of a single molecule.

  • idx_atoms (list) – a list of three indices corresponding to the atoms for which the angle will be calculated.

Returns

angle (in degrees) between three selected atoms.

Return type

numpy.float
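
A possible way to compute such an angle is sketched below. It assumes that the second index in idx_atoms is the vertex (central atom) of the angle, which is not confirmed by the documentation; angle_deg is a hypothetical name.

```python
import numpy as np


def angle_deg(geom, idx_atoms):
    """Angle i-j-k in degrees, with atom j assumed to be the vertex."""
    geom = np.asarray(geom, dtype=float)
    i, j, k = idx_atoms
    v1 = geom[i] - geom[j]
    v2 = geom[k] - geom[j]
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Clip to protect arccos against round-off slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))
```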

static get_bending(geom: numpy.ndarray, idx_atoms: list) float

Calculate the bending angle between two planes of the molecule.

This method is particularly useful to describe large out-of-plane distortions in the molecular structure that involve more than four atoms. The bending angle is calculated by first defining two vectors, each perpendicular to one of the two molecular planes formed by two sets of three atoms. Then, the angle between the two vectors is obtained from the inverse cosine of their scalar product.

Note

By default, the bending angle is not used to construct the Z-Matrix descriptor. It can be used to construct an augmented version of the Z-Matrix that better captures changes in the molecular structure during the dynamics.

Parameters
  • geom (numpy.ndarray) – matrix of shape (natoms, 3) storing the XYZ coordinates of a single molecule.

  • idx_atoms (list) – a list of lists with three atom indices in each, used to define two molecular planes for which the angle will be calculated.

Returns

bending angle (in degrees) defined by six specified atoms.

Return type

numpy.float
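
The procedure described for the bending angle (two plane normals, then the inverse cosine of their scalar product) can be sketched as follows. The function name is hypothetical, and the orientation of each normal depends on the order of the three indices, so the result may be the supplement of the angle returned by the package.

```python
import numpy as np


def bending_deg(geom, idx_atoms):
    """Angle in degrees between the normals of two three-atom planes."""
    geom = np.asarray(geom, dtype=float)
    normals = []
    for a, b, c in idx_atoms:
        # The cross product of two in-plane vectors is normal to the plane.
        n = np.cross(geom[b] - geom[a], geom[c] - geom[a])
        normals.append(n / np.linalg.norm(n))
    cos_phi = np.clip(np.dot(normals[0], normals[1]), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_phi)))
```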

static get_dihedral(geom: numpy.ndarray, idx_atoms: list) float

Calculate the dihedral angle formed by four selected atoms.

Parameters
  • geom (numpy.ndarray) – matrix of shape (natoms, 3) storing the XYZ coordinates of a single molecule.

  • idx_atoms (list) – a list of four indices to select the atoms for which the dihedral angle will be calculated.

Returns

dihedral angle (in degrees) formed by four specified atoms.

Return type

numpy.float
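
One common way to evaluate a dihedral angle is to project the two outer bonds onto the plane perpendicular to the central bond, as in the sketch below. The sign convention and the range (-180, 180] are assumptions of this example and may differ from the package; dihedral_deg is a hypothetical name.

```python
import numpy as np


def dihedral_deg(geom, idx_atoms):
    """Signed dihedral angle in degrees for four atoms p0-p1-p2-p3."""
    geom = np.asarray(geom, dtype=float)
    p0, p1, p2, p3 = (geom[i] for i in idx_atoms)
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    # Unit vector along the central bond.
    b1 = b1 / np.linalg.norm(b1)
    # Components of the outer bonds perpendicular to the central bond.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.degrees(np.arctan2(y, x)))
```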

static get_distance(geom: numpy.ndarray, idx_atoms: list) float

Calculate the Euclidean distance between a pair of atoms.

Parameters
  • geom (numpy.ndarray) – matrix of shape (natoms, 3) storing the XYZ coordinates of a single molecule.

  • idx_atoms (list) – a pair of indices corresponding to the atoms for which the distance will be calculated.

Returns

Euclidean distance (in Angstrom) between two selected atoms.

Return type

numpy.float

zmat_ref_geom

ulamdyn.interface module

ulamdyn.interface.build_descriptor(args, getcoords_obj)
ulamdyn.interface.get_kinetic_energies(n_atoms=None)
ulamdyn.interface.get_properties_data(rmsd_vec=None)
ulamdyn.interface.run_bootstrap(args)
ulamdyn.interface.run_clustering(args)
ulamdyn.interface.run_dim_reduction(args)
ulamdyn.interface.save_data(data_to_save)
ulamdyn.interface.save_xyz_hoppings(states_pair)

ulamdyn.kinetics module

class ulamdyn.kinetics.GetVelocities(n_atoms=None)

Bases: object

Class used to collect velocities from all MD trajectories of Newton-X.

classmethod from_all_trajs(n_atoms=None)

Wrapper function to collect the MD velocities without instantiating the class.

Parameters

n_atoms (int, optional) – define the total number of atoms to consider in the velocities dataset, defaults to None. If None, the value will be taken from the geom.xyz file.

Returns

the full velocities data set given as a tensor of shape (n_steps, n_atoms, 3).

Return type

numpy.ndarray

from_dyn(outfile: str) numpy.ndarray

Read velocities from a single trajectory of Newton-X.

Note

The final velocities dataset will contain information only for the first set of n_atoms, as defined by the class attribute n_atoms.

Parameters

outfile (str) – name of the NX output file containing the velocities, dyn.out.

Returns

tensor of shape (n_steps, n_atoms, 3) with all velocities data.

Return type

numpy.ndarray

from_h5(outfile: str) numpy.ndarray

Read velocities from the .h5 file generated by the new Newton-X.

Note

The full velocities dataset will be sliced to select only the first set of n_atoms, as defined by the class attribute n_atoms.

Parameters

outfile (str) – name of the NX output file, usually dyn.h5.

Returns

tensor of shape (n_steps, n_atoms, 3) with all velocities data.

Return type

numpy.ndarray

n_atoms
read_all_trajs()

Concatenate the XYZ velocities read from all available MD trajectories.

After running this method, the class attribute veloc will be updated with the full dataset of velocities read from the Newton-X output files.

trajectories
veloc
class ulamdyn.kinetics.KineticEnergy(n_atoms=None)

Bases: object

Class used to calculate the atomic or molecular components of the MD kinetic energy.

atom_labels
atom_mass
build_dataframe(discretization_level='atom', n_mols_per_type=None, n_atoms_per_mol=None, save_csv=False)

Create a dataset containing the kinetic energies split by atoms or molecules.

Parameters
  • discretization_level (str, optional) – how to split the kinetic energy, defaults to “atom”

  • n_mols_per_type (int, optional) – number of molecules per subsystem, typically one solute and several solvent molecules as in QM/MM, defaults to None.

  • n_atoms_per_mol (int, optional) – number of atoms in each molecular subsystem, defaults to None.

  • save_csv (bool, optional) – export the dataset with the computed kinetic energies as a csv file, defaults to False

Returns

a structured dataset with the values of kinetic energies computed for all trajectories, where each row corresponds to one time step of a given trajectory and the columns refer to the discretization level (atoms or molecules).

Return type

pandas.DataFrame | modin.pandas.dataframe.DataFrame

calc_per_atom() numpy.ndarray

Calculate the kinetic energy of each atom for all MD trajectories available.

Returns

matrix with the kinetic energy of the selected atoms arranged in columns.

Return type

numpy.ndarray
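
The atomic kinetic energy follows the usual expression \(E_k = \frac{1}{2} m v^2\). The sketch below evaluates it for a velocity tensor of shape (n_steps, n_atoms, 3); the function name is hypothetical and no unit conversion is applied, so the result is in whatever units the masses and velocities imply.

```python
import numpy as np


def kinetic_energy_per_atom(veloc, masses):
    """Return a (n_steps, n_atoms) matrix with 0.5 * m * |v|^2 per atom."""
    veloc = np.asarray(veloc, dtype=float)
    masses = np.asarray(masses, dtype=float)
    # Squared speed of each atom at each step.
    speed_sq = np.sum(veloc ** 2, axis=-1)
    return 0.5 * masses[None, :] * speed_sq
```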

calc_per_molecule(n_mols_per_type: list, n_atoms_per_mol: list) numpy.ndarray

Sum the atomic contributions of kinetic energy for each molecule.

Note

In the current version, this function only supports two different types of molecules in the geom file, where the first set of atoms in the geom file should always correspond to the solute. The function will be generalized in future versions to support any number of different molecules.

Parameters
  • n_mols_per_type (list) – give the number of molecules in each one of the two subsystems, typically one solute and several solvent molecules as in QM/MM.

  • n_atoms_per_mol (list) – how many atoms exist in each molecular subsystem.

Returns

a matrix of shape (n_samples, n_mols_1 + n_mols_2) with the kinetic energy of each molecule in the system arranged in columns.

Return type

numpy.ndarray
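
Summing atomic contributions per molecule can be sketched with a segmented sum, assuming the atoms are ordered molecule by molecule with the solute first, as stated in the note above. The function name is hypothetical and the example is not limited to two molecule types.

```python
import numpy as np


def kinetic_energy_per_molecule(ekin_atoms, n_mols_per_type, n_atoms_per_mol):
    """Sum a (n_samples, n_atoms) matrix over contiguous blocks of atoms."""
    ekin_atoms = np.asarray(ekin_atoms, dtype=float)
    # Number of atoms in every molecule, e.g. [3, 2, 2] for 1 solute + 2 solvents.
    sizes = np.repeat(n_atoms_per_mol, n_mols_per_type)
    # Index of the first atom of each molecule.
    starts = np.concatenate(([0], np.cumsum(sizes)[:-1]))
    # Drop any atoms beyond the described molecules before the segmented sum.
    return np.add.reduceat(ekin_atoms[:, : sizes.sum()], starts, axis=1)
```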

energies
n_atoms
trajectories
class ulamdyn.kinetics.VibrationalSpectra(n_atoms=None)

Bases: ulamdyn.kinetics.GetVelocities

Calculate the vibrational density of states for each MD trajectory.

build_dataframe(save_csv=False)

Create a dataset with information of vibrational spectra per MD trajectory.

Parameters

save_csv (bool, optional) – if True, export the dataset with the computed spectra to a csv file, defaults to False.

Returns

two-column dataframe storing the vibrational frequencies (cm^-1) in the first column and the density of states (a.u.) in the second one.

Return type

pandas.DataFrame | modin.pandas.dataframe.DataFrame

calc_all_trajs(mass_weighted=True)

Calculate the vibrational spectra for each MD trajectory.

static calc_pdos(Vel, dt, mass=None)

Calculate the power spectrum of the velocity auto-correlation function.

Parameters
  • Vel (numpy.ndarray) – tensor of shape (n_steps, n_atoms, 3) with atomic velocities collected from a single MD trajectory.

  • dt (numpy.ndarray) – time step used to integrate the classical equations.

  • mass – if provided, rescale the velocities by the atomic mass of each species; the default is None.

Returns

two-column array containing the calculated frequencies (cm^-1) in the first column and the density of states (a.u.) in the second one.

Return type

numpy.ndarray
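
By the Wiener-Khinchin theorem, the power spectrum of the velocity auto-correlation function equals the squared magnitude of the Fourier transform of the velocities. The sketch below uses that shortcut; it assumes the time step is given in femtoseconds, applies mass weighting by scaling velocities with the square root of the mass, and uses no windowing or normalization. The function name is hypothetical and the package may differ in all of these choices.

```python
import numpy as np

SPEED_OF_LIGHT_CM_PER_FS = 2.99792458e-5  # speed of light in cm/fs


def vdos_from_velocities(vel, dt_fs, mass=None):
    """Return a two-column array: frequency (cm^-1) and unnormalized DOS."""
    vel = np.asarray(vel, dtype=float)
    n_steps = vel.shape[0]
    if mass is not None:
        # Mass weighting: sqrt(m) * v gives m * v.v in the power spectrum.
        vel = vel * np.sqrt(np.asarray(mass, dtype=float))[None, :, None]
    # Power spectrum of each Cartesian component of each atom.
    spectrum = np.abs(np.fft.rfft(vel, axis=0)) ** 2
    dos = spectrum.sum(axis=(1, 2))
    # Frequencies in 1/fs converted to wavenumbers.
    freq_cm = np.fft.rfftfreq(n_steps, d=dt_fs) / SPEED_OF_LIGHT_CM_PER_FS
    return np.column_stack((freq_cm, dos))
```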

n_atoms
trajectories
veloc

ulamdyn.run_analysis module

ulamdyn.run_analysis.main()

ulamdyn.statistics module

Module to perform statistical analysis of the MD datasets.

ulamdyn.statistics.aggregate_data(data, vars_to_group=['time'])

Calculate the basic statistical descriptors for a given dataset.

The statistical quantities calculated by the function are mean and median to describe the central tendency, and standard deviation to measure the variability or dispersion of the data. So each feature of the original input dataset will be unfolded into three new columns identified with the suffixes ‘_median’, ‘_mean’ and ‘_std’.

Parameters
  • data (pandas.DataFrame) – input dataset containing the information extracted from all MD trajectories.

  • vars_to_group (list, optional) – set of variables used by the function to group the data, defaults to [“time”]

Returns

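a dataset grouped by the variables in vars_to_group, in which each feature of the input is unfolded into three columns with the suffixes ‘_median’, ‘_mean’ and ‘_std’.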

Return type

pandas.DataFrame | modin.pandas.dataframe.DataFrame
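
The aggregation described above maps naturally onto a pandas group-by. The sketch below is an assumption about how such a summary could be built, not the package code; aggregate_by is a hypothetical name, and every non-grouping column of the input is aggregated.

```python
import pandas as pd


def aggregate_by(data, vars_to_group=("time",)):
    """Median, mean and standard deviation of every feature per group."""
    grouped = data.groupby(list(vars_to_group)).agg(["median", "mean", "std"])
    # Flatten the (feature, statistic) column index into "feature_statistic".
    grouped.columns = [f"{feat}_{stat}" for feat, stat in grouped.columns]
    return grouped.reset_index()
```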

ulamdyn.statistics.bootstrap(dataframe, n_samples=None, n_repeats=1000, save_csv=False)

Estimate the mean by resampling the data with replacement.

Parameters
  • dataframe (pandas.DataFrame | modin.pandas.dataframe.DataFrame) – input dataset having the quantities extracted from all available MD trajectories; the dataset must contain the ‘TRAJ’ and ‘time’ columns

  • n_samples (int, optional) – number of trajectories considered in each round of the resampling, defaults to None

  • n_repeats (int, optional) – number of resampling to be performed, defaults to 1000

  • save_csv (bool, optional) – if True, write the final bootstrapped data to a csv file, defaults to False

Returns

dataset containing all the bootstrapped data with estimated mean

Return type

pandas.DataFrame | modin.pandas.dataframe.DataFrame
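
Resampling with replacement at the level of whole trajectories can be sketched as follows. This is an assumption about the general technique, not the package implementation: the function name, the seed argument and the "repeat" column are hypothetical, while the ‘TRAJ’ and ‘time’ columns are the ones required above.

```python
import numpy as np
import pandas as pd


def bootstrap_mean(dataframe, n_samples=None, n_repeats=1000, seed=0):
    """Time-resolved mean of each feature for every resampling round."""
    rng = np.random.default_rng(seed)
    traj_ids = dataframe["TRAJ"].unique()
    n_samples = len(traj_ids) if n_samples is None else n_samples
    # Split once per trajectory so that each draw is a cheap lookup.
    by_traj = {tid: grp.drop(columns="TRAJ") for tid, grp in dataframe.groupby("TRAJ")}
    rounds = []
    for rep in range(n_repeats):
        # Draw trajectories with replacement; duplicates are expected.
        picked = rng.choice(traj_ids, size=n_samples, replace=True)
        sample = pd.concat([by_traj[tid] for tid in picked])
        means = sample.groupby("time").mean().reset_index()
        means.insert(0, "repeat", rep)
        rounds.append(means)
    return pd.concat(rounds, ignore_index=True)
```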

ulamdyn.statistics.calc_avg_occupations(df)

Calculate the fraction of trajectories in each state as a function of time.

Note

The fraction of trajectories (occupation) is an important quantity to assess the quality of the surface hopping (SH) simulations. It should be compared to the population of each state averaged over all trajectories, which is provided by the method aggregate_data(). If the ensemble of SH trajectories is statistically converged, the occupation and the average population should match.

Parameters

df (pandas.DataFrame | modin.pandas.dataframe.DataFrame) – properties dataset with information collected from the available MD trajectories

Returns

a new dataset with the occupations calculated for each state

Return type

pandas.DataFrame | modin.pandas.dataframe.DataFrame
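
The occupation is simply the number of trajectories in a given state divided by the number of trajectories present at that time. A minimal sketch, assuming the ‘time’ and ‘State’ columns of the properties dataset and using hypothetical output column names:

```python
import pandas as pd


def average_occupations(df):
    """Fraction of trajectories in each state at every time step."""
    # Count trajectories per (time, state) pair.
    counts = pd.crosstab(df["time"], df["State"])
    # Normalize each time step by the number of trajectories still running.
    occ = counts.div(counts.sum(axis=1), axis=0)
    occ.columns = [f"Occ_S{state}" for state in occ.columns]
    return occ.reset_index()
```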

ulamdyn.statistics.create_bootstrap_stats(boot_data, ci_level=95)

Generate the descriptive statistics for the bootstrapped data within a given confidence interval.

Parameters
  • boot_data (pandas.DataFrame | modin.pandas.dataframe.DataFrame) – bootstrapped dataframe generated by the bootstrap() function

  • ci_level (int, optional) – Size of the confidence interval to draw when aggregating the data (by time) to estimate the statistical descriptors (median, mean, std), defaults to 95

Returns

dataframe grouped by the simulation time with each feature of the original dataset unfolded into three columns representing the statistical descriptors and other two columns storing the values for the lowest and highest variability as defined by the confidence interval

Return type

pandas.DataFrame | modin.pandas.dataframe.DataFrame
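
How the interval bounds are obtained is not specified here. One common choice is the percentile interval, sketched below for a single feature; the function name and output column suffixes are hypothetical, and the package may compute the bounds differently.

```python
import pandas as pd


def bootstrap_interval(boot_data, feature, ci_level=95):
    """Percentile confidence interval of one feature, grouped by time."""
    # Probability mass left in each tail, e.g. 0.025 for a 95% interval.
    tail = (100 - ci_level) / 200.0
    grouped = boot_data.groupby("time")[feature]
    return pd.DataFrame(
        {
            f"{feature}_mean": grouped.mean(),
            f"{feature}_lower": grouped.quantile(tail),
            f"{feature}_upper": grouped.quantile(1 - tail),
        }
    ).reset_index()
```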

ulamdyn.statistics.create_stats(selected_data, save_csv=False)

Create datasets with the statistical summary of the requested quantities.

Parameters
  • selected_data (str) – type of data used as input to compute the descriptive statistical properties using the aggregate_data() function

  • save_csv (bool, optional) – if True, export the calculated statistics for each dataframe in csv format, defaults to False

Returns

one or several datasets containing the median, mean and standard deviation calculated for each feature of the input as a function of time; if selected_data is equal to ‘all’, the statistics will be calculated for the properties and descriptors (R2 and Z-Matrix) datasets.

Return type

dict(pandas.DataFrame) | dict(modin.pandas.dataframe.DataFrame)

ulamdyn.statistics.stats_hopping(dataframe)

Generate a statistical summary for the hopping points.

Parameters

dataframe (pandas.DataFrame | modin.pandas.dataframe.DataFrame) – properties dataset with information collected from the available MD trajectories

Returns

a new dataset containing the total number of hops per trajectory and the minimum and maximum time in which the hops between each pair of electronic states occur.

Return type

pandas.DataFrame | modin.pandas.dataframe.DataFrame

ulamdyn.unsup_models module

Base classes and methods used to perform unsupervised learning analysis.

class ulamdyn.unsup_models.Clustering(data, n_samples=None, scaler=None, random_state=42, n_cpus=-1, verbosity=0)

Bases: ulamdyn.unsup_models.Utils

Class used to find groups of similar geometries in the MD trajectories data.

hierarchical(n_clusters=5, affinity='cosine', connectivity=None, linkage='single', distance_threshold=None, save_model=True)

Perform a hierarchical cluster analysis based on the agglomerative approach.

Parameters
  • n_clusters (int, optional) – The number of clusters to find. It must be None if distance_threshold is not None, defaults to 5.

  • affinity (str, optional) – Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”. If linkage is “ward”, only “euclidean” is accepted. If “precomputed”, a distance matrix (instead of a similarity matrix) is needed as input for the fit method, defaults to “cosine”.

  • connectivity (array-like or callable, optional) – Connectivity matrix. Defines for each sample the neighboring samples following a given structure of the data. This can be a connectivity matrix itself or a callable that transforms the data into a connectivity matrix, such as derived from kneighbors_graph. Default is None, i.e, the hierarchical clustering algorithm is unstructured.

  • linkage (str, optional) – Define the linkage criterion to build the tree. It determines which distance to use between sets of observation. The algorithm will merge the pairs of cluster that minimize this criterion. The options are : + ‘ward’ -> minimizes the variance of the clusters being merged. + ‘average’ -> uses the average of the distances of each observation of the two sets. + ‘complete’ or ‘maximum’ -> uses the maximum distances between all observations of the two sets. + ‘single’ -> uses the minimum of the distances between all observations of the two sets. The default is “single”.

  • distance_threshold (float, optional) – The linkage distance threshold above which clusters will not be merged. If not None, n_clusters must be None and compute_full_tree must be True, defaults to None.

  • save_model (bool, optional) – Store the trained parameters of the model in a binary file, defaults to True.

Returns

Dataframe of shape (n_samples,) with cluster labels for each data point.

Return type

pandas.DataFrame | modin.pandas.dataframe.DataFrame
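
The two ways of stopping the merge process described above (a fixed n_clusters, or a distance_threshold) can be illustrated with a minimal sketch. It uses SciPy's hierarchy routines on synthetic data and is not ULaMDyn's own implementation. The toy data and the variable names are assumptions made for the illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(42)

# Toy data (assumption): two groups of vectors pointing in clearly
# different directions, so that the cosine distance separates them.
group_a = np.array([1.0, 0.0, 0.0]) + 0.05 * rng.normal(size=(20, 3))
group_b = np.array([0.0, 1.0, 0.0]) + 0.05 * rng.normal(size=(20, 3))
X = np.vstack([group_a, group_b])

# Build the merge tree with single linkage on cosine distances.
tree = linkage(X, method="single", metric="cosine")

# Option 1: cut the tree to obtain a fixed number of clusters.
labels_fixed = fcluster(tree, t=2, criterion="maxclust")

# Option 2: cut the tree at a linkage distance threshold instead.
labels_threshold = fcluster(tree, t=0.5, criterion="distance")
```

Both cuts operate on the same tree. Only the stopping rule differs, which is why the two parameters are mutually exclusive.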

kmeans(n_clusters=5, init='k-means++', n_init=500, max_iter=1000, convergence=1e-06, save_model=True)

Perform K-Means clustering.

Parameters
  • n_clusters (int, list or str, optional) – The number of clusters to form, which also corresponds to the number of cluster centroids to generate, defaults to 5. If a list is passed, the k-means algorithm will be run for all n_clusters in the list, whereas if the argument is equal to ‘best’, consecutive runs will be performed with n_clusters varying in the range of [2, 15]. In both cases, the final result will be the output labels that perform best with respect to the silhouette and Calinski-Harabasz scores.

  • init (str or array, optional) – Method for initialization:

    + ‘k-means++’ -> selects initial cluster centers for k-means clustering in a smart way to speed up convergence.
    + ‘random’ -> chooses n_clusters observations (rows) at random from the data for the initial centroids.
    + If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

    Defaults to “k-means++”.

  • n_init (int, optional) – Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of the loss function, defaults to 500.

  • max_iter (int, optional) – Maximum number of iterations of the k-means algorithm for a single run, defaults to 1000.

  • convergence (float, optional) – Relative tolerance with regard to the Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence, defaults to 1e-06.

  • save_model (bool, optional) – Store the trained parameters of the model in a binary file, defaults to True.

Returns

Dataframe of shape (n_samples,) with cluster labels for each data point.

Return type

pandas.DataFrame | modin.pandas.dataframe.DataFrame
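
The model selection described for n_clusters (a list of candidates scored with the silhouette and Calinski-Harabasz indices) can be sketched directly with scikit-learn. This is an illustration under stated assumptions, not ULaMDyn's internal code: the mapping of convergence to scikit-learn's tol, the reduced n_init, and ranking by silhouette score alone are all assumptions, since the source does not say how the two scores are combined.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

rng = np.random.default_rng(42)

# Toy data (assumption): three well-separated blobs in two dimensions.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(50, 2)) for c in centers])

scores = {}
for k in [2, 3, 4, 5]:
    # tol is assumed to play the role of the "convergence" argument.
    model = KMeans(n_clusters=k, init="k-means++", n_init=10,
                   max_iter=1000, tol=1e-6, random_state=42)
    labels = model.fit_predict(X)
    scores[k] = (silhouette_score(X, labels),
                 calinski_harabasz_score(X, labels))

# Assumption: rank the candidates by silhouette score only.
best_k = max(scores, key=lambda k: scores[k][0])
```

For both indices a higher value indicates more compact, better separated clusters, so either can be maximized over the candidate list.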

spectral(n_clusters=5, n_components=10, n_init=100, affinity='rbf', gamma=0.01, n_neighbors=20, degree=3, coef0=1, kernel_params=None, save_model=True)

Apply clustering to a projection of the normalized Laplacian.

Note that Spectral Clustering is a highly expensive method due to the computation of the affinity matrix. Hence, this method is recommended only for small to medium size datasets (n_samples < 10000).

Note

This method is equivalent to kernel k-means (https://dl.acm.org/doi/10.1145/1014052.1014118). Spectral clustering is recommended for non-linearly separable dataset, where the individual clusters have a highly non-convex shape.

Parameters
  • n_clusters (int, optional) –

    The number of clusters to form which in this case corresponds to the dimension of the projection subspace. The default is 5.

    If a list is passed, the k-means algorithm will be run for all n_clusters in the list, whereas if the argument is equal to ‘best’, consecutive runs will be performed with n_clusters varying in the range of [2, 15]. In both cases, the final results will be the best output labels with respect to the clustering performance on the silhouette and Calinski-Harabasz scores.

  • n_components (int, optional) – Number of eigenvectors to use for the spectral embedding, defaults to 10

  • n_init (int, optional) – Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia. Only used if assign_labels=’kmeans’. The default is 100.

  • affinity (str or callable, optional) – Method used to construct the affinity matrix. The available options are:

    + ‘nearest_neighbors’: construct the affinity matrix by computing a graph of nearest neighbors.
    + ‘rbf’: construct the affinity matrix using a radial basis function (RBF) kernel.
    + ‘precomputed_nearest_neighbors’: interpret X as a sparse graph of precomputed distances, and construct a binary affinity matrix from the n_neighbors nearest neighbors of each instance.
    + one of the kernels supported by pairwise_kernels.

    The default method is “rbf”.

  • gamma (float, optional) – Kernel coefficient for rbf, poly, sigmoid, laplacian and chi2 kernels. Ignored for affinity=’nearest_neighbors’. Defaults to 0.01.

  • n_neighbors (int, optional) – Number of neighbors to use when constructing the affinity matrix using the nearest neighbors method. Ignored for affinity=’rbf’, defaults to 20.

  • degree (int, optional) – Degree of the polynomial kernel. Ignored by other kernels. Defaults to 3.

  • coef0 (int, optional) – Zero coefficient for polynomial and sigmoid kernels. Ignored by other kernels. Defaults to 1.

  • kernel_params (dict or str, optional) – Parameters (keyword arguments) and values for kernel passed as callable object. Ignored by other kernels. Defaults to None.

  • save_model (bool, optional) – Store the trained parameters of the model in a binary file, defaults to True.

Returns

Dataframe of shape (n_samples,) with cluster labels for each data point.

Return type

pandas.DataFrame | modin.pandas.dataframe.DataFrame
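
To make the cost of the affinity matrix concrete, the following sketch builds a dense RBF affinity matrix with NumPy, using the standard definition A[i, j] = exp(-gamma * ||x_i - x_j||^2), and estimates the memory needed for 10,000 samples. The helper rbf_affinity and the toy data are hypothetical and are not part of ULaMDyn.

```python
import numpy as np

def rbf_affinity(X, gamma=0.01):
    """Dense RBF affinity matrix: A[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negative round-off
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))  # toy data (assumption)
A = rbf_affinity(X, gamma=0.01)

# Memory needed for a dense float64 affinity matrix with 10,000 samples.
n_large = 10_000
gigabytes = n_large ** 2 * 8 / 1e9
```

The matrix has n_samples × n_samples entries, so memory and time grow quadratically with the dataset size. At 10,000 samples the dense matrix alone takes about 0.8 GB.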

class ulamdyn.unsup_models.DimensionalityReduction(data, n_samples=None, scaler=None, random_state=42, n_cpus=-1)

Bases: ulamdyn.unsup_models.Utils

Class used to find a low dimensional representation of MD trajectories data.

isomap(n_components=2, n_neighbors=10, neighbors_algorithm='auto', metric='cosine', p=2, metric_params=None, calc_error=False)

Perform a nonlinear dimensionality reduction through Isometric Mapping.

Parameters
  • n_components (int, optional) – Number of coordinates (features) for the low-dimensional manifold, defaults to 2.

  • n_neighbors (int, optional) – Number of neighbors to consider around each point, defaults to 10.

  • neighbors_algorithm (str, optional) – Method used for nearest neighbors search, defaults to “auto”.

  • metric (str or callable, optional) – The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. Defaults to “cosine”.

  • p (int, optional) – Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. Defaults to 2.

  • metric_params (dict, optional) – Additional keyword arguments for the metric function. Defaults to None.

  • calc_error (bool, optional) – If True, the reconstruction error between the original and the projected data will be calculated, defaults to False.

Returns

A new dataset with the transformed values where the coordinates of the low-dimensional manifold are stored in columns.

Return type

pandas.DataFrame | modin.pandas.dataframe.DataFrame
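
The role of the p parameter described above can be checked with a small NumPy sketch of the Minkowski distance. The helper minkowski_distance is hypothetical and written only for this illustration. Since p is documented as a parameter of the Minkowski metric, it presumably has no effect with other metrics such as the default “cosine”.

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski (l_p) distance between two 1-D arrays."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

d_manhattan = minkowski_distance(a, b, p=1)  # |3| + |4|
d_euclidean = minkowski_distance(a, b, p=2)  # sqrt(3^2 + 4^2)
```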

kpca(n_components=2, kernel='rbf', gamma=None, degree=4, coef0=1, kernel_params=None, alpha=1.0, fit_inverse_transform=False)

Perform a nonlinear dimensionality reduction using kernel PCA.

Parameters
  • n_components (int, optional) – Number of components (features) to keep after KPCA transformation, defaults to 2.

  • kernel (str, optional) – Kernel function used in the transformation. The possible values are ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘cosine’ or ‘precomputed’, defaults to “rbf”.

  • gamma (float, optional) – Kernel coefficient for rbf, poly and sigmoid kernels. Ignored by other kernels. If gamma is None, then it is set to 1/n_features. Defaults to None.

  • degree (int, optional) – Degree of polynomial kernel. Ignored by other kernels. Defaults to 4.

  • coef0 (int, optional) – Independent term in poly and sigmoid kernels. Ignored by other kernels. Defaults to 1.

  • kernel_params (dict, optional) – Parameters (keyword arguments) and values for kernel passed as callable object. Ignored by other kernels. Defaults to None.

  • alpha (float, optional) – Hyperparameter of the ridge regression that learns the inverse transform (when fit_inverse_transform=True), defaults to 1.0.

  • fit_inverse_transform (bool, optional) – If True, the inverse transform (the mapping from the reduced space back to the original feature space) is learned, using a ridge regression regularized by alpha. Defaults to False.

Returns

A new dataset with the transformed values where the selected components are stored in columns.

Return type

pandas.DataFrame | modin.pandas.dataframe.DataFrame
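
The interplay between alpha and fit_inverse_transform can be sketched with scikit-learn's KernelPCA, whose parameter names match those listed above. This is a sketch on random toy data, under the assumption that ULaMDyn forwards these arguments to scikit-learn. It does not show the ULaMDyn call itself.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))  # toy data (assumption)

# With fit_inverse_transform=True a ridge regression (regularized by
# alpha) is fitted to map reduced coordinates back to feature space.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=None,
                 alpha=1.0, fit_inverse_transform=True)
X_reduced = kpca.fit_transform(X)

# Approximate pre-images in the original feature space.
X_back = kpca.inverse_transform(X_reduced)
```

Without fit_inverse_transform=True, scikit-learn's inverse_transform is unavailable, because no exact inverse exists for a nonlinear kernel map.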

pca(n_components=2, calc_error=False, save_errors=False)

Perform a linear dimensionality reduction using principal component analysis.

Note

By default the percentage of variance explained by each of the selected components will be printed after the PCA analysis.

Parameters
  • n_components (int, optional) – Number of principal components to keep, defaults to 2.

  • calc_error (bool, optional) – If True, the reconstruction error between the original and the projected data will be calculated, defaults to False.

  • save_errors (bool, optional) – If True, save to a csv file the reconstruction error calculated for each sample, defaults to False.

Returns

A new dataset with the transformed values where the selected components are stored in columns.

Return type

pandas.DataFrame | modin.pandas.dataframe.DataFrame
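
What the explained variance and a per-sample reconstruction error measure can be sketched with a plain NumPy PCA. The function below is hypothetical. In particular, using the mean squared error per sample as the reconstruction error is an assumption, since the source does not state which definition ULaMDyn uses.

```python
import numpy as np

def pca_reconstruction_error(X, n_components=2):
    """PCA via SVD, returning projection, per-sample error, variance ratio."""
    X_centered = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    projected = X_centered @ components.T
    reconstructed = projected @ components
    # Assumption: error defined as the mean squared error per sample.
    errors = np.mean((X_centered - reconstructed) ** 2, axis=1)
    explained = S[:n_components] ** 2 / np.sum(S ** 2)
    return projected, errors, explained

rng = np.random.default_rng(0)
# Toy data (assumption): four features with very different variances.
X = rng.normal(size=(50, 4)) * np.array([5.0, 2.0, 0.5, 0.1])

projected, errors, explained = pca_reconstruction_error(X, n_components=2)
_, errors_full, explained_full = pca_reconstruction_error(X, n_components=4)
```

Keeping all components reproduces the centered data exactly, so the error vanishes and the explained variance ratios sum to one. Dropping components trades a nonzero error for a smaller representation.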

tsne(n_components=2, perplexity=40.0, learning_rate=200.0, n_iter=2000, n_iter_without_progress=400, metric='euclidean', init='pca', verbose=1, method='barnes_hut')

Perform the t-distributed Stochastic Neighbor Embedding analysis.

Parameters
  • n_components (int, optional) – Number of coordinates (features) for the low-dimensional embedding, defaults to 2.

  • perplexity (float, optional) – This hyperparameter is used to control the attention between local and global aspects of the data, in a certain sense, by guessing the number of close neighbors each point has. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. Different values can result in significantly different results. Defaults to 40.0.

  • learning_rate (float, optional) – The learning rate for t-SNE is usually in the range [10.0, 1000.0]. If the learning rate is too high, the data may look like a ‘ball’ with any point approximately equidistant from its nearest neighbours. If the learning rate is too low, most points may look compressed in a dense cloud with few outliers. If the cost function gets stuck in a bad local minimum, increasing the learning rate may help. Defaults to 200.0.

  • n_iter (int, optional) – Maximum number of iterations for the optimization. Should be at least 250. Defaults to 2000.

  • n_iter_without_progress (int, optional) – Maximum number of iterations without progress before we abort the optimization, used after 250 initial iterations with early exaggeration. Note that progress is only checked every 50 iterations so this value is rounded to the next multiple of 50. Defaults to 400.

  • metric (str or callable, optional) – The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist for its metric parameter, or a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. If metric is “precomputed”, X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them. Defaults to “euclidean”, which is interpreted as squared euclidean distance.

  • init (str, optional) – Initialization of embedding. Possible options are ‘random’, ‘pca’, and a numpy array of shape (n_samples, n_components). PCA initialization cannot be used with precomputed distances and is usually more globally stable than random initialization. Defaults to “pca”.

  • verbose (int, optional) – Verbosity level. Defaults to 1.

  • method (str, optional) – By default the gradient calculation algorithm uses the Barnes-Hut approximation running in O(N log N) time. method=’exact’ will run the slower, but exact, algorithm in O(N^2) time. The exact algorithm should be used when nearest-neighbor errors need to be better than 3%. However, the exact method cannot scale to millions of examples. Defaults to “barnes_hut”.

Returns

A new dataset with the transformed values where the coordinates of the low-dimensional manifold are stored in columns.

Return type

pandas.DataFrame | modin.pandas.dataframe.DataFrame
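
A minimal sketch of a t-SNE run with scikit-learn, using parameter names that match those listed above, is given below. The toy data is an assumption and the perplexity is lowered to suit the small sample, since it must be smaller than the number of samples. The iteration-count keyword is left at its default because its name differs between scikit-learn versions. This shows the underlying scikit-learn estimator, not the ULaMDyn call.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))  # toy data (assumption)

# Perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=20.0, learning_rate=200.0,
            metric="euclidean", init="pca", method="barnes_hut",
            random_state=42)
embedding = tsne.fit_transform(X)
```

Because the result depends strongly on perplexity and learning_rate, it is usually worth comparing embeddings obtained with a few different settings before interpreting any structure.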

ulamdyn.utilities module

Auxiliary functions and constants used by the main modules.

ulamdyn.utilities.get_labels_masses(traj_dir, n_atoms=None)
ulamdyn.utilities.get_num_atoms()
ulamdyn.utilities.get_nx_version(traj_dir)
ulamdyn.utilities.get_traj_dirs()
ulamdyn.utilities.read_h5_nx(traj)
ulamdyn.utilities.read_nx_control(traj_dir)

Module contents


ULaMDyn is a Python package built on top of sklearn, designed to perform data preprocessing, statistical analysis, and unsupervised learning analysis of (non-adiabatic) molecular dynamics simulations.

ULaMDyn consists of six general modules:

  • Data_Loader

  • Data_Writer

  • Statistics

  • Kinetics

  • Descriptors

  • Unsup_Models

ulamdyn.export(func)