ulamdyn package

Comprehensive API documentation with detailed information on how to use the functions and/or classes of ULaMDyn package. This is automatically generated from source code and comments.

Subpackages

Submodules

ulamdyn.data_loader module

The ulamdyn.data_loader module is used to build datasets from NAMD output files.

class ulamdyn.data_loader.GetCoords

Bases: BaseClass

Read the Cartesian coordinates from Newton-X MD trajectories.

Note

In the case of NX classical series, the Cartesian XYZ coordinates can be read either from dyn.out or dyn.xyz. For NX new series, the data will be read from the h5 file available in each TRAJ directory. Repeated geometries are skipped.

It also provides a function to calculate the root-mean-squared deviation (RMSD) between each geometry read from the MD trajectories and a reference geometry. Before calculating the RMSD, the two geometries are aligned using the Kabsch algorithm (https://en.wikipedia.org/wiki/Kabsch_algorithm).

This class does not require arguments in its constructor. All the outputs generate by the class are given in angstroms.

Data attributes:

trajectories (dict): contain indices of the trajectories available
with the respective maximum time to read.

labels (numpy.ndarray): stores the sequence of atom labels.

eq_xyz (numpy.ndarray): stores the XYZ matrix of the reference
geometry (geom.xyz).

xyz (numpy.ndarray): stores the XYZ matrices of all geometries read
from the TRAJ directories.

rmsd (numpy.ndarray): vector of RMSD values between all geometries
and the reference one.

dataset (pandas.dataframe): stores a dataframe of the flattened
XYZ matrices.

align_geoms(xyz_data=None) → None

Calculate the RMSD between the current and reference geometries.

Note

Before calculating the RMSD, the method uses the Kabsch algorithm to find the optimal alignment between the loaded molecular geometry for each time step t and the reference geometry.

The calculated RMSD values will be stored in the class attribute rmsd_values, while the xyz attribute will be updated with the aligned geometries.

build_dataframe() → None

Create a DataFrame containing flattened XYZ coordinates from all MD trajectories.

After running this function, the class attribute dataset will be updated with the loaded DataFrame object.

dataset

eq_xyz

from_dyn(outfile: str, tmax=100000) → Tuple[ndarray, ndarray, ndarray]

Get XYZ coordinates from the dyn.out file of classical Newton-X.

Parameters:: outfile (str) – name of the NX output file, dyn.out
Returns:: tuple containing an array of strings defining the atom labels, a tensor of size (n_steps, n_atoms, 3) with all XYZ coordinates of a single MD trajectory, and an array with the time steps.
Return type:: tuple in the form (numpy.ndarray, numpy.ndarray, numpy.ndarray)

static from_h5(outfile: str) → Tuple[ndarray, ndarray, ndarray]

Read the XYZ coordinates of the MD frames generated by Newton-X NS.

Parameters:: outfile (str) – name of the NX output file, usually dyn.h5
Returns:: tuple containing an array of strings defining the atom labels, a tensor of size (n_steps, n_atoms, 3) with all XYZ coordinates of a single MD trajectory, and an array with the time steps.
Return type:: tuple in the form (numpy.ndarray, numpy.ndarray, numpy.ndarray)

static from_xyz(xyzfile) → Tuple[ndarray, ndarray]

Get XYZ coordinates from the dyn.xyz file of classical Newton-X.

This method has the same parameters and return type as in from_dyn().

labels

read_all_trajs(calc_rmsd: bool = True) → None

Concatenate the XYZ coordinates read from all MD trajectories.

After running this method, the class attributes labels and xyz will be updated.

read_eq_geom() → None

Read the XYZ coordinates of a reference geometry.

Note

This method should be executed before calculating the RMSD with the align_geoms() function.

A file with name geom.xyz must be provided in the working directory (TRAJECTORIES). After reading the coordinates, the method will update the class attributes labels and eq_xyz.

rmsd

save_csv() → None

Save all loaded geometries (raw format) into a csv file.

If the RMSD has been calculated, it will be included as an extra column in the XYZ coordinates data set.

The default name of the output file is all_coordinates.csv.

traj_time

trajectories

xyz

class ulamdyn.data_loader.GetCouplings

Bases: BaseClass

Class used to read the Nonadiabatic Coupling Vectors (NAC) from the MD trajectories.

This class does not require arguments in its constructor. The outputs generate by the class is given in atomic units.

Data attributes:

trajectories (dict): contain indices of the trajectories available
with the respective maximum time to read.

all_nacs (dict): store the NAC matrices (per state) for all
MD trajectories.

dataset (dict): dictionary of dataframes with the flattened gradient
matrices.

all_nacs

build_dataframe(save_csv=False) → None

Generate a dataset (pandas.DataFrame object) with all NACs.

The XYZ matrices with the nonadiabatic couplings obtained for each molecular geometry is flattened into a one dimensional vector. In this way, every row of the dataset corresponds to one step of a given MD trajectory.

Parameters:: save_csv (bool, optional) – if True enable the output of all NAC dataframes in a csv format, defaults to False. The default name of the exported dataset is all_nacs_s[n].csv, where n is the number of the state.

datasets

static from_h5(outfile) → dict

Read NACs from one trajectory generated by the new Newton-X.

Parameters:: outfile (str) – name of the NX output file in the .h5 format, usually dyn.h5.
Returns:: stacked nonadiabatic coupling vectors (NAC) for all steps of a single MD trajectory, provided as a dictionary of tensors (numpy.ndarray, one for each state) with shape (n_steps, n_atoms, 3).
Return type:: dict

static from_nxlog(outfile: str, tmax=100000) → dict: Read NACs from one trajectory generated by the Newton-X CS.

read_all_trajs() → None: Read NACs from all MD trajectories and store into a dictionary.

trajectories

class ulamdyn.data_loader.GetGradients

Bases: BaseClass

Class used to collect QM gradients from Newton-X MD trajectories.

This class does not require arguments in its constructor. The outputs generate by the class is given in eV/angstrom.

Data attributes:

trajectories (dict): contain indices of the trajectories available
with the respective maximum time to read.

all_grads (dict): store the gradient matrices (per state) for all
MD trajectories.

dataset (dict): dictionary of dataframes with the flattened gradient
matrices.

all_grads

build_dataframe(save_csv=False) → None

Generate a dataset (pandas.DataFrame object) with all gradients.

The XYZ matrices with the gradients of each molecular geometry is flattened into a one dimensional vector, such that every row of the dataset corresponds to one step of the MD trajectories.

Parameters:: save_csv (bool, optional) – if True enable the output of all gradient dataframes in a csv format, defaults to False. The default name of the saved dataset is all_gradients_s[n].csv, where n is the number of the state.

datasets

static from_h5(outfile) → dict

Read XYZ gradients from a single trajectory generated by Newton-X NS

Parameters:: outfile (str) – name of the NX output file in the .h5 format, usually dyn.h5.
Returns:: stacked XYZ gradients for all steps of a single MD trajectory, returned as as a dictionary of tensors (numpy.ndarray, one for each state) with shape (n_steps, n_atoms, 3).
Return type:: dict

static from_nxlog(outfile, tmax=100000) → dict

Read XYZ gradients from one trajectory generated by Newton-X CS.

This method has the same parameters and return type as in from_h5().

Parameters:: outfile (str) – Output file from Newton-X MD trajectory, namely nx.log.
Returns:: dictionary of numpy.ndarray with shape (n_steps, n_atoms, 3), where the keys correspond to the state’s id.
Return type:: dict

read_all_trajs() → None: Read gradients from all MD trajectories and store into a dictionary.

trajectories

class ulamdyn.data_loader.GetProperties

Bases: BaseClass

Read all properties available in the Newton-X MD trajectories.

Note

This class does not require arguments in its constructor. All the energy quantities processed by the class are transformed from Ha to eV. For the other properties, the original units used in Newton-X are kept.

Data attributes:

trajectories (dict): trajectories ID (TRAJXX) found in the working
directory are stored as keys of the dictionary, and the corresponding values are the maximum simulation time to be read.

dataset (pd.dataframe): store a dataframe with all properties.

num_states (int): keep track of the number of states considered in
the MD simulation.

nx_version (str): identify the Newton-X version, can be either cs
(classical series) or ns (new series).

dataset

energies()

Read / process the energy information from the en.dat (classical NX) or .h5 (new NX) file.

Returns:: a processed dataset with the information of all trajectories stacked, and containing the following columns “TRAJ”, “time”, “State”, “Total_Energy” plus the energy gaps between the accessible states (e.g., DE12) and binary columns to identify the hopping points (e.g., Hops_S21).
Return type:: pandas.DataFrame | modin.pandas.dataframe.DataFrame

mcscf_coefs()

Collect the squared coefficients of the MCSCF wavefunction.

Note

This method works only for NAMD trajectories generated with Columbus.

Returns:: a dataset with the three highest MCSCF coefficients (the three largest) for each electronic state read from the NX output per state; if the class variable dataset has been already updated with some properties, the population data will be merged with the existing dataset.
Return type:: pandas.DataFrame | modin.pandas.dataframe.DataFrame

nac_norm()

Calculate the norm of the nonadiabatic coupling matrices for each pair of states.

After running this function, the class variable dataset will be updated with the loaded NACs norm data.

Returns:: a dataset with the Frobenius norm of the nonadiabatic coupling matrices of all NX trajectories calculated for each possible transition between states; the number of columns in the NACs norm dataset is given by n_states * (n_states - 1)/2.
Return type:: pandas.DataFrame | modin.pandas.dataframe.DataFrame

num_states

nx_version

oscillator_strength()

Collect the oscillator strength from properties file.

Note

The oscillator strength (OSS) is not always available in the output. If needed, check the Newton-Xdocumentation to see what are the required keywords and methods. The OSS data can be read from either classical or new series NX trajectories.

Returns:: a dataset with the oscillator strength information read from all available NX trajectories with one column for each transition between states; if the class variable dataset has been already updated with some properties, the oscillator strength data will be merged with the existing properties dataset.
Return type:: pandas.DataFrame | modin.pandas.dataframe.DataFrame

populations()

Read and calculate the population for each accessible state.

Note

If the MD simulations were performed with the classical Newton-X, the populations will be calculated using the wavefunction coefficients. In the new Newton-X, the populations are already calculated and available in the .h5 file.

Returns:: a dataset with the populations obtained from all available NX trajectories with one column per state; if the class variable dataset has been already updated with some properties, the population data will be merged with the existing dataset.
Return type:: pandas.DataFrame | modin.pandas.dataframe.DataFrame

property save_csv: None

Save the dataset with QM properties read from the NX trajectories.

The following properties are included in the dataset:

trajectory index;
simulation time;
total energy;
energy gaps between states (eV);
oscillator strength (if available);
states population.
norm of the nonadiabatic coupling matrices (if available);
three highest MCSCF coefficients per state (only for NX/Columbus);

trajectories

ulamdyn.data_writer module

class ulamdyn.data_writer.Geometries(atom_labels, properties_data=None, add_properties=[])

Bases: BaseClass

Handle and save XYZ coordinates for selected frames of MD trajectories.

save_xyz(geoms_array: ndarray, out_name: str = 'selected_geoms.xyz') → None

Save an XYZ file for a set of selected molecular geometries.

Note

The comment line of the XYZ file will contain a list of property values for each molecule.

Parameters:

geoms_array (numpy.ndarray) – a 3D array containing the list of XYZ matrices.
out_name (str) – name of the XYZ file containing all the selected geometries, defaults to selected_geoms.xyz.

ulamdyn.descriptors module

Module used to generate ML descriptors from molecular geometries.

class ulamdyn.descriptors.R2(all_geoms=None, use_mwc=False)

Bases: GetCoords

Class used to convert the XYZ coordinates of molecular geometries into R2-type of descriptors.

The R2 descriptor is defined as the (flattened) matrix of all pairwise Euclidean distances between all atoms in the molecule. Since the matrix is symmetric with respect to the interchange of atom indices (i.e., Dij = Dji) only the lower triangular portion of the R2 matrix will be outputed in the final data set.

This class also provides a method to compute other molecular descriptors derived from the R2 distance matrix. They are:

inverse R2 -> defined as \(1/R_{ij}\), similarly to the Coulomb
matrix descriptor.
delta R2 -> difference between the R2 vector of the current geometry in
time t and the equivalent R2 vector of a reference geometry (typically the ground-state geometry), \(R_{ij}(t) - R_{ij}(ref)\).
RE -> R2 vector normalized relative to equilibrium geometry,
\(R_{ij}(eq)/R_{ij}(t)\). For more details, check the reference J. Chem. Phys. 146, 244108 (2017).

apply_to_delta

build_descriptor(variant=None, apply_to_delta=None, save_csv=False)

Generate a dataframe with R2-based descriptors for all geometries.

Parameters:

variant (str, optional) – molecular representation derived from the R2 descriptor, defaults to None.
apply_to_delta (str, optional) – select a nonlinear function to apply as a transformation (sigmoid or hyperbolic tangent) on delta-R2 descriptors, defaults to None.
save_csv (bool, optional) – if True export the data set with all calculated descriptors in a csv format with name all_geoms_r2.csv, defaults to False.

Returns:

a dataframe object of shape (n_samples, n_atoms * (n_atoms - 1)/2), where each row is a vector with the R2-based descriptor computed for a given molecular geometry.

Return type:

pandas.DataFrame | modin.pandas.dataframe.DataFrame

inverse_transform(descriptor): Apply all transformations in reverse order to recover the original R2 vector.

mass_weights

norm_factor

r2_descriptor

r2_ref_geom

static reconstruct_xyz(r2_vec: ndarray) → ndarray

Reconstruct the XYZ coordinates from the pairwise atom distance vector.

Parameters:: r2_vec (numpy.ndarray) – Numpy 2D array of shape (n_samples, n_atoms * (n_atoms - 1)/2) that contains stacked R2 vectors.
Returns:: Numpy 3D array of shape (n_samples, n_atoms, 3) containing the molecular geometries transformed back into the Cartesian coordinate space.
Return type:: numpy.ndarray

variant

static xyz_to_distances(xyz_matrix: ndarray) → ndarray

Calculate the pairwise distance matrix for a given XYZ geometry.

Parameters:: xyz_matrix (numpy.ndarray) – Cartesian coordinates of a molecular structure given as a matrix of shape (n_atoms, 3).
Returns:: vector of size \(n_{atoms} (n_{atoms} - 1)/2\) containing the lower triangular portion of the R2 matrix.
Return type:: numpy.ndarray

class ulamdyn.descriptors.RingParams(ring_atom_ind: list, ring_coords: ndarray = None)

Bases: GetCoords

Class used to the Cremer-Pople parameters from ring coordinates.

build_dataframe(save_csv=False): Construct a data frame containing the Cremer-Pople parameters of all collected geometries.

dataset

displacement(coords: ndarray): Calculate the ring displacement (z)

eq_xyz

get_conf_5memb(phi) → str: Determine the class of a 5-membered ring deformation based on the CP parameters.

get_conf_6memb(theta, phi) → str: Determine the conformation class for a 6-membered ring based on the CP parameters.

get_pucker_coords(coords: ndarray) → ndarray: Calculate the Cremer-Pople puckering parameters for one ring.

labels

property ring_coords: ndarray: Filter the XYZ coordinates corresponding to the selected ring and translate the coordinates to the ring center.

rmsd

traj_time

trajectories

xyz

class ulamdyn.descriptors.SOAPDescriptor(all_geoms=None, atoms=None, r_cut=14, n_max=8, l_max=6, average='outer', rbf='polynomial')

Bases: GetCoords

Class used to generate SOAP (Smooth Overlap of Atomic Positions) descriptors for molecular geometries.

This class extends the GetCoords class, allowing it to handle molecular geometries and generate SOAP descriptors based on the atomic coordinates and species extracted from those geometries.

atoms

create_features() → DataFrame

Generate SOAP descriptors for the provided molecular geometries.

Returns:: Array of SOAP descriptors for each molecular geometry.
Return type:: numpy.ndarray

soap

species

class ulamdyn.descriptors.ZMatrix(all_geoms=None)

Bases: GetCoords

Class used to generate molecular descriptors using internal coordinates (Z-Matrix).

This class does not require arguments in its constructor. All quantities related to distances are given in angstrom, while the features derived from angles are provided in degrees.

In addition to the standard Z-Matrix, the class also provides a method to compute other variants of the Z-Matrix molecular descriptors:

delta Z-Matrix -> difference between the Z-Matrix representation of the current geometry in time t and the Z-Matrix of a reference geometry.
tanh Z-Matrix -> hyperbolic tangent transformation on all features of delta Z-Matrix.
sig Z-Matrix -> sigmoid transformation on all features of delta Z-Matrix.

Data attributes:

distancematrix (numpy.ndarray): stores the full matrix of bond
distances for all geometries.

connectivity (list): indices of connected atoms based on a distance
criterion of proximity.

angleconnectivity (list): indices of three neighboring atoms to
compute angles.

dihedralconnectivity (list): four indices of neighboring atoms to
calculate dihedrals.

zmat_ref_geom (numpy.ndarray): stores the Z-matrix calculated for
the reference geometry.

angleconnectivity

build_descriptor(delta=False, apply_to_delta=None, save_csv=False)

Construct the standard Z-Matrix descriptor and other variants.

Note

By default, the algorithm will calculate the three main components of the Z-Matrix: bond distances, angles and dihedrals. An augmented version of the Z-Matrix descriptor can be also obtained by calculating additional distances, angles, dihedrals and/or bending angles using the methods provided in the class.

Parameters:

delta (bool, optional) – if True, the Z_matrix feature vector of each geometry will be subtracted from the Z_Matrix of the reference geometry, defaults to False.
apply_to_delta (str, optional) – select a nonlinear function to apply as a transformation (sigmoid or hyperbolic tangent) on the delta Z-matrix, defaults to None.
save_csv (bool, optional) – if true save a single csv file named all_geoms_zmatrix.csv containing the Z-Matrix descriptors computed for all geometries available the MD trajectories., defaults to False.

Returns:

a dataframe object with the (flattened) Z-Matrix descriptors stacked for all MD geometries.

Return type:

pandas.DataFrame

connectivity

dihedralconnectivity

distancematrix

static get_angle(geom: ndarray, idx_atoms: list) → float64

Calculate the angle formed by three selected atoms.

Parameters:

geom (numpy.ndarray) – matrix of shape (natoms, 3) storing the XYZ coordinates of a single molecule.
idx_atoms (list) – a list of three indices corresponding to the atoms for which the angle will be calculated.

Returns:

angle (in degrees) between three selected atoms.

Return type:

numpy.float

static get_bending(geom: ndarray, idx_atoms: list) → float64

Calculate the bending angle between two planes of the molecule.

This method is particularly useful to describe large out-of-plane distortions in the molecular structure that involves more than four atoms. The bending angle is calculated by first defining two vectors, each one perpendicular to different molecular planes formed by two sets of three atoms. Then, the angle between the two vectors is obtained by calculating the inverse cosine of the scalar product between these vectors.

Note

By default, the bending angle is not used to construct the Z-Matrix descriptor. It can be used to construct an augmented version of the Z-Matrix that better captures changes in the molecular structure during the dynamics.

Parameters:

geom (numpy.ndarray) – matrix of shape (natoms, 3) storing the XYZ coordinates of a single molecule.
idx_atoms (list) – a list of lists with three atom indices in each, used to define two molecular planes for which the angle will be calculated.

Returns:

bending angle (in degrees) defined by six specified atoms.

Return type:

np.float

static get_dihedral(geom: ndarray, idx_atoms: list) → float64

Calculate the dihedral angle formed by four selected atoms.

Parameters:

geom (numpy.ndarray) – matrix of shape (natoms, 3) storing the XYZ coordinates of a single molecule.
idx_atoms (list) – a list of four indices to select the atoms for which the dihedral angle will be calculated.

Returns:

dihedral angle (in degrees) formed by four specified atoms.

Return type:

numpy.float

static get_distance(geom: ndarray, idx_atoms: list) → float64

Calculate the Euclidean distance between a pair of atoms.

Parameters:

geom (numpy.ndarray) – matrix of shape (natoms, 3) storing the XYZ coordinates of a single molecule.
idx_atoms (list) – a pair of indices corresponding to the atoms for which the distance will be calculated.

Returns:

Euclidean distance (in Angstrom) between two selected atoms.

Return type:

numpy.float

zmat_ref_geom

ulamdyn.kinetics module

class ulamdyn.kinetics.GetVelocities(n_atoms=None)

Bases: BaseClass

Class used to collect velocities from all NAMD trajectories generated by Newton-X.

build_dataframe(save_csv=False) → None

Generate a dataset (pandas.DataFrame object) containing all velocities.

The XYZ matrices with the velocities of each molecular geometry is flattened into a one dimensional vector, such that every row of the dataset corresponds to one step of the MD trajectories.

Parameters:: save_csv (bool, optional) – if True, the dataframe will be exported in a csv format, defaults to False. The default name of the saved dataset is all_velocities.csv.

dataset

classmethod from_all_trajs(n_atoms=None)

Wrapper function to collect the MD velocities without instantiating the class.

Parameters:: n_atoms (int, optional) – define the total number of atoms to consider in the velocities dataset, defaults to None. If None, the value will be taken from the geom.xyz file.
Returns:: the full velocities data set given as a tensor of shape (n_steps, n_atoms, 3).
Return type:: numpy.ndarray

from_dyn(outfile: str, tmax: float = 100000) → ndarray

Read velocities from a single trajectory of Newton-X.

Note

The final velocities dataset will contains information only for the first set of n_atoms, as defined by the class attribute n_atoms.

Parameters:

outfile (str) – name of the NX output file containing the velocities, dyn.out.
tmax (float) – Maximum time step to read from the NX output of a given trajectory.

Returns:

tensor of shape (n_steps, n_atoms, 3) with all velocities data.

Return type:

numpy.ndarray

from_h5(outfile: str) → ndarray

Read velocities from the .h5 file generated by the new Newton-X.

Note

The full velocities dataset will be sliced to select only the first set of n_atoms, as defined by the class attribute n_atoms.

Parameters:: outfile (str) – name of the NX output file, usually dyn.h5.
Returns:: tensor of shape (n_steps, n_atoms, 3) with all velocities data.
Return type:: numpy.ndarray

n_atoms

read_all_trajs()

Concatenate the XYZ velocities read from all available MD trajectories.

After running this method, the class attribute veloc will be updated with the full dataset of velocities read from the Newton-X output files.

trajectories

veloc

class ulamdyn.kinetics.KineticEnergy(n_atoms=None)

Bases: object

Class used to calculate the atomic or molecular components of the MD kinetic energy.

atom_labels

atom_mass

build_dataframe(discretization_level: str = 'atom', n_mols_per_type: int = None, n_atoms_per_mol: int = None, save_csv: bool = False) → DataFrame

Create a dataset containing the kinetic energies splitted by atoms or molecules.

Parameters:

discretization_level (str, optional) – how to split the kinetic energy, defaults to “atom”
n_mols_per_type (int, optional) – number of molecules per subsystem, typically one solute and several solvent molecules as in QM/MM, defaults to None.
n_atoms_per_mol (int, optional) – number of atoms in each molecular subsystem, defaults to None.
save_csv (bool, optional) – export the dataset with the computed kinetic energies as a csv file, defaults to False

Returns:

a structured dataset with the values of kinetic energies computed for all trajectories, where each row corresponds to one time step of a given trajectory and the columns refer to the discretization level (atoms or molecules).

Return type:

pandas.DataFrame | modin.pandas.dataframe.DataFrame

calc_per_atom() → ndarray

Calculate the kinetic energy of each atom for all MD trajectories available.

Returns:: matrix with the kinetic energy of the selected atoms arranged

in columns. :rtype: numpy.ndarray

calc_per_molecule(n_mols_per_type: list, n_atoms_per_mol: list) → ndarray

Sum the atomic contributions of kinetic energy for each molecule.

Note

In the current version, this function only supports two different types of molecules in the geom file, where the first set of atoms in the geom file should always correspond to the solute. The function will be generalized in future versions to support any number of different molecules.

Parameters:

n_mols_per_type (list) – give the number of molecules in each one of the two subsystems, typically one solute and several solvent molecules as in QM/MM.
n_atoms_per_mol (list) – how many atoms exist in each molecular subsystem.

Returns:

a matrix of shape (n_samples, n_mols_1 + n_mols_2) with the kinetic energy of each molecule in the system arranged in columns.

Return type:

numpy.ndarray

energies

n_atoms

trajectories

class ulamdyn.kinetics.VibrationalSpectra(n_atoms=None)

Bases: GetVelocities

Calculate the vibrational density of states for each MD trajectory.

build_dataframe(save_csv=False)

Create a dataset with information of vibrational spectra per MD trajectory.

Parameters:: save_csv (bool, optional) – if True, export the dataset with the computed spectra to a csv file, defaults to False.
Returns:: two-columns dataframe storing the vibrational frequencies (cm^-1) in the first columns and the density of states in the second one (a.u.)
Return type:: pandas.DataFrame | modin.pandas.dataframe.DataFrame

calc_all_trajs(mass_weighted=True): Calculate the vibrational spectra for each MD trajectory.

static calc_pdos(Vel: ndarray, dt: float, mass: ndarray = None) → ndarray

Calculate the power spectrum of the velocity auto-correlation function.

Parameters:

Vel (numpy.ndarray) – tensor of shape (n_steps, n_atoms, 3) with atomic velocities collected from a single MD trajectory.
dt (numpy.ndarray) – time step used to integrate the classical equations.
mass – if provided, rescale the velocities by the atomic mass of each specie; the default is None.

Returns:

two-columns array containing the calculated frequencies (cm^-1) in the first columns and the density of states in the second one (a.u.)

Return type:

numpy.ndarray

dataset

n_atoms

trajectories

veloc

ulamdyn.nx_utils module

Module with auxiliary functions and constants used to process Newton-X data.

ulamdyn.nx_utils.check_nx_trajs() → dict

Check the status of each trajectory and set the maximum time step to read accordingly.

Note

If a two-columns file named ‘trajs_tmax.dat’ is available in the working directory, the data available for each trajectory will be collected until the tmax value defined in the file. Otherwise, the data will be read until the last step of the dynamics before any error message found in the output file.

Returns:: A dictionary with the trajectory ID as the key and the tmax as the values.
Return type:: dict

ulamdyn.nx_utils.get_labels_masses(traj_dir: str, n_atoms: int = None) → Tuple[ndarray, ndarray]

Extract the atom labels and atomic masses from the NX geometry input.

Parameters:

traj_dir (str) – Name of the TRAJ directory from which the geometry file will be read.
n_atoms (_type_, optional) – If provided, the first n_atoms of the XYZ coordinates matrix will be selected and returned, defaults to None.

Returns:

Two separate numpy arrays containing the atom labels and the respective masses.

Return type:

Tuple[np.ndarray, np.ndarray]

ulamdyn.nx_utils.get_num_atoms() → int

Get the number of atoms from a reference ‘geom.xyz’ file.

Returns:: The number of atoms in a molecule or fragment of it.
Return type:: int

ulamdyn.nx_utils.get_nx_version(traj_dir: str) → str

Check Newton-X version.

Parameters:: traj_dir (str) – Name of the TRAJ directory in which the configuration file will be verified to identify the NX version.
Returns:: Identifier of the NX version. It can be either ‘cs’ (Classical series) or ‘ns’ (New series).
Return type:: str

ulamdyn.nx_utils.get_traj_dirs() → list

Generate a list of TRAJ directories to be processed.

Returns:: A list with the name of all ‘TRAJXX’ directories (sorted in ascending order) generated by the NX simulations available in the current directory.
Return type:: list

ulamdyn.nx_utils.read_h5_nx(traj_dir: str)

Read trajectory data from h5 file generated by Newton-X new series.

Parameters:: traj_dir – Name of the TRAJ directory from which the geometry file will be read.
Returns:: H5py data object where trajectory data can be accessed via keywords, similarly to a dictionary.
Return type:: h5py._hl.files.File

ulamdyn.nx_utils.read_nx_control(traj_dir: str) → dict

Read and store the NX simulation settings from the configuration file.

Parameters:: traj_dir (str) – Name of the TRAJ directory from which the geometry file will be read.
Returns:: Dictionary with all Newton-X keywords and corresponding values read from the control.dyn file (NX-CS) or from the nx_exported_config.nml / user_config.nml file (NX-NS).
Return type:: dict

ulamdyn.statistics module

Module to perform statistical analysis of the MD datasets.

ulamdyn.statistics.aggregate_data(data: DataFrame, vars_to_group: list = ['time']) → DataFrame

Group data based on variables to calculate the statistical descriptors.

The statistical quantities calculated by this function are mean and median to describe the central tendency, and standard deviation to measure the variability or dispersion of the data. In addition, the skewness and kurtosis of the distribution are also computed. So each feature of the original input data will be unfolded into five new columns identified with the suffixes ‘_median’, ‘_mean’, ‘_std’, ‘_skew’, and ‘_kurt’.

Parameters:

data (pandas.DataFrame) – input dataset containing the information extracted from all MD trajectories.
vars_to_group (list, optional) – set of variables used by the function to group the data, defaults to [“time”]

Returns:

dataframe object with statistical description of the input data

Return type:

pandas.DataFrame | modin.pandas.dataframe.DataFrame

ulamdyn.statistics.bootstrap(dataframe, n_samples=None, n_repeats=1000, save_csv=False)

Estimate the mean by resampling the data with replacement.

Parameters:

dataframe (pandas.DataFrame | modin.pandas.dataframe.DataFrame) – input dataset having the quantities extrated from all available MD trajectories; the dataset must contain the ‘TRAJ’ and ‘time’ columns
n_samples (int, optional) – number of trajectories considered in each round of the resampling, defaults to None
n_repeats (int, optional) – number of resampling to be performed, defaults to 1000
save_csv (bool, optional) – if true outputs the final bootstrapped data in a csv file, defaults to False

Returns:

dataset containing all the bootstrapped data with estimated mean

Return type:

pandas.DataFrame | modin.pandas.dataframe.DataFrame

ulamdyn.statistics.calc_avg_occupations(df)

Calculate the fraction of trajectories in each state as a function of time.

Note

The fraction of trajectories (occupation) is an important quantity to assess the quality of the surface hopping (SH) simulations. It should be compared to the population of each state averaged over all trajectories, which is provided by the method aggregate_data(). If the ensemble of SH trajectories is statistically converged, the occupation and the average population should match.

Parameters:: df (pandas.DataFrame | modin.pandas.dataframe.DataFrame) – properties dataset with information collected from the available MD trajectories
Returns:: a new dataset with the occupations calculated for each state
Return type:: pandas.DataFrame | modin.pandas.dataframe.DataFrame

ulamdyn.statistics.create_bootstrap_stats(boot_data, ci_level=95)

Generate the descriptive statistics for the bootstrapped data within a given confidence interval.

Parameters:

boot_data (pandas.DataFrame | modin.pandas.dataframe.DataFrame) – bootstrapped dataframe generate by the bootstrap() function
ci_level (int, optional) – Size of the confidence interval to draw when aggregating the data (by time) to estimate the statistical descriptors (median, mean, std), defaults to 95

Returns:

dataframe grouped by the simulation time with each feature of the original dataset unfolded into three columns representing the statistical descriptors and other two columns storing the values for the lowest and highest variability as defined by the confidence interval

Return type:

pandas.DataFrame | modin.pandas.dataframe.DataFrame

ulamdyn.statistics.create_stats(selected_data, save_csv=False)

Create datasets with statistical summary of the requested quantities.

Parameters:

selected_data (str) – type of data used as input to compute the descriptive statistical properties using the aggregate_data() function
save_csv (bool, optional) – if true exports the calculated statistics for each dataframe in a csv format, defaults to false

Returns:

one or several datasets containing the median, mean and standard deviation calculated for each feature of the input as a function of time; if the attribute is equal to ‘all’, the statistics will be calculated for the properties and descriptors (R2 and Z-Matrix) datasets.

Return type:

dict(pandas.DataFrame) | dict(modin.pandas.dataframe.DataFrame)

ulamdyn.statistics.stats_hopping(dataframe)

Generate a statistical summary for the hopping points.

Parameters:: dataframe (pandas.DataFrame | modin.pandas.dataframe.DataFrame) – properties dataset with information collected from the available MD trajectories
Returns:: a new dataset containing the total number of hops per trajectory and the minimum and maximum time in which the hops between each pair of electronic states occur.
Return type:: pandas.DataFrame | modin.pandas.dataframe.DataFrame

Module contents

ULaMDyn - Unsupervised Learning analysis for Molecular Dynamics

ULaMDyn is a python package built on top of sklearn designed to perform data preprocessing, statistical and unsupervised learning analysis of (non-adiabatic) molecular dynamics simulations.

ULaMDyn is composed of several Python modules:

Data_Loader
Data_Writer
Statistics
Kinetics
Descriptors
Normal mode analysis
Unsup_Models
NMA

class ulamdyn.ClusterGeoms(data, dt=None, indices=None, n_samples=None, scaler=None, random_state=51, n_cpus=-1, verbosity=0)

Bases: Utils

Class to find groups of similar geometries in MD trajectories data

gaussian_mixture(n_clusters=5, covariance='full', tol=0.0001, n_init=10, max_iter=500, init='k-means++', save_model=True)

Perform probabilist clustering in geometry space with GMM

Parameters:

n_clusters (int, optional) – The number of clusters to find, which corresponds to the number of mixed gaussians. Defaults to 5.
covariance (str, optional) – String describing the type of covariance parameters to use, default is “full”. Acceptable values are “full”, “tied”, “diag”, “spherical” (equivalent to K-Means).
tol (float, optional) – Convergence criteria of the lower bound average gain, below which the EM iterations stop, defaults to 1e-4.
n_init (int, optional) – The number of initializations to perform, where best results are kept. The default is 10.
max_iter (int, optional) – The number of EM iterations to perform. Defaults to 500.
init – The method used to initialize the weights, the means and the precisions, default is “k-means++”. Acceptable strings are “k-means”, “k-means++”, “random”, or “‘random_from_data”.

hierarchical(n_clusters=5, affinity='euclidean', connectivity=None, linkage='complete', distance_threshold=None, save_model=True)

Perform a hierarchical cluster analysis based on the agglomerative

Parameters:

n_clusters (int, optional) – The number of clusters to find. It must be None if distance_threshold is not None, defaults to 5.
affinity (str, optional) – Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”. If linkage is “ward”, only “euclidean” is accepted. If “precomputed”, a distance matrix (instead of a similarity matrix) is needed as input for the fit method, defaults to “euclidean”.
connectivity (array-like or callable, optional) – Connectivity matrix. Defines for each sample the neighboring samples following a given structure of the data. This can be a connectivity matrix itself or a callable that transforms the data into a connectivity matrix, such as derived from kneighbors_graph. Default is None, meaning that hierarchical clustering algorithm is unstructured.
linkage (str, optional) –
Define the linkage criterion to build the tree. It determines which distance to use between sets of observation. The algorithm will merge the pairs of cluster that minimize this criterion. The options are : + ‘ward’ -> minimizes the variance of the clusters

being merged.
- ’average’ -> uses the average of the distances of
  each observation of the two sets.
- ’complete’ or ‘maximum’ -> uses the maximum distances
  
  between all observations
  
  of the two sets.
- ’single’ -> uses the minimum of the distances between
  all observations of the two sets.
The default is “complete”.
distance_threshold (float, optional) – The linkage distance threshold above which, clusters will not be merged. If not None, n_clusters must be None and compute_full_tree must be True. Defaults to None.
save_model (bool, optional) – Store the trained parameters of the model in a binary file. Defaults to True.

Returns:

Dataframe of shape (n_samples,) with cluster labels for each data point.

Return type:

pandas.DataFrame | modin.pandas.dataframe.DataFrame

kmeans(n_clusters=5, init='k-means++', n_init=100, max_iter=1000, convergence=1e-06, save_model=True)

Perform K-Means clustering in geometry space.

Parameters:

n_clusters (int, list or str optional) – Specifies the number of clusters (and centroids) to form. Defaults to 5. If a list is provided, K-Means will run for each value in the list. If set to ‘best’, the algorithm will perform multiple runs with n_clusters ranging from 2 to 15. The best result is selected based on silhouette and Calinski-Harabasz scores.
init (str or array, optional) – Method for initialization : + ‘k-means++’ -> selects initial cluster centers for K-means clustering in a smart way to speed up convergence. + ‘random’ -> choose n_clusters observations (rows) at random from data for the initial centroids. + If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers. Defaults to “k-means++”.
n_init (int, optional) – Number of time the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of loss function, defaults to 500.
max_iter (int, optional) – Maximum number of iterations of the k-means algorithm for a single run, defaults to 1000.
convergence (float, optional) – Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence. Defaults to 1e-06.
save_model (bool, optional) – Store the trained parameters of the model in a binary file, defaults to True.

Returns:

Dataframe of shape (n_samples,) with cluster labels for each data point.

Return type:

pandas.DataFrame | modin.pandas.dataframe.DataFrame

spectral(n_clusters=5, n_components=10, n_init=100, affinity='rbf', gamma=0.01, n_neighbors=10, degree=3, coef0=1, kernel_params=None, save_model=True)

Apply clustering to a projection of the normalized Laplacian.

Note that Spectral Clustering is a highly expensive method due to the computation of the affinity matrix. Hence, this method is recommended only for small to medium size datasets (n_samples < 10000).

Note

This method is equivalent to kernel K-means (DOI: 10.1145/1014052.1014118). Spectral clustering is recommended for non-linearly separable dataset, where the individual clusters have a highly non-convex shape.

Parameters:

n_clusters (int, optional) – The number of clusters to form which in this case corresponds to the dimension of the projection subspace. Defaults to 5. If a list is passed, spectral clustering will be run for all n_clusters in the list, whereas if the argument is equal to ‘best’, consecutive runs will be performed with n_clusters varying in the range of [2, 15]. In both cases, the final results will be the best output labels with respect to the clustering performance on the silhouette and Calinski-Harabasz scores.
n_components (int, optional) – Number of eigenvectors to use for the spectral embedding. Defaults to 10
n_init (int, optional) – Number of time the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia. Only used if assign_labels=’kmeans’. Defaults to 100.
affinity (str or callable, optional) –
Method used to construct the affinity matrix. The available options are : + ‘nearest_neighbors’: construct the affinity matrix

by computing a graph of nearest neighbors.
- ’rbf’: construct the affinity matrix using a radial basis function (RBF) kernel.
- ’precomputed_nearest_neighbors’: interpret X as a sparse graph of precomputed distances, and construct a binary affinity matrix from the n_neighbors nearest neighbors of each instance.
- one of the kernels supported by pairwise_kernels.
The default method is “rbf”.
gamma (float, optional) – Kernel coefficient for rbf, poly, sigmoid, laplacian and chi2 kernels. Ignored for affinity=’nearest_neighbors’. Defaults to 0.01.
n_neighbors (int, optional) – Number of neighbors to use when constructing the affinity matrix using the nearest neighbors method. Ignored for affinity=’rbf’. Defaults to 20.
degree (int, optional) – Degree of the polynomial kernel. Ignored by other kernels. Defaults to 3.
coef0 (int, optional) – Zero coefficient for polynomial and sigmoid kernels. Ignored by other kernels. Defaults to 1.
kernel_params (dict or str, optional) – Parameters (keyword arguments) and values for kernel passed as callable object. Ignored by other kernels. Defaults to None.
save_model (bool, optional) – Store the trained parameters of the model in a binary file. Defaults to True.

Returns:

Dataframe of shape (n_samples,) with cluster labels for each data point.

Return type:

pandas.DataFrame | modin.pandas.dataframe.DataFrame

class ulamdyn.ClusterTrajs(data, dt=None, scaler=None, random_state=42, n_cpus=-1, verbosity=0)

Bases: Utils

Class used to find groups of similar trajectories in NAMD data.

kmeans(n_clusters=3, metric='dtw', metric_params=None, n_init=5, max_iter=100, convergence=1e-06, save_model=True)

K-means clustering to group similar MD trajectories.

Parameters:

n_clusters (int, optional) – Number of clusters to form, defaults to 3.
metric (str, optional) – Metric to be used for both cluster assignment and barycenter computation. Options: “euclidean”, “dtw”, “softdtw”. Defaults to “dtw”.
metric_params (dict or None, optional) – Parameter values for the chosen metric. Defaults to None.
n_init (int, optional) – Number of time the K-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia. Defaults to 5.
max_iter (int, optional) – Maximum number of iterations of the k-means algorithm for a single run. Defaults to 100.
convergence (float, optional) – Inertia variation threshold. If at some point, inertia varies less than this threshold between two consecutive iterations, the model is considered to have converged and the algorithm stops. Defaults to 1e-6.
save_model (bool, optional) – Store the trained parameters of the model in a binary file. Defaults to True.

Returns:

Dataframe of shape (n_trajs,) with cluster labels assigned to each trajectory.

Return type:

pandas.DataFrame | modin.pandas.dataframe.DataFrame

transform()

Convert input data to time series format (tslearn) and apply scaler methods.

Returns:: Three dimensional numpy array with shape (n_ts, n_steps, n_features)
Return type:: numpy.ndarray

class ulamdyn.DimensionReduction(data, dt=None, n_samples=None, scaler=None, random_state=42, n_cpus=-1)

Bases: Utils

Find low dimensional representation of MD trajectories data.

isomap(n_components=2, n_neighbors=30, neighbors_algorithm='auto', metric=None, p=2, metric_params=None, calc_error=False)

Perform a nonlinear dimensionality reduction with Isometric Mapping

Parameters:

n_components (int, optional) – Number of coordinates (features) for the low-dimensional manifold, defaults to 2.
n_neighbors (int, optional) – Number of neighbors to consider around each point, defaults to 12.
neighbors_algorithm (str, optional) – Method used for nearest neighbors search, defaults to “auto”
metric (str or callable, optional) – The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. Defaults to “cosine”.
p (int, optional) – Parameter for the Minkowski metric from sklearn.metrics.pairwise pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p minkowski_distance (l_p) is used. Defaults to 2.
metric_params (dict, optional) – Additional keyword arguments for the metric function. Defaults to None.
calc_error (bool, optional) – If True, the reconstruction error between the original and the projected data will be calculated defaults to False.

Returns:

a new dataset with the transformed values where the coordinates of the low-dimensional manifold are stored in columns.

Return type:

pandas.DataFrame | modin.pandas.dataframe.DataFrame

kpca(n_components=2, kernel='rbf', gamma=None, degree=4, coef0=1, kernel_params=None, alpha=1.0, fit_inverse_transform=False)

Perform a nonlinear dimensionality reduction using kernel PCA.

Parameters:

n_components (int, optional) – Number of components (features) to keep after KPCA transformation, defaults to 2.
kernel (str, optional) – Kernel function used in the transformation. The possible values are ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘cosine’ or precomputed’, defaults to “rbf”.
gamma (float, optional) – Kernel coefficient for rbf, poly and sigmoid kernels. Ignored by other kernels. If gamma is None, then it is set to 1/n_features. Defaults to None.
degree (int, optional) – Degree of polynomial kernel. Ignored by other kernels. Defaults to 4.
coef0 (int, optional) – Independent term in poly and sigmoid kernels. Ignored by other kernels. Defaults to 1.
kernel_params (dict, optional) – Parameters (keyword arguments) and values for kernel passed as callable object. Ignored by other kernels. Defaults to None.
alpha (float, optional) – Hyperparameter of the ridge regression that learns the inverse transform (when fit_inverse_transform=True), defaults to 1.0.
fit_inverse_transform (bool, optional) – Hyperparameter of the ridge regression that learns the inverse transform (when fit_inverse_transform=True), defaults to False.

Returns:

a new dataset with the transformed values where the selected components are stored in columns.

Return type:

pandas.DataFrame | modin.pandas.dataframe.DataFrame

pca(n_components=2, calc_error=False, save_errors=False)

Perform linear dimension reduction with principal component analysis

Note

By default the percentage of variance explained by each of the selected components will be printed after the PCA analysis.

Parameters:

n_components (int, optional.) – Number of principal components to keep defaults to 2.
calc_error (bool, optional) – If True, the reconstruction error between the original and the projected data will be calculated defaults to False.
save_errors (bool, optional) – If True, save the reconstruction error calculated for each sample to a csv file, defaults to False.

Returns:

a new dataset with the transformed values where the selected components are stored in columns.

Return type:

pandas.DataFrame | modin.pandas.dataframe.DataFrame

tsne(n_components=2, perplexity=40.0, learning_rate=180.0, n_iter=2000, n_iter_without_progress=400, metric='euclidean', init='pca', verbose=1, method='barnes_hut')

Perform the t-distributed Stochastic Neighbor Embedding analysis.

Parameters:

n_components (int, optional) – Number of coordinates (features) for the low-dimensional embedding, defaults to 2.
perplexity (float, optional) – This hyperparameter is used to control the attention between local and global aspects of the data, in a certain sense, by guessing the number of close neighbors each point has. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. Different values can result in significantly different results. Defaults to 40.
learning_rate (float, optional) – The learning rate for t-SNE is usually in the range [10.0, 1000.0]. If the learning rate is too high, the data may look like a ‘ball’ with any point approximately equidistant from its nearest neighbours. If the learning rate is too low, most points may look compressed in a dense cloud with few outliers. If the cost function gets stuck in a bad local minimum increasing the learning rate may help. Defaults to 200.0
n_iter (int, optional) – Maximum number of iterations for the optimization. Should be at least 250. Defaults to 2000.
n_iter_without_progress (int, optional) – Maximum number of iterations without progress before we abort the optimization, used after 250 initial iterations with early exaggeration. Note that progress is only checked every 50 iterations so this value is rounded to the next multiple of 50. Defaults to 400.
metric (str or callable, optional) – The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by scipy. spatial.distance.pdist for its metric parameter, or a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. If metric is “precomputed”, X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them. The default is “euclidean” which is interpreted as squared euclidean distance. Defaults to “euclidean”.
init (str, optional) – Initialization of embedding. Possible options are ‘random’, ‘pca’, and a numpy array of shape (n_samples, n_components). PCA initialization cannot be used with precomputed distances and is usually more globally stable than random initialization. Defaults to “pca”.
verbose (int, optional) – Verbosity level. Defaults to 1
method (str, optional) – By default the gradient calculation algorithm uses Barnes-Hut approximation running in O(NlogN) time. method=’exact’ will run on the slower, but exact, algorithm in O(N^2) time. The exact algorithm should be used when nearest-neighbor errors need to be better than 3%. However, the exact method cannot scale to millions of examples. Defaults to “barnes_hut”.

Returns:

a new dataset with the transformed values where the coordinates of the low-dimensional manifold are stored in columns.

Return type:

pandas.DataFrame | modin.pandas.dataframe.DataFrame

class ulamdyn.GeomSampling(descriptor, n_samples, n_clusters, transform=None, use_mwc=False, feat_scaler=None)

Bases: object

Class used to sample new molecular geometries from GMM clustering.

gen_geometries(num_geoms, from_cluster=0, where_prop='')

property properties_data

save_xyz(filename: str = 'sampled_geoms.xyz')

class ulamdyn.Geometries(atom_labels, properties_data=None, add_properties=[])

Bases: BaseClass

Handle and save XYZ coordinates for selected frames of MD trajectories.

save_xyz(geoms_array: ndarray, out_name: str = 'selected_geoms.xyz') → None

Save an XYZ file for a set of selected molecular geometries.

Note

The comment line of the XYZ file will contain a list of property values for each molecule.

Parameters:

geoms_array (numpy.ndarray) – a 3D array containing the list of XYZ matrices.
out_name (str) – name of the XYZ file containing all the selected geometries, defaults to selected_geoms.xyz.

class ulamdyn.GetCoords

Bases: BaseClass

Read the Cartesian coordinates from Newton-X MD trajectories.

Note

In the case of NX classical series, the Cartesian XYZ coordinates can be read either from dyn.out or dyn.xyz. For NX new series, the data will be read from the h5 file available in each TRAJ directory. Repeated geometries are skipped.

It also provides a function to calculate the root-mean-squared deviation (RMSD) between each geometry read from the MD trajectories and a reference geometry. Before calculating the RMSD, the two geometries are aligned using the Kabsch algorithm (https://en.wikipedia.org/wiki/Kabsch_algorithm).

This class does not require arguments in its constructor. All the outputs generate by the class are given in angstroms.

Data attributes:

trajectories (dict): contain indices of the trajectories available
with the respective maximum time to read.

labels (numpy.ndarray): stores the sequence of atom labels.

eq_xyz (numpy.ndarray): stores the XYZ matrix of the reference
geometry (geom.xyz).

xyz (numpy.ndarray): stores the XYZ matrices of all geometries read
from the TRAJ directories.

rmsd (numpy.ndarray): vector of RMSD values between all geometries
and the reference one.

dataset (pandas.dataframe): stores a dataframe of the flattened
XYZ matrices.

align_geoms(xyz_data=None) → None

Calculate the RMSD between the current and reference geometries.

Note

Before calculating the RMSD, the method uses the Kabsch algorithm to find the optimal alignment between the loaded molecular geometry for each time step t and the reference geometry.

The calculated RMSD values will be stored in the class attribute rmsd_values, while the xyz attribute will be updated with the aligned geometries.

build_dataframe() → None

Create a DataFrame containing flattened XYZ coordinates from all MD trajectories.

After running this function, the class attribute dataset will be updated with the loaded DataFrame object.

dataset

eq_xyz

from_dyn(outfile: str, tmax=100000) → Tuple[ndarray, ndarray, ndarray]

Get XYZ coordinates from the dyn.out file of classical Newton-X.

Parameters:: outfile (str) – name of the NX output file, dyn.out
Returns:: tuple containing an array of strings defining the atom labels, a tensor of size (n_steps, n_atoms, 3) with all XYZ coordinates of a single MD trajectory, and an array with the time steps.
Return type:: tuple in the form (numpy.ndarray, numpy.ndarray, numpy.ndarray)

static from_h5(outfile: str) → Tuple[ndarray, ndarray, ndarray]

Read the XYZ coordinates of the MD frames generated by Newton-X NS.

Parameters:: outfile (str) – name of the NX output file, usually dyn.h5
Returns:: tuple containing an array of strings defining the atom labels, a tensor of size (n_steps, n_atoms, 3) with all XYZ coordinates of a single MD trajectory, and an array with the time steps.
Return type:: tuple in the form (numpy.ndarray, numpy.ndarray, numpy.ndarray)

static from_xyz(xyzfile) → Tuple[ndarray, ndarray]

Get XYZ coordinates from the dyn.xyz file of classical Newton-X.

This method has the same parameters and return type as in from_dyn().

labels

read_all_trajs(calc_rmsd: bool = True) → None

Concatenate the XYZ coordinates read from all MD trajectories.

After running this method, the class attributes labels and xyz will be updated.

read_eq_geom() → None

Read the XYZ coordinates of a reference geometry.

Note

This method should be executed before calculating the RMSD with the align_geoms() function.

A file with name geom.xyz must be provided in the working directory (TRAJECTORIES). After reading the coordinates, the method will update the class attributes labels and eq_xyz.

rmsd

save_csv() → None

Save all loaded geometries (raw format) into a csv file.

If the RMSD has been calculated, it will be included as an extra column in the XYZ coordinates data set.

The default name of the output file is all_coordinates.csv.

traj_time

trajectories

xyz

class ulamdyn.GetCouplings

Bases: BaseClass

Class used to read the Nonadiabatic Coupling Vectors (NAC) from the MD trajectories.

This class does not require arguments in its constructor. The outputs generate by the class is given in atomic units.

Data attributes:

trajectories (dict): contain indices of the trajectories available
with the respective maximum time to read.

all_nacs (dict): store the NAC matrices (per state) for all
MD trajectories.

dataset (dict): dictionary of dataframes with the flattened gradient
matrices.

all_nacs

build_dataframe(save_csv=False) → None

Generate a dataset (pandas.DataFrame object) with all NACs.

The XYZ matrices with the nonadiabatic couplings obtained for each molecular geometry is flattened into a one dimensional vector. In this way, every row of the dataset corresponds to one step of a given MD trajectory.

Parameters:: save_csv (bool, optional) – if True enable the output of all NAC dataframes in a csv format, defaults to False. The default name of the exported dataset is all_nacs_s[n].csv, where n is the number of the state.

datasets

static from_h5(outfile) → dict

Read NACs from one trajectory generated by the new Newton-X.

Parameters:: outfile (str) – name of the NX output file in the .h5 format, usually dyn.h5.
Returns:: stacked nonadiabatic coupling vectors (NAC) for all steps of a single MD trajectory, provided as a dictionary of tensors (numpy.ndarray, one for each state) with shape (n_steps, n_atoms, 3).
Return type:: dict

static from_nxlog(outfile: str, tmax=100000) → dict: Read NACs from one trajectory generated by the Newton-X CS.

read_all_trajs() → None: Read NACs from all MD trajectories and store into a dictionary.

trajectories

class ulamdyn.GetGradients

Bases: BaseClass

Class used to collect QM gradients from Newton-X MD trajectories.

This class does not require arguments in its constructor. The outputs generate by the class is given in eV/angstrom.

Data attributes:

trajectories (dict): contain indices of the trajectories available
with the respective maximum time to read.

all_grads (dict): store the gradient matrices (per state) for all
MD trajectories.

dataset (dict): dictionary of dataframes with the flattened gradient
matrices.

all_grads

build_dataframe(save_csv=False) → None

Generate a dataset (pandas.DataFrame object) with all gradients.

The XYZ matrices with the gradients of each molecular geometry is flattened into a one dimensional vector, such that every row of the dataset corresponds to one step of the MD trajectories.

Parameters:: save_csv (bool, optional) – if True enable the output of all gradient dataframes in a csv format, defaults to False. The default name of the saved dataset is all_gradients_s[n].csv, where n is the number of the state.

datasets

static from_h5(outfile) → dict

Read XYZ gradients from a single trajectory generated by Newton-X NS

Parameters:: outfile (str) – name of the NX output file in the .h5 format, usually dyn.h5.
Returns:: stacked XYZ gradients for all steps of a single MD trajectory, returned as as a dictionary of tensors (numpy.ndarray, one for each state) with shape (n_steps, n_atoms, 3).
Return type:: dict

static from_nxlog(outfile, tmax=100000) → dict

Read XYZ gradients from one trajectory generated by Newton-X CS.

This method has the same parameters and return type as in from_h5().

Parameters:: outfile (str) – Output file from Newton-X MD trajectory, namely nx.log.
Returns:: dictionary of numpy.ndarray with shape (n_steps, n_atoms, 3), where the keys correspond to the state’s id.
Return type:: dict

read_all_trajs() → None: Read gradients from all MD trajectories and store into a dictionary.

trajectories

class ulamdyn.GetProperties

Bases: BaseClass

Read all properties available in the Newton-X MD trajectories.

Note

This class does not require arguments in its constructor. All the energy quantities processed by the class are transformed from Ha to eV. For the other properties, the original units used in Newton-X are kept.

Data attributes:

trajectories (dict): trajectories ID (TRAJXX) found in the working
directory are stored as keys of the dictionary, and the corresponding values are the maximum simulation time to be read.

dataset (pd.dataframe): store a dataframe with all properties.

num_states (int): keep track of the number of states considered in
the MD simulation.

nx_version (str): identify the Newton-X version, can be either cs
(classical series) or ns (new series).

dataset

energies()

Read / process the energy information from the en.dat (classical NX) or .h5 (new NX) file.

Returns:: a processed dataset with the information of all trajectories stacked, and containing the following columns “TRAJ”, “time”, “State”, “Total_Energy” plus the energy gaps between the accessible states (e.g., DE12) and binary columns to identify the hopping points (e.g., Hops_S21).
Return type:: pandas.DataFrame | modin.pandas.dataframe.DataFrame

mcscf_coefs()

Collect the squared coefficients of the MCSCF wavefunction.

Note

This method works only for NAMD trajectories generated with Columbus.

Returns:: a dataset with the three highest MCSCF coefficients (the three largest) for each electronic state read from the NX output per state; if the class variable dataset has been already updated with some properties, the population data will be merged with the existing dataset.
Return type:: pandas.DataFrame | modin.pandas.dataframe.DataFrame

nac_norm()

Calculate the norm of the nonadiabatic coupling matrices for each pair of states.

After running this function, the class variable dataset will be updated with the loaded NACs norm data.

Returns:: a dataset with the Frobenius norm of the nonadiabatic coupling matrices of all NX trajectories calculated for each possible transition between states; the number of columns in the NACs norm dataset is given by n_states * (n_states - 1)/2.
Return type:: pandas.DataFrame | modin.pandas.dataframe.DataFrame

num_states

nx_version

oscillator_strength()

Collect the oscillator strength from properties file.

Note

The oscillator strength (OSS) is not always available in the output. If needed, check the Newton-Xdocumentation to see what are the required keywords and methods. The OSS data can be read from either classical or new series NX trajectories.

Returns:: a dataset with the oscillator strength information read from all available NX trajectories with one column for each transition between states; if the class variable dataset has been already updated with some properties, the oscillator strength data will be merged with the existing properties dataset.
Return type:: pandas.DataFrame | modin.pandas.dataframe.DataFrame

populations()

Read and calculate the population for each accessible state.

Note

If the MD simulations were performed with the classical Newton-X, the populations will be calculated using the wavefunction coefficients. In the new Newton-X, the populations are already calculated and available in the .h5 file.

Returns:: a dataset with the populations obtained from all available NX trajectories with one column per state; if the class variable dataset has been already updated with some properties, the population data will be merged with the existing dataset.
Return type:: pandas.DataFrame | modin.pandas.dataframe.DataFrame

property save_csv: None

Save the dataset with QM properties read from the NX trajectories.

The following properties are included in the dataset:

trajectory index;
simulation time;
total energy;
energy gaps between states (eV);
oscillator strength (if available);
states population.
norm of the nonadiabatic coupling matrices (if available);
three highest MCSCF coefficients per state (only for NX/Columbus);

trajectories

class ulamdyn.GetVelocities(n_atoms=None)

Bases: BaseClass

Class used to collect velocities from all NAMD trajectories generated by Newton-X.

build_dataframe(save_csv=False) → None

Generate a dataset (pandas.DataFrame object) containing all velocities.

The XYZ matrices with the velocities of each molecular geometry is flattened into a one dimensional vector, such that every row of the dataset corresponds to one step of the MD trajectories.

Parameters:: save_csv (bool, optional) – if True, the dataframe will be exported in a csv format, defaults to False. The default name of the saved dataset is all_velocities.csv.

dataset

classmethod from_all_trajs(n_atoms=None)

Wrapper function to collect the MD velocities without instantiating the class.

Parameters:: n_atoms (int, optional) – define the total number of atoms to consider in the velocities dataset, defaults to None. If None, the value will be taken from the geom.xyz file.
Returns:: the full velocities data set given as a tensor of shape (n_steps, n_atoms, 3).
Return type:: numpy.ndarray

from_dyn(outfile: str, tmax: float = 100000) → ndarray

Read velocities from a single trajectory of Newton-X.

Note

The final velocities dataset will contains information only for the first set of n_atoms, as defined by the class attribute n_atoms.

Parameters:

outfile (str) – name of the NX output file containing the velocities, dyn.out.
tmax (float) – Maximum time step to read from the NX output of a given trajectory.

Returns:

tensor of shape (n_steps, n_atoms, 3) with all velocities data.

Return type:

numpy.ndarray

from_h5(outfile: str) → ndarray

Read velocities from the .h5 file generated by the new Newton-X.

Note

The full velocities dataset will be sliced to select only the first set of n_atoms, as defined by the class attribute n_atoms.

Parameters:: outfile (str) – name of the NX output file, usually dyn.h5.
Returns:: tensor of shape (n_steps, n_atoms, 3) with all velocities data.
Return type:: numpy.ndarray

n_atoms

read_all_trajs()

Concatenate the XYZ velocities read from all available MD trajectories.

After running this method, the class attribute veloc will be updated with the full dataset of velocities read from the Newton-X output files.

trajectories

veloc

class ulamdyn.KineticEnergy(n_atoms=None)

Bases: object

Class used to calculate the atomic or molecular components of the MD kinetic energy.

atom_labels

atom_mass

build_dataframe(discretization_level: str = 'atom', n_mols_per_type: int = None, n_atoms_per_mol: int = None, save_csv: bool = False) → DataFrame

Create a dataset containing the kinetic energies splitted by atoms or molecules.

Parameters:

discretization_level (str, optional) – how to split the kinetic energy, defaults to “atom”
n_mols_per_type (int, optional) – number of molecules per subsystem, typically one solute and several solvent molecules as in QM/MM, defaults to None.
n_atoms_per_mol (int, optional) – number of atoms in each molecular subsystem, defaults to None.
save_csv (bool, optional) – export the dataset with the computed kinetic energies as a csv file, defaults to False

Returns:

a structured dataset with the values of kinetic energies computed for all trajectories, where each row corresponds to one time step of a given trajectory and the columns refer to the discretization level (atoms or molecules).

Return type:

pandas.DataFrame | modin.pandas.dataframe.DataFrame

calc_per_atom() → ndarray

Calculate the kinetic energy of each atom for all MD trajectories available.

Returns:: matrix with the kinetic energy of the selected atoms arranged

in columns. :rtype: numpy.ndarray

calc_per_molecule(n_mols_per_type: list, n_atoms_per_mol: list) → ndarray

Sum the atomic contributions of kinetic energy for each molecule.

Note

In the current version, this function only supports two different types of molecules in the geom file, where the first set of atoms in the geom file should always correspond to the solute. The function will be generalized in future versions to support any number of different molecules.

Parameters:

n_mols_per_type (list) – give the number of molecules in each one of the two subsystems, typically one solute and several solvent molecules as in QM/MM.
n_atoms_per_mol (list) – how many atoms exist in each molecular subsystem.

Returns:

a matrix of shape (n_samples, n_mols_1 + n_mols_2) with the kinetic energy of each molecule in the system arranged in columns.

Return type:

numpy.ndarray

energies

n_atoms

trajectories

class ulamdyn.NormalModeAnalysis

Bases: GetCoords

dataset

eq_xyz

labels

make_stats(by: list = ['time'], time_intervals: list = [])

rmsd

run(molden_ref: str = 'freq.molden')

Perform the Normal Mode analysis over all MD geometries.

Parameters:: molden_ref (str, optional) – Name of molden file containing the calculated frequencies for the reference geometry, defaults to freq.molden.

save_csv() → None: Alias method to save all the datasets into a csv file

traj_time

trajectories

xyz

class ulamdyn.R2(all_geoms=None, use_mwc=False)

Bases: GetCoords

Class used to convert the XYZ coordinates of molecular geometries into R2-type of descriptors.

The R2 descriptor is defined as the (flattened) matrix of all pairwise Euclidean distances between all atoms in the molecule. Since the matrix is symmetric with respect to the interchange of atom indices (i.e., Dij = Dji) only the lower triangular portion of the R2 matrix will be outputed in the final data set.

This class also provides a method to compute other molecular descriptors derived from the R2 distance matrix. They are:

inverse R2 -> defined as \(1/R_{ij}\), similarly to the Coulomb
matrix descriptor.
delta R2 -> difference between the R2 vector of the current geometry in
time t and the equivalent R2 vector of a reference geometry (typically the ground-state geometry), \(R_{ij}(t) - R_{ij}(ref)\).
RE -> R2 vector normalized relative to equilibrium geometry,
\(R_{ij}(eq)/R_{ij}(t)\). For more details, check the reference J. Chem. Phys. 146, 244108 (2017).

apply_to_delta

build_descriptor(variant=None, apply_to_delta=None, save_csv=False)

Generate a dataframe with R2-based descriptors for all geometries.

Parameters:

variant (str, optional) – molecular representation derived from the R2 descriptor, defaults to None.
apply_to_delta (str, optional) – select a nonlinear function to apply as a transformation (sigmoid or hyperbolic tangent) on delta-R2 descriptors, defaults to None.
save_csv (bool, optional) – if True export the data set with all calculated descriptors in a csv format with name all_geoms_r2.csv, defaults to False.

Returns:

a dataframe object of shape (n_samples, n_atoms * (n_atoms - 1)/2), where each row is a vector with the R2-based descriptor computed for a given molecular geometry.

Return type:

pandas.DataFrame | modin.pandas.dataframe.DataFrame

inverse_transform(descriptor): Apply all transformations in reverse order to recover the original R2 vector.

mass_weights

norm_factor

r2_descriptor

r2_ref_geom

static reconstruct_xyz(r2_vec: ndarray) → ndarray

Reconstruct the XYZ coordinates from the pairwise atom distance vector.

Parameters:: r2_vec (numpy.ndarray) – Numpy 2D array of shape (n_samples, n_atoms * (n_atoms - 1)/2) that contains stacked R2 vectors.
Returns:: Numpy 3D array of shape (n_samples, n_atoms, 3) containing the molecular geometries transformed back into the Cartesian coordinate space.
Return type:: numpy.ndarray

variant

static xyz_to_distances(xyz_matrix: ndarray) → ndarray

Calculate the pairwise distance matrix for a given XYZ geometry.

Parameters:: xyz_matrix (numpy.ndarray) – Cartesian coordinates of a molecular structure given as a matrix of shape (n_atoms, 3).
Returns:: vector of size \(n_{atoms} (n_{atoms} - 1)/2\) containing the lower triangular portion of the R2 matrix.
Return type:: numpy.ndarray

class ulamdyn.RingParams(ring_atom_ind: list, ring_coords: ndarray = None)

Bases: GetCoords

Class used to the Cremer-Pople parameters from ring coordinates.

build_dataframe(save_csv=False): Construct a data frame containing the Cremer-Pople parameters of all collected geometries.

dataset

displacement(coords: ndarray): Calculate the ring displacement (z)

eq_xyz

get_conf_5memb(phi) → str: Determine the class of a 5-membered ring deformation based on the CP parameters.

get_conf_6memb(theta, phi) → str: Determine the conformation class for a 6-membered ring based on the CP parameters.

get_pucker_coords(coords: ndarray) → ndarray: Calculate the Cremer-Pople puckering parameters for one ring.

labels

property ring_coords: ndarray: Filter the XYZ coordinates corresponding to the selected ring and translate the coordinates to the ring center.

rmsd

traj_time

trajectories

xyz

class ulamdyn.SOAPDescriptor(all_geoms=None, atoms=None, r_cut=14, n_max=8, l_max=6, average='outer', rbf='polynomial')

Bases: GetCoords

Class used to generate SOAP (Smooth Overlap of Atomic Positions) descriptors for molecular geometries.

This class extends the GetCoords class, allowing it to handle molecular geometries and generate SOAP descriptors based on the atomic coordinates and species extracted from those geometries.

atoms

create_features() → DataFrame

Generate SOAP descriptors for the provided molecular geometries.

Returns:: Array of SOAP descriptors for each molecular geometry.
Return type:: numpy.ndarray

soap

species

class ulamdyn.VibrationalSpectra(n_atoms=None)

Bases: GetVelocities

Calculate the vibrational density of states for each MD trajectory.

build_dataframe(save_csv=False)

Create a dataset with information of vibrational spectra per MD trajectory.

Parameters:: save_csv (bool, optional) – if True, export the dataset with the computed spectra to a csv file, defaults to False.
Returns:: two-columns dataframe storing the vibrational frequencies (cm^-1) in the first columns and the density of states in the second one (a.u.)
Return type:: pandas.DataFrame | modin.pandas.dataframe.DataFrame

calc_all_trajs(mass_weighted=True): Calculate the vibrational spectra for each MD trajectory.

static calc_pdos(Vel: ndarray, dt: float, mass: ndarray = None) → ndarray

Calculate the power spectrum of the velocity auto-correlation function.

Parameters:

Vel (numpy.ndarray) – tensor of shape (n_steps, n_atoms, 3) with atomic velocities collected from a single MD trajectory.
dt (numpy.ndarray) – time step used to integrate the classical equations.
mass – if provided, rescale the velocities by the atomic mass of each specie; the default is None.

Returns:

two-columns array containing the calculated frequencies (cm^-1) in the first columns and the density of states in the second one (a.u.)

Return type:

numpy.ndarray

dataset

n_atoms

trajectories

veloc

class ulamdyn.ZMatrix(all_geoms=None)

Bases: GetCoords

Class used to generate molecular descriptors using internal coordinates (Z-Matrix).

This class does not require arguments in its constructor. All quantities related to distances are given in angstrom, while the features derived from angles are provided in degrees.

In addition to the standard Z-Matrix, the class also provides a method to compute other variants of the Z-Matrix molecular descriptors:

delta Z-Matrix -> difference between the Z-Matrix representation of the current geometry in time t and the Z-Matrix of a reference geometry.
tanh Z-Matrix -> hyperbolic tangent transformation on all features of delta Z-Matrix.
sig Z-Matrix -> sigmoid transformation on all features of delta Z-Matrix.

Data attributes:

distancematrix (numpy.ndarray): stores the full matrix of bond
distances for all geometries.

connectivity (list): indices of connected atoms based on a distance
criterion of proximity.

angleconnectivity (list): indices of three neighboring atoms to
compute angles.

dihedralconnectivity (list): four indices of neighboring atoms to
calculate dihedrals.

zmat_ref_geom (numpy.ndarray): stores the Z-matrix calculated for
the reference geometry.

angleconnectivity

build_descriptor(delta=False, apply_to_delta=None, save_csv=False)

Construct the standard Z-Matrix descriptor and other variants.

Note

By default, the algorithm will calculate the three main components of the Z-Matrix: bond distances, angles and dihedrals. An augmented version of the Z-Matrix descriptor can be also obtained by calculating additional distances, angles, dihedrals and/or bending angles using the methods provided in the class.

Parameters:

delta (bool, optional) – if True, the Z_matrix feature vector of each geometry will be subtracted from the Z_Matrix of the reference geometry, defaults to False.
apply_to_delta (str, optional) – select a nonlinear function to apply as a transformation (sigmoid or hyperbolic tangent) on the delta Z-matrix, defaults to None.
save_csv (bool, optional) – if true save a single csv file named all_geoms_zmatrix.csv containing the Z-Matrix descriptors computed for all geometries available the MD trajectories., defaults to False.

Returns:

a dataframe object with the (flattened) Z-Matrix descriptors stacked for all MD geometries.

Return type:

pandas.DataFrame

connectivity

dihedralconnectivity

distancematrix

static get_angle(geom: ndarray, idx_atoms: list) → float64

Calculate the angle formed by three selected atoms.

Parameters:

geom (numpy.ndarray) – matrix of shape (natoms, 3) storing the XYZ coordinates of a single molecule.
idx_atoms (list) – a list of three indices corresponding to the atoms for which the angle will be calculated.

Returns:

angle (in degrees) between three selected atoms.

Return type:

numpy.float

static get_bending(geom: ndarray, idx_atoms: list) → float64

Calculate the bending angle between two planes of the molecule.

This method is particularly useful to describe large out-of-plane distortions in the molecular structure that involves more than four atoms. The bending angle is calculated by first defining two vectors, each one perpendicular to different molecular planes formed by two sets of three atoms. Then, the angle between the two vectors is obtained by calculating the inverse cosine of the scalar product between these vectors.

Note

By default, the bending angle is not used to construct the Z-Matrix descriptor. It can be used to construct an augmented version of the Z-Matrix that better captures changes in the molecular structure during the dynamics.

Parameters:

geom (numpy.ndarray) – matrix of shape (natoms, 3) storing the XYZ coordinates of a single molecule.
idx_atoms (list) – a list of lists with three atom indices in each, used to define two molecular planes for which the angle will be calculated.

Returns:

bending angle (in degrees) defined by six specified atoms.

Return type:

np.float

static get_dihedral(geom: ndarray, idx_atoms: list) → float64

Calculate the dihedral angle formed by four selected atoms.

Parameters:

geom (numpy.ndarray) – matrix of shape (natoms, 3) storing the XYZ coordinates of a single molecule.
idx_atoms (list) – a list of four indices to select the atoms for which the dihedral angle will be calculated.

Returns:

dihedral angle (in degrees) formed by four specified atoms.

Return type:

numpy.float

static get_distance(geom: ndarray, idx_atoms: list) → float64

Calculate the Euclidean distance between a pair of atoms.

Parameters:

geom (numpy.ndarray) – matrix of shape (natoms, 3) storing the XYZ coordinates of a single molecule.
idx_atoms (list) – a pair of indices corresponding to the atoms for which the distance will be calculated.

Returns:

Euclidean distance (in Angstrom) between two selected atoms.

Return type:

numpy.float

zmat_ref_geom