ulamdyn package¶
Comprehensive API documentation with detailed information on how to use the functions and/or classes of ULaMDyn package. This is automatically generated from source code and comments.
Submodules¶
ulamdyn.data_loader module¶
- class ulamdyn.data_loader.GetCoords¶
Bases:
object
Class used to read the Cartesian coordinates from Newton-X MD trajectories.
Note
The Cartesian XYZ coordinates can be read either from dyn.out/dyn.xyz files generated by the classical NX or from the h5 file of new NX. Repeated geometries will be skipped.
It also provides a function to calculate the root-mean-squared deviation (RMSD) between each geometry read from the MD trajectories and a reference geometry. Before calculating the RMSD, the two geometries are aligned using the Kabsch algorithm (https://en.wikipedia.org/wiki/Kabsch_algorithm).
This class does not require arguments in its constructor. All the outputs generate by the class are given in angstroms.
trajectories
(list): trajectories ID (TRAJXX) available in the working directory.labels
(numpy.ndarray): stores the sequence of atom labels.eq_xyz
(numpy.ndarray): stores the XYZ matrix of the reference geometry (geom.xyz).xyz
(numpy.ndarray): stores the XYZ matrices of all geometries read from the TRAJ directories.rmsd
(numpy.ndarray): vector of RMSD values between all geometries and the reference one.dataset
(pandas.dataframe): stores a dataframe of the flattened XYZ matrices.- property align_geoms: None¶
Calculate the RMSD between the current and reference geometries.
Note
Before calculating the RMSD, the method uses the Kabsch algorithm to find the optimal alignment between the loaded molecular geometry for each time step t and the reference geometry.
The calculated RMSD values will be stored in the class attribute
rmsd_values
, while thexyz
attribute will be updated with the aligned geometries.
- build_dataframe() None ¶
Create a pandas DataFrame containing the XYZ coordinates from all trajectories.
After running this function, the class attribute
dataset
will be updated with the loaded DataFrame object.
- dataset¶
- eq_xyz¶
- from_dyn(outfile) Tuple[numpy.ndarray, numpy.ndarray] ¶
Get XYZ coordinates from the dyn.out file of classical Newton-X.
- Parameters
outfile (str) – name of the NX output file, dyn.out
- Returns
tuple containing an array of strings defining the atom labels, and a tensor of size (n_steps, n_atoms, 3) with all XYZ coordinates of a single MD trajectory.
- Return type
tuple in the form (numpy.ndarray, numpy.ndarray)
- static from_h5(outfile) Tuple[numpy.ndarray, numpy.ndarray] ¶
Read the XYZ coordinates from the .h5 file generated by the new Newton-X.
- Parameters
outfile (str) – name of the NX output file, usually dyn.h5
- Returns
tuple containing an array of strings defining the atom labels, and a tensor of shape (n_steps, n_atoms, 3) with all XYZ coordinates of a single MD trajectory.
- Return type
tuple in the form (numpy.ndarray, numpy.ndarray)
- static from_xyz(xyzfile) Tuple[numpy.ndarray, numpy.ndarray] ¶
Get XYZ coordinates from the dyn.xyz file of classical Newton-X.
This method has the same parameters and return type as in
from_dyn()
.
- labels¶
- read_all_trajs() None ¶
Concatenate the XYZ coordinates read from all available MD trajectories.
After running this method, the class attributes
labels
andxyz
will be updated.
- read_eq_geom() None ¶
Read the XYZ coordinates of a reference geometry.
Note
This method should be executed before calculating the RMSD with the
align_geoms()
function.A file with name geom.xyz must be provided in the working directory (TRAJECTORIES). After reading the coordinates, the method will update the class attributes
labels
andeq_xyz
.
- rmsd¶
- property save_csv: None¶
Save all loaded geometries (raw format) into a csv file.
If the RMSD has been calculated, it will be included as an extra column in the XYZ coordinates data set.
The default name of the output file is all_coordinates.csv.
- trajectories¶
- xyz¶
- class ulamdyn.data_loader.GetCouplings¶
Bases:
object
Class used to read the Nonadiabatic Coupling Vectors (NAC) from the MD trajectories.
This class does not require arguments in its constructor. The outputs generate by the class is given in atomic units.
trajectories
(list): trajectories ID (TRAJXX) available in the working directory.all_nacs
(dict): store the NAC matrices (per state) for all MD trajectories.dataset
(dict): dictionary of dataframes with the flattened gradient matrices.- all_nacs¶
- build_dataframe(save_csv=False) None ¶
Generate a dataset (pandas.DataFrame object) with all NACs.
The XYZ matrices with the nonadiabatic couplings obtained for each molecular geometry is flattened into a one dimensional vector. In this way, every row of the dataset corresponds to one step of a given MD trajectory.
- Parameters
save_csv (bool, optional) – if True enable the output of all NAC dataframes in a csv format, defaults to False. The default name of the outputed dataset is all_nacs_s[n].csv, where n is the number of the state.
- datasets¶
- static from_h5(outfile) dict ¶
Read NACs from one trajectory generated by the new Newton-X.
- Parameters
outfile (str) – name of the NX output file in the .h5 format, usually dyn.h5.
- Returns
stacked nonadiabatic coupling vectors (NAC) for all steps of a single MD trajectory, provided as a dictionary of tensors (numpy.ndarray, one for each state) with shape (n_steps, n_atoms, 3).
- Return type
- static from_nxlog(outfile) dict ¶
Read NACs from one trajectory generated by the classical Newton-X.
- trajectories¶
- class ulamdyn.data_loader.GetGradients¶
Bases:
object
Class used to read the QM gradients from Newton-X MD trajectories.
This class does not require arguments in its constructor. The outputs generate by the class is given in eV/angstrom.
trajectories
(list): trajectories ID (TRAJXX) available in the working directory.all_grads
(dict): store the gradient matrices (per state) for all MD trajectories.dataset
(dict): dictionary of dataframes with the flattened gradient matrices.- all_grads¶
- build_dataframe(save_csv=False) None ¶
Generate a dataset (pandas.DataFrame object) with all gradients.
The XYZ matrices with the gradients of each molecular geometry is flattened into a one dimensional vector, such that every row of the dataset corresponds to one step of the MD trajectories.
- Parameters
save_csv (bool, optional) – if True enable the output of all gradient dataframes in a csv format, defaults to False. The default name of the outputed dataset is all_gradients_s[n].csv, where n is the number of the state.
- datasets¶
- static from_h5(outfile) dict ¶
Read XYZ gradients from a single trajectory generated by the new Newton-X.
- static from_nxlog(outfile) dict ¶
Read XYZ gradients from one trajectory generated by the old Newton-X.
This method has the same parameters and return type as in
from_h5()
.
- trajectories¶
- class ulamdyn.data_loader.GetProperties¶
Bases:
object
Class used to read all properties available in the Newton-X MD trajectories.
Note
This class does not require arguments in its constructor. All the energy quantities processed by the class are transformed from Ha to eV. For the other properties, the original units used in Newton-X are kept.
trajectories
(list): trajectories ID (TRAJXX) available in the working directory.dataset
(pd.dataframe): store a dataframe with all available properties.num_states
(int): keep track of the number of states considered in the MD simulation.nx_version
(str): identify the Newton-X version, can be either cs (classical series) or ns (new series).- dataset¶
- energies()¶
Read / process the energy information from the en.dat (classical NX) or .h5 (new NX) file.
- Returns
a processed dataset with the information of all trajectories stacked, and containing the following columns “TRAJ”, “time”, “State”, “Total_Energy” plus the energy gaps between the accessible states (e.g., DE12) and binary columns to identify the hopping points (e.g., Hops_S21).
- Return type
- mcscf_coefs()¶
Collect the squared coefficients of the MCSCF wavefunction.
Note
This method works only for MD trajectories generated with Columbus.
- Returns
a dataset with the three highest MCSCF coefficients for each electronic state read from the NX output per state; if the class variable
dataset
has been already updated with some properties, the population data will be merged with the existing dataset.- Return type
- nac_norm()¶
Calculate the norm of the nonadiabatic coupling matrices for each pair of states.
After running this function, the class variable
dataset
will be updated with the loaded NACs norm data.- Returns
a dataset with the Frobenius norm of the nonadiabatic coupling matrices of all NX trajectories calculated for each possible transition between states; the number of columns in the NACs norm dataset is given by n_states * (n_states - 1)/2.
- Return type
- num_states¶
- nx_version¶
- oscillator_strength()¶
Collect the oscillator strength from properties file.
Note
The oscillator strength (OSS) is not always available in the output. If needed, check the Newton-Xdocumentation to see what are the required keywords and methods. The OSS data can be read from either classical or new series NX trajectories.
- Returns
a dataset with the oscillator strength information read from all available NX trajectories with one column for each transition between states; if the class variable
dataset
has been already updated with some properties, the oscillator strength data will be merged with the existing properties dataset.- Return type
- populations()¶
Read and calculate the population for each accessible state.
Note
If the MD simulations were performed with the classical Newton-X, the populations will be calculated using the wavefunction coefficients. In the new Newton-X, the populations are already calculated and available in the .h5 file.
- Returns
a dataset with the populations obtained from all available NX trajectories with one column per state; if the class variable
dataset
has been already updated with some properties, the population data will be merged with the existing dataset.- Return type
- property save_csv: None¶
Save the dataset with all QM properties read from the Newton-X trajectories.
The following properties are included in the dataset:
trajectory index;
simulation time;
total energy;
energy gaps between states (eV);
oscillator strength (if available);
states population.
norm of the nonadiabatic coupling matrices (if available);
three highest MCSSCF coefficients per state (only for NX/Columbus);
- trajectories¶
ulamdyn.data_writer module¶
- class ulamdyn.data_writer.Geometries(atom_labels, add_properties=[])¶
Bases:
object
Handle and save XYZ coordinates for selected frames of MD trajectories.
- save_xyz(geoms_array, properties_data=None, out_name='selected_geoms.xyz')¶
Save an XYZ file for a set of selected molecular geometries.
Note
The comment line of the XYZ file will contain a list of property values for each molecule.
- Parameters
geoms_array (numpy.ndarray) – a 3D array containing the list of XYZ matrices.
properties_data (pandas.DataFrame) – dataframe containing the property values of the selected geometries, defaults to None.
out_name (str) – name of the XYZ file containing all the selected geometries, defaults to selected_geoms.xyz.
ulamdyn.descriptors module¶
Classes and methods used to ML descriptors from molecular geometries.
- class ulamdyn.descriptors.R2¶
Bases:
ulamdyn.data_loader.GetCoords
Class used to convert the XYZ coordinates of molecular geometries into R2-type of descriptors.
The R2 descriptor is defined as the (flattened) matrix of all pairwise Euclidean distances between all atoms in the molecule. Since the matrix is symmetric with respect to the interchange of atom indices (i.e., Dij = Dji) only the lower triangular portion of the R2 matrix will be outputed in the final data set.
This class also provides a method to compute other molecular descriptors derived from the R2 distance matrix. They are:
inverse R2 -> defined as \(1/R_{ij}\), similarly to the Coulomb matrix descriptor.
delta R2 -> difference between the R2 vector of the current geometry in time t and the equivalent R2 vector of a reference geometry (typically the ground-state geometry), \(R_{ij}(t) - R_{ij}(ref)\).
RE -> R2 vector normalized relative to equilibrium geometry, \(R_{ij}(eq)/R_{ij}(t)\). For more details, check the reference J. Chem. Phys. 146, 244108 (2017).
- build_descriptor(all_geoms: numpy.ndarray, variant=None, save_csv=False)¶
Generate a dataframe with R2-based descriptors for all molecular geometries.
- Parameters
all_geoms (numpy.ndarray) – tensor of shape (nsamples, natoms, 3) containing the stacked XYZ coordinates read from all available MD trajectories.
variant (str, optional) – type of R2 descriptor, defaults to None.
save_csv (bool, optional) – if True export the data set with all calculated descriptors in a csv format with name pairwise_distances.csv, defaults to False.
- Returns
a dataframe object of shape (nsamples, natoms * (natoms - 1)/2), where each row is a vector with the R2-based descriptor computed for a given molecular geometry.
- Return type
- r2_descriptor¶
- r2_ref_geom¶
- static xyz_to_distances(xyz_matrix: numpy.ndarray) numpy.ndarray ¶
Calculate the pairwise distance matrix for a given XYZ geometry.
- Parameters
xyz_matrix (numpy.ndarray) – Cartesian coordinates of a molecular structure given as a matrix of shape (n_atoms, 3).
- Returns
vector of size \(n_{atoms} (n_{atoms} - 1)/2\) containing the lower triangular portion of the R2 matrix.
- Return type
- class ulamdyn.descriptors.ZMatrix¶
Bases:
ulamdyn.data_loader.GetCoords
Class used to generate molecular descriptors using internal coordinates (Z-Matrix).
This class does not require arguments in its constructor. All quantities related to distances are given in angstrom, while the features derived from angles are provided in degrees.
In addition to the standard Z-Matrix, the class also provides a method to compute other variants of the Z-Matrix molecular descriptors:
delta Z-Matrix -> difference between the Z-Matrix representation of the current geometry in time t and the Z-Matrix of a reference geometry.
tanh Z-Matrix -> hyperbolic tangent transformation on all features of delta Z-Matrix.
sig Z-Matrix -> sigmoid transformation on all features of delta Z-Matrix.
distancematrix
(numpy.ndarray): stores the full matrix of bond distances for all geometries.connectivity
(list): indices of connected atoms based on a distance criterion of proximity.angleconnectivity
(list): indices of three neighboring atoms to compute angles.dihedralconnectivity
(list): four indices of neighboring atoms to calculate dihedrals.zmat_ref_geom
(numpy.ndarray): stores the Z-matrix calculated for the reference geometry.- angleconnectivity¶
- build_descriptor(all_geoms: numpy.ndarray, delta=False, apply_to_delta=None, save_csv=False)¶
Construct the standard Z-Matrix descriptor and other variants.
Note
By default, the algorithm will calculate the three main components of the Z-Matrix: bond distances, angles and dihedrals. An augmented version of the Z-Matrix descriptor can be also obtained by calculating additional distances, angles, dihedrals and/or bending angles using the methods provided in the class.
- Parameters
all_geoms (numpy.ndarray) – tensor of shape (nsamples, natoms, 3) containing the stacked XYZ coordinates read from all available MD trajectories.
delta (bool, optional) – if True, the Z_matrix feature vector of each geometry will be subtracted from the Z_Matrix of the reference geometry, defaults to False.
apply_to_delta (str, optional) – select a nonlinear function to apply as a transformation (sigmooid or hyperbolic tangent) on the delta Z-matrix, defaults to None.
save_csv (bool, optional) – if true save a single csv file named all_geoms_zmatrix.csv containing the Z-Matrix descriptors computed for all geometries available the MD trajectories., defaults to False.
- Returns
a dataframe object with the (flattened) Z-Matrix descriptors stacked for all MD geometries.
- Return type
- connectivity¶
- dihedralconnectivity¶
- distancematrix¶
- static get_angle(geom: numpy.ndarray, idx_atoms: list) float ¶
Calculate the angle formed by three selected atoms.
- Parameters
geom (numpy.ndarray) – matrix of shape (natoms, 3) storing the XYZ coordinates of a single molecule.
idx_atoms (list) – a list of three indices corresponding to the atoms for which the angle will be calculated.
- Returns
angle (in degrees) between three selected atoms.
- Return type
numpy.float
- static get_bending(geom: numpy.ndarray, idx_atoms: list) float ¶
Calculate the bending angle between two planes of the molecule.
This method is particularly useful to describe large out-of-plane distortions in the molecular structure that involves more than four atoms. The bending angle is calculated by first defining two vectors each one perpendicular to different molecular planes formed by two sets of three atoms. Then, the angle between the two vectors is obtained by calculating the inverse cosine of the scalar product between these vectors.
Note
By default, the bending angle is not used to construct the Z-Matrix descriptor. It can be used to construct an augmented version of the Z-Matrix that better captures changes in the molecular structure during the dynamics.
- Parameters
geom (numpy.ndarray) – matrix of shape (natoms, 3) storing the XYZ coordinates of a single molecule.
idx_atoms (list) – a list of lists with three atom indices in each, used to define two molecular planes for which the angle will be calculated.
- Returns
bending angle (in degrees) defined by six specified atoms.
- Return type
np.float
- static get_dihedral(geom: numpy.ndarray, idx_atoms: list) float ¶
Calculate the dihedral angle formed by four selected atoms.
- Parameters
geom (numpy.ndarray) – matrix of shape (natoms, 3) storing the XYZ coordinates of a single molecule.
idx_atoms (list) – a list of four indices to select the atoms for which the dihedral angle will be calculated.
- Returns
dihedral angle (in degrees) formed by four specified atoms.
- Return type
numpy.float
- static get_distance(geom: numpy.ndarray, idx_atoms: list) float ¶
Calculate the Euclidean distance between a pair of atoms.
- Parameters
geom (numpy.ndarray) – matrix of shape (natoms, 3) storing the XYZ coordinates of a single molecule.
idx_atoms (list) – a pair of indices corresponding to the atoms for which the distance will be calculated.
- Returns
Euclidean distance (in Angstrom) between two selected atoms.
- Return type
numpy.float
- zmat_ref_geom¶
ulamdyn.interface module¶
- ulamdyn.interface.build_descriptor(args, getcoords_obj)¶
- ulamdyn.interface.get_kinetic_energies(n_atoms=None)¶
- ulamdyn.interface.get_properties_data(rmsd_vec=None)¶
- ulamdyn.interface.run_bootstrap(args)¶
- ulamdyn.interface.run_clustering(args)¶
- ulamdyn.interface.run_dim_reduction(args)¶
- ulamdyn.interface.save_data(data_to_save)¶
- ulamdyn.interface.save_xyz_hoppings(states_pair)¶
ulamdyn.kinetics module¶
- class ulamdyn.kinetics.GetVelocities(n_atoms=None)¶
Bases:
object
Class used to collect velocities from all MD trajectories of Newton-X.
- classmethod from_all_trajs(n_atoms=None)¶
Wrapper function to collect the MD velocities without instantiating the class.
- Parameters
n_atoms (int, optional) – define the total number of atoms to consider in the velocities dataset, defaults to None. If None, the value will be taken from the geom.xyz file.
- Returns
the full velocities data set given as a tensor of shape (n_steps, n_atoms, 3).
- Return type
- from_dyn(outfile: str) numpy.ndarray ¶
Read velocities from a single trajectory of Newton-X.
Note
The final velocities dataset will contains information only for the first set of n_atoms, as defined by the class attribute
n_atoms
.- Parameters
outfile (str) – name of the NX output file containing the velocities, dyn.out.
- Returns
tensor of shape (n_steps, n_atoms, 3) with all velocities data.
- Return type
- from_h5(outfile: str) numpy.ndarray ¶
Read velocities from the .h5 file generated by the new Newton-X.
Note
The full velocities dataset will be sliced to select only the first set of n_atoms, as defined by the class attribute
n_atoms
.- Parameters
outfile (str) – name of the NX output file, usually dyn.h5.
- Returns
tensor of shape (n_steps, n_atoms, 3) with all velocities data.
- Return type
- n_atoms¶
- read_all_trajs()¶
Concatenate the XYZ velocities read from all available MD trajectories.
After running this method, the class attribute
veloc
will be updated with the full dataset of velocities read from the Newton-X output files.
- trajectories¶
- veloc¶
- class ulamdyn.kinetics.KineticEnergy(n_atoms=None)¶
Bases:
object
Class used to calculate the atomic or molecular components of the MD kinetic energy.
- atom_labels¶
- atom_mass¶
- build_dataframe(discretization_level='atom', n_mols_per_type=None, n_atoms_per_mol=None, save_csv=False)¶
Create a dataset containing the kinetic energies splitted by atoms or molecules.
- Parameters
discretization_level (str, optional) – how to split the kinetic energy, defaults to “atom”
n_mols_per_type (int, optional) – number of molecules per subsystem, typically one solute and several solvent molecules as in QM/MM, defaults to None.
n_atoms_per_mol (int, optional) – number of atoms in each molecular subsystem, defaults to None.
save_csv (bool, optional) – export the dataset with the computed kinetic energies as a csv file, defaults to False
- Returns
a structured dataset with the values of kinetic energies computed for all trajectories, where each row corresponds to one time step of a given trajectory and the columns refer to the discretization level (atoms or molecules).
- Return type
- calc_per_atom() numpy.ndarray ¶
Calculate the kinetic energy of each atom for all MD trajectories available.
- Returns
matrix with the kinetic energy of the selected atoms arranged in columns.
- Return type
- calc_per_molecule(n_mols_per_type: list, n_atoms_per_mol: list) numpy.ndarray ¶
Sum the atomic contributions of kinetic energy for each molecule.
Note
In the current version, this function only supports two different types of molecules in the geom file, where the first set of atoms in the geom file should always correspond to the solute. The function will be generalized in future versions to support any number of different molecules.
- Parameters
- Returns
a matrix of shape (n_samples, n_mols_1 + n_mols_2) with the kinetic energy of each molecule in the system arranged in columns.
- Return type
- energies¶
- n_atoms¶
- trajectories¶
- class ulamdyn.kinetics.VibrationalSpectra(n_atoms=None)¶
Bases:
ulamdyn.kinetics.GetVelocities
Calculate the vibrational density of states for each MD trajectory.
- build_dataframe(save_csv=False)¶
Create a dataset with information of vibrational spectra per MD trajectory.
- Parameters
save_csv (bool, optional) – if True, export the dataset with the computed spectra to a csv file, defaults to False.
- Returns
two-columns dataframe storing the vibrational frequencies (cm^-1) in the first columns and the density of states in the second one (a.u.)
- Return type
- calc_all_trajs(mass_weighted=True)¶
Calculate the vibrational spectra for each MD trajectory.
- static calc_pdos(Vel, dt, mass=None)¶
Calculate the power spectrum of the velocity auto-correlation function.
- Parameters
Vel (numpy.ndarray) – tensor of shape (n_steps, n_atoms, 3) with atomic velocities collected from a single MD trajectory.
dt (numpy.ndarray) – time step used to integrate the classical equations.
mass – if provided, rescale the velocities by the atomic mass of each specie; the default is None.
- Returns
two-columns array containing the calculated frequencies (cm^-1) in the first columns and the density of states in the second one (a.u.)
- Return type
- n_atoms¶
- trajectories¶
- veloc¶
ulamdyn.statistics module¶
Module to perform statistical analysis of the MD datasets.
- ulamdyn.statistics.aggregate_data(data, vars_to_group=['time'])¶
Calculate the basic statistical descriptors for a given dataset.
The statistical quantities calculated by the function are mean and median to describe the central tendency, and standard deviation to measure the variability or dispersion of the data. So each feature of the original input dataset will be unfolded into three new columns identified with the suffixes ‘_median’, ‘_mean’ and ‘_std’.
- Parameters
data (pandas.DataFrame) – input dataset containing the information extracted from all MD trajectories.
vars_to_group (list, optional) – set of variables used by the function to group the data, defaults to [“time”]
- Returns
[description]
- Return type
- ulamdyn.statistics.bootstrap(dataframe, n_samples=None, n_repeats=1000, save_csv=False)¶
Estimate the mean by resampling the data with replacement.
- Parameters
dataframe (pandas.DataFrame | modin.pandas.dataframe.DataFrame) – input dataset having the quantities extrated from all available MD trajectories; the dataset must contain the ‘TRAJ’ and ‘time’ columns
n_samples (int, optional) – number of trajectories considered in each round of the resampling, defaults to None
n_repeats (int, optional) – number of resampling to be performed, defaults to 1000
save_csv (bool, optional) – if true outputs the final bootstrapped data in a csv file, defaults to False
- Returns
dataset containing all the bootstrapped data with estimated mean
- Return type
- ulamdyn.statistics.calc_avg_occupations(df)¶
Calculate the fraction of trajectories in each state as a function of time.
Note
The fraction of trajectories (occupation) is an important quantity to assess the quality of the surface hopping (SH) simulations. It should be compared to the population of each state averaged over all trajectories, which is provided by the method
aggregate_data()
. If the ensemble of SH trajectories is statistically converged, the occupation and the average population should match.- Parameters
df (pandas.DataFrame | modin.pandas.dataframe.DataFrame) – properties dataset with information collected from the available MD trajectories
- Returns
a new dataset with the occupations calculated for each state
- Return type
- ulamdyn.statistics.create_bootstrap_stats(boot_data, ci_level=95)¶
Generate the descriptive statistics for the bootstrapped data within a given confidence interval.
- Parameters
boot_data (pandas.DataFrame | modin.pandas.dataframe.DataFrame) – bootstrapped dataframe generate by the
bootstrap()
functionci_level (int, optional) – Size of the confidence interval to draw when aggregating the data (by time) to estimate the statistical descriptors (median, mean, std), defaults to 95
- Returns
dataframe grouped by the simulation time with each feature of the original dataset unfolded into three columns representing the statistical descriptors and other two columns storing the values for the lowest and highest variability as defined by the confidence interval
- Return type
- ulamdyn.statistics.create_stats(selected_data, save_csv=False)¶
Create datasets with the statistical summary of the requested quantities.
- Parameters
- Returns
one or several datasets containing the median, mean and standard deviation calculated for each feature of the input as a function of time; if the attribute is equal to ‘all’, the statistics will be calculated for the properties and descriptors (R2 and Z-Matrix) datasets.
- Return type
dict(pandas.DataFrame) | dict(modin.pandas.dataframe.DataFrame)
- ulamdyn.statistics.stats_hopping(dataframe)¶
Generate a statistical summary for the hopping points.
- Parameters
dataframe (pandas.DataFrame | modin.pandas.dataframe.DataFrame) – properties dataset with information collected from the available MD trajectories
- Returns
a new dataset containing the total number of hops per trajectory and the minimum and maximum time in which the hops between each pair of electronic states occur.
- Return type
ulamdyn.unsup_models module¶
Base classes and methods used to perform unsupervised learning analysis.
- class ulamdyn.unsup_models.Clustering(data, n_samples=None, scaler=None, random_state=42, n_cpus=- 1, verbosity=0)¶
Bases:
ulamdyn.unsup_models.Utils
Class used to find groups of similar geometries in the MD trajectories data.
- hierarchical(n_clusters=5, affinity='cosine', connectivity=None, linkage='single', distance_threshold=None, save_model=True)¶
Perform a hierarchical cluster analysis based on the agglomerative.
- Parameters
n_clusters (int, optional) – The number of clusters to find. It must be None if distance_threshold is not None, defaults to 5.
affinity (str, optional) – Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”. If linkage is “ward”, only “euclidean” is accepted. If “precomputed”, a distance matrix (instead of a similarity matrix) is needed as input for the fit method, defaults to “cosine”.
connectivity (array-like or callable, optional) – Connectivity matrix. Defines for each sample the neighboring samples following a given structure of the data. This can be a connectivity matrix itself or a callable that transforms the data into a connectivity matrix, such as derived from kneighbors_graph. Default is None, i.e, the hierarchical clustering algorithm is unstructured.
linkage (str, optional) – Define the linkage criterion to build the tree. It determines which distance to use between sets of observation. The algorithm will merge the pairs of cluster that minimize this criterion. The options are : + ‘ward’ -> minimizes the variance of the clusters being merged. + ‘average’ -> uses the average of the distances of each observation of the two sets. + ‘complete’ or ‘maximum’ -> uses the maximum distances between all observations of the two sets. + ‘single’ -> uses the minimum of the distances between all observations of the two sets. The default is “single”.
distance_threshold (float, optional) – The linkage distance threshold above which, clusters will not be merged. If not None, n_clusters must be None and compute_full_tree must be True, defaults to None.
save_model (bool, optional) – Store the trained parameters of the model in a binary file, defaults to True.
- Returns
Dataframe of shape (n_samples,) with cluster labels for each data point.
- Return type
- kmeans(n_clusters=5, init='k-means++', n_init=500, max_iter=1000, convergence=1e-06, save_model=True)¶
Perform K-Means clustering.
- Parameters
n_clusters (int, list or str optional) – The number of clusters to form that corresponds also to the number of cluster centroids to generate, defaults to 5. If a list is passed, the k-means algorithm will be run for all n_clusters in the list, whereas if the argument is equal to ‘best’, consecutive runs will be performed with n_clusters varying in the range of [2, 15]. In both cases, the final results will be the best output labels with respect to the clustering performance on the silhouette and Calinski-Harabasz scores.
init (str or array, optional) – Method for initialization : + ‘k-means++’ -> selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. + ‘random’ -> choose n_clusters observations (rows) at random from data for the initial centroids. + If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers. Defaults to “k-means++”.
n_init (int, optional) – Number of time the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of loss function, defaults to 500.
max_iter (int, optional) – Maximum number of iterations of the k-means algorithm for a single run, defaults to 1000
convergence (float, optional) – Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence, defaults to 1e-06.
save_model (bool, optional) – Store the trained parameters of the model in a binary file, defaults to True.
- Returns
Dataframe of shape (n_samples,) with cluster labels for each data point.
- Return type
- spectral(n_clusters=5, n_components=10, n_init=100, affinity='rbf', gamma=0.01, n_neighbors=20, degree=3, coef0=1, kernel_params=None, save_model=True)¶
Apply clustering to a projection of the normalized Laplacian.
Note that Spectral Clustering is a highly expensive method due to the computation of the affinity matrix. Hence, this method is recommended only for small to medium size datasets (n_samples < 10000).
Note
This method is equivalent to kernel k-means (https://dl.acm.org/doi/10.1145/1014052.1014118). Spectral clustering is recommended for non-linearly separable dataset, where the individual clusters have a highly non-convex shape.
- Parameters
n_clusters (int, optional) –
The number of clusters to form which in this case corresponds to the dimension of the projection subspace. The default is 5.
If a list is passed, the k-means algorithm will be run for all n_clusters in the list, whereas if the argument is equal to ‘best’, consecutive runs will be performed with n_clusters varying in the range of [2, 15]. In both cases, the final results will be the best output labels with respect to the clustering performance on the silhouette and Calinski-Harabasz scores.
n_components (int, optional) – Number of eigenvectors to use for the spectral embedding, defaults to 10
n_init (int, optional) – Number of time the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia. Only used if assign_labels=’kmeans’. The default is 100.
affinity (str or callable, optional) – Method used to construct the affinity matrix. The available options are : + ‘nearest_neighbors’: construct the affinity matrix by computing a graph of nearest neighbors. + ‘rbf’: construct the affinity matrix using a radial basis function (RBF) kernel. + ‘precomputed_nearest_neighbors’: interpret X as a sparse graph of precomputed distances, and construct a binary affinity matrix from the n_neighbors nearest neighbors of each instance. + one of the kernels supported by pairwise_kernels. The default method is “rbf”.
gamma (float, optional) – Kernel coefficient for rbf, poly, sigmoid, laplacian and chi2 kernels. Ignored for affinity=’nearest_neighbors’. Defaults to 0.01.
n_neighbors (int, optional) – Number of neighbors to use when constructing the affinity matrix using the nearest neighbors method. Ignored for affinity=’rbf’, defaults to 20.
degree (int, optional) – Degree of the polynomial kernel. Ignored by other kernels. Defaults to 3.
coef0 (int, optional) – Zero coefficient for polynomial and sigmoid kernels. Ignored by other kernels. Defaults to 1.
kernel_params (dict or str, optional) – Parameters (keyword arguments) and values for kernel passed as callable object. Ignored by other kernels. Defaults to None.
save_model (bool, optional) – Store the trained parameters of the model in a binary file, defaults to True.
- Returns
Dataframe of shape (n_samples,) with cluster labels for each data point.
- Return type
- class ulamdyn.unsup_models.DimensionalityReduction(data, n_samples=None, scaler=None, random_state=42, n_cpus=- 1)¶
Bases:
ulamdyn.unsup_models.Utils
Class used to find a low dimensional representation of MD trajectories data.
- isomap(n_components=2, n_neighbors=10, neighbors_algorithm='auto', metric='cosine', p=2, metric_params=None, calc_error=False)¶
Perform a nonlinear dimensionality reduction through Isometric Mapping.
- Parameters
n_components (int, optional) – Number of coordinates (features) for the low-dimensional manifold, defaults to 2.
n_neighbors (int, optional) – Number of neighbors to consider around each point, defaults to 10.
neighbors_algorithm (str, optional) – Method used for nearest neighbors search, defaults to “auto”
metric (str or callable, optional) – The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. Defaults to “cosine”.
p (int, optional) – Parameter for the Minkowski metric from sklearn.metrics.pairwise pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. Defaults to 2.
metric_params (dict, optional) – Additional keyword arguments for the metric function. Defaults to None.
calc_error (bool, optional) – If True, the reconstruction error between the original and the projected data will be calculated, defaults to False.
- Returns
a new dataset with the transformed values where the coordinates of the low-dimensional manifold are stored in columns.
- Return type
- kpca(n_components=2, kernel='rbf', gamma=None, degree=4, coef0=1, kernel_params=None, alpha=1.0, fit_inverse_transform=False)¶
Perform a nonlinear dimensionality reduction using kernel PCA.
- Parameters
n_components (int, optional) – Number of components (features) to keep after KPCA transformation, defaults to 2.
kernel (str, optional) – Kernel function used in the transformation. The possible values are ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘cosine’ or precomputed’, defaults to “rbf”.
gamma (float, optional) – Kernel coefficient for rbf, poly and sigmoid kernels. Ignored by other kernels. If gamma is None, then it is set to 1/n_features. Defaults to None.
degree (int, optional) – Degree of polynomial kernel. Ignored by other kernels. Defaults to 4.
coef0 (int, optional) – Independent term in poly and sigmoid kernels. Ignored by other kernels. Defaults to 1.
kernel_params (dict, optional) – Parameters (keyword arguments) and values for kernel passed as callable object. Ignored by other kernels. Defaults to None.
alpha (float, optional) – Hyperparameter of the ridge regression that learns the inverse transform (when fit_inverse_transform=True), defaults to 1.0.
fit_inverse_transform (bool, optional) – Hyperparameter of the ridge regression that learns the inverse transform (when fit_inverse_transform=True), defaults to False.
- Returns
a new dataset with the transformed values where the selected components are stored in columns.
- Return type
- pca(n_components=2, calc_error=False, save_errors=False)¶
Perform a linear dimensionality reduction using principal component analysis.
Note
By default the percentage of variance explained by each of the selected components will be printed after the PCA analysis.
- Parameters
n_components (int, optional.) – Number of principal components to keep, defaults to 2.
calc_error (bool, optional) – If True, the reconstruction error between the original and the projected data will be calculated, defaults to False.
save_errors (bool, optional) – If True, save to a csv file the reconstruction error calculated for each sample, defaults to False.
- Returns
a new dataset with the transformed values where the selected components are stored in columns.
- Return type
- tsne(n_components=2, perplexity=40.0, learning_rate=200.0, n_iter=2000, n_iter_without_progress=400, metric='euclidean', init='pca', verbose=1, method='barnes_hut')¶
Perform the t-distributed Stochastic Neighbor Embedding analysis.
- Parameters
n_components (int, optional) – Number of coordinates (features) for the low-dimensional embbeding, defaults to 2.
perplexity (float, optional) – This hyperparameter is used to control the attention between local and global aspects of the data, in a certain sense, by guessing the number of close neighbors each point has. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. Different values can result in significantly different results. Defaults to 40.0.
learning_rate (float, optional) – The learning rate for t-SNE is usually in the range [10.0, 1000.0]. If the learning rate is too high, the data may look like a ‘ball’ with any point approximately equidistant from its nearest neighbours. If the learning rate is too low, most points may look compressed in a dense cloud with few outliers. If the cost function gets stuck in a bad local minimum increasing the learning rate may help. Defaults to 200.0
n_iter (int, optional) – Maximum number of iterations for the optimization. Should be at least 250. Defaults to 2000.
n_iter_without_progress (int, optional) – Maximum number of iterations without progress before we abort the optimization, used after 250 initial iterations with early exaggeration. Note that progress is only checked every 50 iterations so this value is rounded to the next multiple of 50. Defaults to 400.
metric (str or callable, optional) – The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by scipy. spatial.distance.pdist for its metric parameter, or a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. If metric is “precomputed”, X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them. The default is “euclidean” which is interpreted as squared euclidean distance. Defaults to “euclidean”.
init (str, optional) – Initialization of embedding. Possible options are ‘random’, ‘pca’, and a numpy array of shape (n_samples, n_components). PCA initialization cannot be used with precomputed distances and is usually more globally stable than random initialization. Defaults to “pca”.
verbose (int, optional) – Verbosity level. Defaults to 1
method (str, optional) – By default the gradient calculation algorithm uses Barnes-Hut approximation running in O(NlogN) time. method=’exact’ will run on the slower, but exact, algorithm in O(N^2) time. The exact algorithm should be used when nearest-neighbor errors need to be better than 3%. However, the exact method cannot scale to millions of examples. Defaults to “barnes_hut”.
- Returns
a new dataset with the transformed values where the coordinates of the low-dimensional manifold are stored in columns.
- Return type
ulamdyn.utilities module¶
Auxiliary functions and constants used by the main modules.
- ulamdyn.utilities.get_labels_masses(traj_dir, n_atoms=None)¶
- ulamdyn.utilities.get_num_atoms()¶
- ulamdyn.utilities.get_nx_version(traj_dir)¶
- ulamdyn.utilities.get_traj_dirs()¶
- ulamdyn.utilities.read_h5_nx(traj)¶
- ulamdyn.utilities.read_nx_control(traj_dir)¶
Module contents¶
ULaMDyn is a python package built on top of sklearn designed to perform data preprocessing, statistical and unsupervised learning analysis of (non-adiabatic) molecular dynamics simulations.
ULaMDyn consists of five general modules:
Data_Loader
Data_Writer
Statistics
Kinetics
Descriptors
Unsup_Models
- ulamdyn.export(func)¶