hsr package
Submodules
hsr.fingerprint module
- hsr.fingerprint.compute_distances(molecule_data: ndarray, scaling=None)
Calculate the Euclidean distance between each point in molecule_data and scaled reference points.
This function computes the distances between each data point in a molecule and a set of reference points. The reference points are scaled either by a factor or by a matrix depending on the type of the ‘scaling’ parameter.
Parameters
- molecule_data: np.ndarray
Data of the molecule with each row representing a point.
- scaling: float, np.ndarray
The scaling applied to the reference points.
Returns
- np.ndarray
A matrix of distances, where each element [i, j] is the distance between the i-th molecule data point and the j-th reference point.
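The distance matrix described above can be sketched in a few lines. This is a minimal, standard-library illustration rather than the package's implementation: `euclidean_distance_matrix` is a hypothetical name, plain lists stand in for NumPy arrays, and the reference points are passed in already scaled.

```python
import math

def euclidean_distance_matrix(points, reference_points):
    """Return a nested list where result[i][j] is the Euclidean distance
    between points[i] and reference_points[j] (hypothetical helper)."""
    return [[math.dist(p, r) for r in reference_points] for p in points]

# Two data points and three reference points in 2D (toy values).
points = [(0.0, 0.0), (3.0, 4.0)]
reference_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
distances = euclidean_distance_matrix(points, reference_points)
```

Each row of `distances` belongs to one data point and each column to one reference point, matching the [i, j] convention in the Returns description.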
- hsr.fingerprint.compute_statistics(distances)
Calculate statistical moments (mean, standard deviation, skewness) for the given distances.
Parameters
- distances: np.ndarray
Matrix with distances between each point and each reference point.
Returns
- list
A list of computed statistics.
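As an illustration of the three moments named above, the sketch below computes the mean, standard deviation, and skewness of each column of a distance matrix. The details are assumptions, not taken from the package: the population standard deviation, skewness defined as m3 / m2^1.5, a skewness of 0 for constant columns, and the ordering of the output list. `column_moments` is a hypothetical name.

```python
import math

def column_moments(distances):
    """Return [mean, std, skew, mean, std, skew, ...], one triple per column.
    Assumes population moments; the library's exact conventions may differ."""
    stats = []
    for j in range(len(distances[0])):
        column = [row[j] for row in distances]
        n = len(column)
        mean = sum(column) / n
        m2 = sum((d - mean) ** 2 for d in column) / n  # second central moment
        m3 = sum((d - mean) ** 3 for d in column) / n  # third central moment
        std = math.sqrt(m2)
        skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0       # guard constant columns
        stats.extend([mean, std, skew])
    return stats

# Three data points, two reference points (toy values).
distances = [[1.0, 2.0], [2.0, 2.0], [6.0, 2.0]]
fingerprint = column_moments(distances)
```

The second column is constant, so its standard deviation and skewness are both zero under the guard above.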
- hsr.fingerprint.generate_fingerprint_from_data(molecule_data: array, scaling='matrix', chirality=False)
Generate a fingerprint directly from molecular data.
This function takes the data of a molecule, applies PCA transformation considering chirality if needed, and computes the fingerprint.
Parameters
- molecule_data: np.array
Data of the molecule, with each row representing a point.
- scaling: str, float, or np.ndarray
Specifies the scaling applied to reference points. If set to ‘matrix’ (default), a scaling matrix is automatically computed based on the PCA-transformed data. If a float is provided, it’s used as a scaling factor. If a numpy.ndarray is provided, it’s used as a scaling matrix.
- chirality: bool, optional
Consider chirality in PCA transformation if set to True.
Returns
- list or tuple
Fingerprint of the molecule, and dimensionality if chirality is considered.
- hsr.fingerprint.generate_fingerprint_from_molecule(molecule, features={'delta_neutrons': <function extract_neutron_difference_from_common_isotope>, 'formal_charges': <function extract_formal_charge>, 'protons': <function extract_proton_number>}, scaling='matrix', chirality=False, removeHs=False)
Generate a fingerprint from a molecular structure using specified features and scaling.
This function processes an RDKit molecule object to generate its fingerprint. It first converts the molecule into n-dimensional data based on the specified features, optionally removing hydrogen atoms if specified. A PCA transformation is then performed, with an option to consider chirality. The reference points for distance calculation are scaled as per the provided scaling parameter, and the fingerprint is computed based on these distances.
Parameters
- molecule: RDKit Mol
RDKit molecule object.
- features: dict, optional
Features to consider for molecule conversion. Default is DEFAULT_FEATURES.
- scaling: str, float, or np.ndarray
Specifies the scaling applied to reference points. If ‘matrix’, a scaling matrix is computed and applied. If a float, it is used as a scaling factor. If a numpy.ndarray, it is directly used as the scaling matrix.
- chirality: bool, optional
If True, chirality is considered in the PCA transformation, which can be important for distinguishing chiral molecules.
- removeHs: bool, optional
If True, hydrogen atoms are removed from the molecule before conversion, focusing on heavier atoms.
Returns
- list or tuple
Fingerprint of the molecule. If chirality is considered, also returns the dimensionality post-PCA transformation.
- hsr.fingerprint.generate_fingerprint_from_transformed_data(molecule_data: ndarray, scaling)
Compute a fingerprint from transformed molecular data.
This function generates a molecular fingerprint based on distance statistics. It calculates distances between the transformed molecular data points and a set of reference points that are scaled using the provided scaling parameter. The fingerprint is derived from these distance measurements.
Parameters
- molecule_data: np.ndarray
Transformed data of the molecule, each row representing a transformed point.
- scaling: float, np.ndarray
The scaling applied to the reference points.
Returns
- list
Fingerprint derived from the distance measurements to scaled reference points.
- hsr.fingerprint.generate_reference_points(dimensionality, scaling=None)
Generate reference points in the n-dimensional space.
Parameters
- dimensionality: int
The number of dimensions.
- scaling: float, np.ndarray
The scaling applied to the reference points.
Returns
- np.ndarray
An array of reference points including the centroid and the points on each axis.
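One way to picture "the centroid and the points on each axis" is sketched below. It rests on assumptions not confirmed by this reference: the centroid sits at the origin, there is one point at unit distance along each axis, a float scaling multiplies those axis points, and a matrix scaling is applied by right-multiplication. `reference_points` is a hypothetical name and plain lists stand in for NumPy arrays.

```python
def reference_points(dimensionality, scaling=None):
    """Return [centroid, axis_point_1, ..., axis_point_n] as nested lists.
    Assumed layout only; the library's construction may differ."""
    centroid = [0.0] * dimensionality
    axis_points = [
        [1.0 if i == j else 0.0 for j in range(dimensionality)]
        for i in range(dimensionality)
    ]
    if scaling is None:
        pass  # unscaled unit points
    elif isinstance(scaling, (int, float)):
        # Scalar case: stretch every axis point by the same factor.
        axis_points = [[scaling * x for x in p] for p in axis_points]
    else:
        # Matrix case: multiply each axis point (row vector) by the matrix.
        axis_points = [
            [sum(p[k] * scaling[k][j] for k in range(dimensionality))
             for j in range(dimensionality)]
            for p in axis_points
        ]
    return [centroid] + axis_points

unscaled = reference_points(3)
scaled_by_factor = reference_points(2, 2.5)
scaled_by_matrix = reference_points(2, [[2.0, 0.0], [0.0, 3.0]])
```

In this sketch an n-dimensional space yields n + 1 reference points.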
hsr.pca_transform module
- hsr.pca_transform.adjust_eigenvector_signs(original_data, eigenvectors, chirality=False, tolerance=1e-10)
Adjust the sign of eigenvectors based on the data’s projections.
This function iterates through each eigenvector and determines its sign by examining the direction of the data’s maximum projection along that eigenvector. If the maximum projection is negative, the sign of the eigenvector is flipped. The function also handles special cases such as symmetric distributions of projections and can adjust eigenvectors based on chirality considerations.
Parameters
- original_data: numpy.ndarray
N-dimensional array representing a molecule, where each row is a sample/point.
- eigenvectors: numpy.ndarray
Eigenvectors obtained from the PCA decomposition.
- chirality: bool, optional
If True, the function also considers the skewness of the projections to decide on flipping the eigenvector. This is necessary for distinguishing chiral molecules. Defaults to False.
- tolerance: float, optional
Tolerance used when comparing projections. Defaults to 1e-10.
Returns
- eigenvectors: numpy.ndarray
Adjusted eigenvectors with their sign possibly flipped.
- sign_changes: int
The number of eigenvectors that had their signs changed.
- best_eigenvector_to_flip: int
Index of the eigenvector with the highest skewness, relevant when chirality is considered. This is the eigenvector most likely to be flipped to preserve chirality.
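The basic sign rule described above (flip an eigenvector when the data's maximum projection along it is negative) can be sketched as follows. This simplified illustration reads "maximum projection" as the projection with the largest absolute value, and it leaves out the symmetric-distribution and chirality handling. `orient_axes` is a hypothetical name, and eigenvectors are passed as a list of direction vectors rather than as matrix columns.

```python
def orient_axes(data, axes):
    """Flip each axis whose largest-magnitude projection is negative.
    Returns (oriented_axes, sign_changes). Simplified sketch only."""
    oriented, sign_changes = [], 0
    for axis in axes:
        # Project every data point onto this axis (dot product).
        projections = [sum(x * a for x, a in zip(point, axis)) for point in data]
        extreme = max(projections, key=abs)
        if extreme < 0:
            axis = [-a for a in axis]
            sign_changes += 1
        oriented.append(list(axis))
    return oriented, sign_changes

# Toy 2D data whose most extreme coordinate along the first axis is negative.
data = [(-3.0, 0.5), (1.0, -0.2), (2.0, -0.3)]
axes = [(1.0, 0.0), (0.0, 1.0)]
oriented, sign_changes = orient_axes(data, axes)
```

Here only the first axis is flipped, because its extreme projection is -3.0 while the second axis has an extreme projection of 0.5.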
- hsr.pca_transform.compute_pca_using_covariance(original_data, chirality=False, return_axes=False, print_steps=False)
Perform Principal Component Analysis (PCA) using eigendecomposition of the covariance matrix.
This function conducts PCA on a given dataset to produce a consistent reference system, facilitating comparison between different molecules. It emphasizes generating eigenvectors that provide deterministic outcomes and consistent orientations. The function also includes an option to handle chiral molecules by ensuring a positive determinant for the transformation matrix.
Parameters
- original_data: numpy.ndarray
An N-dimensional array representing a molecule, where each row is a sample/point. The array should have a shape (n_samples, n_features), where n_samples is the number of samples and n_features is the number of features.
- chirality: bool, optional
If set to True, the function ensures that the determinant of the transformation matrix is positive, allowing for the distinction of chiral molecules. Default is False.
- return_axes: bool, optional
If True, returns the principal axes (eigenvectors) in addition to the transformed data. Default is False.
- print_steps: bool, optional
If True, prints the steps of the PCA process: covariance matrix, eigenvalues, eigenvectors and transformed data. Default is False.
Returns
- transformed_data: numpy.ndarray
The dataset after PCA transformation. This data is aligned to the principal components and is of the same shape as the original data.
- dimensionality: int
The number of significant dimensions in the transformed data. Only returned if chirality is True.
- eigenvectors: numpy.ndarray, optional
Only returned if return_axes is True. The principal axes of the transformation, represented as eigenvectors. Each column corresponds to an eigenvector.
- hsr.pca_transform.extract_relevant_subspace(eigenvectors, significant_indices, tol=1e-10)
Extracts the subset of eigenvectors that’s relevant for the determinant calculation.
This function prunes the eigenvector matrix by removing rows and columns that are all zeros except for a single entry close to 1 or -1 within the given tolerance (that is, eigenvectors whose eigenvalue is 0, together with their corresponding components). It then reduces the matrix further using the provided significant indices, yielding the subset of eigenvectors relevant to the determinant.
Parameters
- eigenvectors: numpy.ndarray
The eigenvectors matrix to prune and reduce.
- significant_indices: numpy.ndarray
Indices of significant eigenvectors.
- tol: float, optional (default = 1e-10)
Tolerance for determining whether a value is close to 0, 1, or -1.
Returns
- numpy.ndarray
The determinant-relevant subset of eigenvectors.
hsr.pre_processing module
- hsr.pre_processing.load_molecules_from_sdf(path, removeHs=False, sanitize=False)
Load a list of molecules from an SDF file.
Parameters
- path: str
Path to the SDF file.
- removeHs: bool, optional
Whether to remove hydrogens. Defaults to False.
- sanitize: bool, optional
Whether to sanitize the molecules. Defaults to False.
Returns
- list of rdkit.Chem.rdchem.Mol
A list of RDKit molecule objects.
- hsr.pre_processing.molecule_to_ndarray(molecule, features={'delta_neutrons': <function extract_neutron_difference_from_common_isotope>, 'formal_charges': <function extract_formal_charge>, 'protons': <function extract_proton_number>}, removeHs=False)
Generate a numpy array representing the given molecule in N dimensions.
This function converts a molecule into an N-dimensional numpy array based on specified features. Each feature is computed using a function defined in the ‘features’ dictionary.
Parameters
- molecule: rdkit.Chem.rdchem.Mol
The input RDKit molecule object.
- features: dict[str, callable], optional
A dictionary where each key is a feature name (str) and the value is a callable function to compute that feature. The function takes an RDKit atom object as input and returns a feature value (a numeric type). Defaults to DEFAULT_FEATURES.
- removeHs: bool, optional
If True, hydrogen atoms will not be included in the array representation. Defaults to False.
Returns
- numpy.ndarray
Array with shape (number of atoms, 3 spatial coordinates + number of features), representing the molecule.
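The features dictionary pattern (feature name mapped to a callable that takes an atom and returns a number) can be illustrated without RDKit. Everything in this sketch is a stand-in: `Atom` is a hypothetical dataclass replacing an RDKit atom and its 3D position, `atoms_to_rows` is a hypothetical name, the coordinates are toy values, and hydrogens are identified simply as atoms with one proton.

```python
from dataclasses import dataclass

@dataclass
class Atom:
    """Hypothetical stand-in for an RDKit atom together with its coordinates."""
    x: float
    y: float
    z: float
    protons: int
    formal_charge: int

# Feature name -> callable that maps an atom to a numeric value.
features = {
    "protons": lambda atom: atom.protons,
    "formal_charges": lambda atom: atom.formal_charge,
}

def atoms_to_rows(atoms, features, remove_hs=False):
    """Build one row per atom: 3 spatial coordinates followed by the features."""
    rows = []
    for atom in atoms:
        if remove_hs and atom.protons == 1:
            continue  # skip hydrogens when requested
        rows.append([atom.x, atom.y, atom.z]
                    + [float(f(atom)) for f in features.values()])
    return rows

# A water-like toy molecule: one oxygen and two hydrogens.
atoms = [Atom(0.0, 0.0, 0.0, 8, 0),
         Atom(0.96, 0.0, 0.0, 1, 0),
         Atom(-0.24, 0.93, 0.0, 1, 0)]
rows = atoms_to_rows(atoms, features)
heavy_rows = atoms_to_rows(atoms, features, remove_hs=True)
```

With two features, each row has 3 + 2 = 5 columns, in line with the shape given in the Returns description.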
- hsr.pre_processing.read_mol_from_file(path, removeHs=False, sanitize=False)
General reader for molecules from files.
Parameters
- path: str
Path to the file.
- removeHs: bool, optional
Whether to remove hydrogens. Defaults to False.
- sanitize: bool, optional
Whether to sanitize the molecules. Defaults to False.
Returns
- rdkit.Chem.rdchem.Mol
An RDKit molecule object.
hsr.similarity module
- hsr.similarity.calculate_manhattan_distance(moments1: list, moments2: list)
Calculate the Manhattan distance between two lists.
Parameters
- moments1: list
The first list of numerical values.
- moments2: list
The second list of numerical values, must be of the same length as moments1.
Returns
- float
The Manhattan distance (sum of absolute differences) between the two lists.
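For reference, the textbook Manhattan (L1) distance is the sum of absolute element-wise differences. The sketch below implements that definition only; whether the package applies any further averaging at this step is not confirmed here, and `manhattan_distance` is a hypothetical name.

```python
def manhattan_distance(values1, values2):
    """Sum of absolute element-wise differences (textbook L1 distance)."""
    if len(values1) != len(values2):
        raise ValueError("both lists must have the same length")
    return sum(abs(a - b) for a, b in zip(values1, values2))

# |1 - 2| + |2 - 0| + |3 - 3.5| = 1 + 2 + 0.5
distance = manhattan_distance([1.0, 2.0, 3.0], [2.0, 0.0, 3.5])
```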
- hsr.similarity.calculate_similarity_from_distance(distance, n_components)
Calculate similarity score from a distance score.
This function converts a distance score into a similarity score using a reciprocal function. The distance is first normalized by the number of components of the fingerprint. The similarity score approaches 1 as the distance score approaches 0, and it approaches 0 as the distance score increases.
Parameters
- distance: float
The distance score, a non-negative number.
- n_components: int
The number of components in the fingerprint.
Returns
- float
The similarity score derived from the distance.
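One reciprocal form that has the properties described above is 1 / (1 + distance / n_components). The sketch below uses that form as an assumption. The exact expression used by the package is not stated in this reference, and `similarity_from_distance` is a hypothetical name.

```python
def similarity_from_distance(distance, n_components):
    """Map a non-negative distance to a score in (0, 1].
    Assumed form: 1 / (1 + distance / n_components)."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    normalized = distance / n_components  # normalize by fingerprint length
    return 1.0 / (1.0 + normalized)

identical = similarity_from_distance(0.0, 12)   # no difference at all
halfway = similarity_from_distance(12.0, 12)    # normalized distance of 1
far_apart = similarity_from_distance(1e9, 12)   # very large distance
```

Under this form, identical fingerprints score 1, a normalized distance of 1 scores 0.5, and the score decays toward 0 as the distance grows.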
- hsr.similarity.compute_distance(mol1, mol2, features={'delta_neutrons': <function extract_neutron_difference_from_common_isotope>, 'formal_charges': <function extract_formal_charge>, 'protons': <function extract_proton_number>}, scaling='matrix', removeHs=False, chirality=False)
Calculate the distance score between two molecules using their n-dimensional fingerprints.
This function generates fingerprints for two molecules based on their structures and a set of features, and then computes a distance score between these fingerprints.
Parameters
- mol1: RDKit Mol
The first RDKit molecule object.
- mol2: RDKit Mol
The second RDKit molecule object.
- features: dict, optional
Dictionary of features to be considered. Default is DEFAULT_FEATURES.
- scaling: str, float, or np.ndarray
Specifies the scaling applied to reference points. If set to ‘matrix’ (default), a scaling matrix is automatically computed based on the PCA-transformed data. If a float is provided, it’s used as a scaling factor. If a numpy.ndarray is provided, it’s used as a scaling matrix.
- removeHs: bool, optional
If True, hydrogen atoms are removed from the molecule before generating the fingerprint.
- chirality: bool, optional
Consider chirality in the generation of fingerprints if set to True.
Returns
- float
The computed distance score between the two molecules.
- hsr.similarity.compute_distance_from_ndarray(mol1_nd: array, mol2_nd: array, scaling='matrix', chirality=False)
Calculate the distance score between two molecules represented as N-dimensional arrays.
This function computes fingerprints for two molecules based on their N-dimensional array representations and then calculates a distance score between these fingerprints.
Parameters
- mol1_nd: numpy.ndarray
The N-dimensional array representing the first molecule.
- mol2_nd: numpy.ndarray
The N-dimensional array representing the second molecule.
- scaling: str, float, or np.ndarray
Specifies the scaling applied to reference points. If set to ‘matrix’ (default), a scaling matrix is automatically computed based on the PCA-transformed data. If a float is provided, it’s used as a scaling factor. If a numpy.ndarray is provided, it’s used as a scaling matrix.
- chirality: bool, optional
Consider chirality in the generation of fingerprints if set to True.
Returns
- float
The computed distance score between the two molecules.
- hsr.similarity.compute_similarity(mol1, mol2, features={'delta_neutrons': <function extract_neutron_difference_from_common_isotope>, 'formal_charges': <function extract_formal_charge>, 'protons': <function extract_proton_number>}, scaling='matrix', removeHs=False, chirality=False)
Calculate the similarity score between two molecules using their n-dimensional fingerprints.
This function generates fingerprints for two molecules based on their structures and a set of features, and then computes a similarity score between these fingerprints.
Parameters
- mol1: RDKit Mol
The first RDKit molecule object.
- mol2: RDKit Mol
The second RDKit molecule object.
- features: dict, optional
Dictionary of features to be considered. Default is DEFAULT_FEATURES.
- scaling: str, float, or np.ndarray
Specifies the scaling applied to reference points. If set to ‘matrix’ (default), a scaling matrix is automatically computed based on the PCA-transformed data. If a float is provided, it’s used as a scaling factor. If a numpy.ndarray is provided, it’s used as a scaling matrix.
- removeHs: bool, optional
If True, hydrogen atoms are removed from the molecule before generating the fingerprint.
- chirality: bool, optional
Consider chirality in the generation of fingerprints if set to True.
Returns
- float
The computed similarity score between the two molecules.
- hsr.similarity.compute_similarity_from_ndarray(mol1_nd: array, mol2_nd: array, scaling='matrix', chirality=False)
Calculate the similarity score between two molecules represented as N-dimensional arrays.
This function computes fingerprints for two molecules based on their N-dimensional array representations and then calculates a similarity score between these fingerprints.
Parameters
- mol1_nd: numpy.ndarray
The N-dimensional array representing the first molecule.
- mol2_nd: numpy.ndarray
The N-dimensional array representing the second molecule.
- scaling: str, float, or np.ndarray
Specifies the scaling applied to reference points. If set to ‘matrix’ (default), a scaling matrix is automatically computed based on the PCA-transformed data. If a float is provided, it’s used as a scaling factor. If a numpy.ndarray is provided, it’s used as a scaling matrix.
- chirality: bool, optional
Consider chirality in the generation of fingerprints if set to True.
Returns
- float
The computed similarity score between the two molecules.
- hsr.similarity.compute_similarity_score(fingerprint_1: list, fingerprint_2: list)
Calculate the similarity score between two fingerprints.
Parameters
- fingerprint_1: list
The fingerprint of the first molecule.
- fingerprint_2: list
The fingerprint of the second molecule.
Returns
- float
The computed similarity score.
hsr.utils module
- hsr.utils.compute_scaling_factor(molecule_data)
Computes the largest distance between the centroid and the molecule data points.
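The quantity described above can be sketched with the standard library alone. This is a minimal illustration, assuming the centroid is the per-coordinate mean of the points and the distance is Euclidean. `largest_centroid_distance` is a hypothetical name and plain tuples stand in for a NumPy array.

```python
import math

def largest_centroid_distance(points):
    """Return the largest Euclidean distance from the centroid to any point."""
    n = len(points)
    # Centroid: mean of each coordinate across all points.
    centroid = [sum(coords) / n for coords in zip(*points)]
    return max(math.dist(p, centroid) for p in points)

# Toy 2D data; the centroid is (2, 1).
points = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
factor = largest_centroid_distance(points)
```

For these points the two base corners lie farthest from the centroid, at a distance of sqrt(5).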