hsr package

Submodules

hsr.fingerprint module

hsr.fingerprint.compute_distances(molecule_data: ndarray, scaling=None)

Calculate the Euclidean distance between each point in molecule_data and scaled reference points.

This function computes the distances between each data point in a molecule and a set of reference points. The reference points are scaled either by a factor or by a matrix depending on the type of the ‘scaling’ parameter.

Parameters

molecule_data: np.ndarray

Data of the molecule with each row representing a point.

scaling: float or np.ndarray

The scaling applied to the reference points.

Returns

np.ndarray

A matrix of distances, where each element [i, j] is the distance between the i-th molecule data point and the j-th reference point.
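The layout of the returned matrix can be illustrated with a minimal, dependency-free sketch. The helper name distance_matrix and the sample points are hypothetical, and the library itself works on NumPy arrays rather than lists.

```python
import math

def distance_matrix(points, reference_points):
    """Entry [i][j] is the Euclidean distance from points[i] to reference_points[j]."""
    return [[math.dist(p, r) for r in reference_points] for p in points]

# Two data points and three reference points (origin plus one point per axis).
points = [(0.0, 0.0), (3.0, 4.0)]
references = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
distances = distance_matrix(points, references)
```

Each row corresponds to one molecule data point and each column to one reference point.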

hsr.fingerprint.compute_statistics(distances)

Calculate statistical moments (mean, standard deviation, skewness) for the given distances.

Parameters

distances: np.ndarray

Matrix with distances between each point and each reference point.

Returns

list

A list of computed statistics.
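As a rough illustration of the three moments, the sketch below computes them per reference point (one column of the distance matrix each). The function names, the use of the population standard deviation, the zero-spread convention for skewness, and the ordering of the output are all assumptions; the library may differ on each.

```python
import math

def moments(values):
    """Mean, population standard deviation, and skewness of a sequence."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    # Skewness is undefined for zero spread; report 0.0 in that case (assumption).
    skew = 0.0 if std == 0 else sum((v - mean) ** 3 for v in values) / (n * std ** 3)
    return mean, std, skew

def column_statistics(distances):
    """Concatenate the moments of each column (one column per reference point)."""
    stats = []
    for column in zip(*distances):
        stats.extend(moments(column))
    return stats

stats = column_statistics([[1.0, 2.0], [3.0, 2.0]])
```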

hsr.fingerprint.generate_fingerprint_from_data(molecule_data: array, scaling='matrix', chirality=False)

Generate a fingerprint directly from molecular data.

This function takes the data of a molecule, applies PCA transformation considering chirality if needed, and computes the fingerprint.

Parameters

molecule_data: np.ndarray

Data of the molecule, with each row representing a point.

scaling: str, float, or np.ndarray

Specifies the scaling applied to reference points. If set to ‘matrix’ (default), a scaling matrix is automatically computed based on the PCA-transformed data. If a float is provided, it’s used as a scaling factor. If a numpy.ndarray is provided, it’s used as a scaling matrix.

chirality: bool, optional

Consider chirality in PCA transformation if set to True.

Returns

list or tuple

Fingerprint of the molecule, and dimensionality if chirality is considered.

hsr.fingerprint.generate_fingerprint_from_molecule(molecule, features={'delta_neutrons': <function extract_neutron_difference_from_common_isotope>, 'formal_charges': <function extract_formal_charge>, 'protons': <function extract_proton_number>}, scaling='matrix', chirality=False, removeHs=False)

Generate a fingerprint from a molecular structure using specified features and scaling.

This function processes an RDKit molecule object to generate its fingerprint. It first converts the molecule into n-dimensional data based on the specified features, optionally removing hydrogen atoms if specified. A PCA transformation is then performed, with an option to consider chirality. The reference points for distance calculation are scaled as per the provided scaling parameter, and the fingerprint is computed based on these distances.

Parameters

molecule: RDKit Mol

RDKit molecule object.

features: dict, optional

Features to consider for molecule conversion. Default is DEFAULT_FEATURES.

scaling: str, float, or np.ndarray

Specifies the scaling applied to reference points. If ‘matrix’, a scaling matrix is computed and applied. If a float, it is used as a scaling factor. If a numpy.ndarray, it is directly used as the scaling matrix.

chirality: bool, optional

If True, chirality is considered in the PCA transformation, which can be important for distinguishing chiral molecules.

removeHs: bool, optional

If True, hydrogen atoms are removed from the molecule before conversion, focusing on heavier atoms.

Returns

list or tuple

Fingerprint of the molecule. If chirality is considered, also returns the dimensionality post-PCA transformation.

hsr.fingerprint.generate_fingerprint_from_transformed_data(molecule_data: ndarray, scaling)

Compute a fingerprint from transformed molecular data.

This function generates a molecular fingerprint based on distance statistics. It calculates distances between the transformed molecular data points and a set of reference points that are scaled using the provided scaling parameter. The fingerprint is derived from these distance measurements.

Parameters

molecule_data: np.ndarray

Transformed data of the molecule, each row representing a transformed point.

scaling: float or np.ndarray

The scaling applied to the reference points.

Returns

list

Fingerprint derived from the distance measurements to scaled reference points.

hsr.fingerprint.generate_reference_points(dimensionality, scaling=None)

Generate reference points in the n-dimensional space.

Parameters

dimensionality: int

The number of dimensions.

scaling: float or np.ndarray

The scaling applied to the reference points.

Returns

np.ndarray

An array of reference points including the centroid and the points on each axis.
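A minimal sketch of such a set of reference points follows, assuming the centroid sits at the origin (as it would for mean-centered data) and that a scalar scaling simply sets the distance of each axis point from the origin. The function name make_reference_points is hypothetical, and a scaling matrix is not handled here.

```python
def make_reference_points(dimensionality, scaling=1.0):
    """Centroid (origin) plus one point on each axis at distance `scaling`."""
    centroid = [0.0] * dimensionality
    axis_points = [
        [scaling if i == j else 0.0 for j in range(dimensionality)]
        for i in range(dimensionality)
    ]
    return [centroid] + axis_points

refs = make_reference_points(3, scaling=2.0)
```

For n dimensions this yields n + 1 points: the centroid followed by one point per axis.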

hsr.pca_transform module

hsr.pca_transform.adjust_eigenvector_signs(original_data, eigenvectors, chirality=False, tolerance=1e-10)

Adjust the sign of eigenvectors based on the data’s projections.

This function iterates through each eigenvector and determines its sign by examining the direction of the data’s maximum projection along that eigenvector. If the maximum projection is negative, the sign of the eigenvector is flipped. The function also handles special cases such as symmetric distributions of projections and can adjust eigenvectors based on chirality considerations.

Parameters

original_data: numpy.ndarray

N-dimensional array representing a molecule, where each row is a sample/point.

eigenvectors: numpy.ndarray

Eigenvectors obtained from the PCA decomposition.

chirality: bool, optional

If True, the function also considers the skewness of the projections to decide on flipping the eigenvector. This is necessary for distinguishing chiral molecules. Defaults to False.

tolerance: float, optional

Tolerance used when comparing projections. Defaults to 1e-10.

Returns

eigenvectors: numpy.ndarray

Adjusted eigenvectors with their sign possibly flipped.

sign_changes: int

The number of eigenvectors that had their signs changed.

best_eigenvector_to_flip: int

Index of the eigenvector with the highest skewness, relevant when chirality is considered. This is the eigenvector most likely to be flipped to preserve chirality.
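The basic sign rule (flip an eigenvector when the data's largest-magnitude projection onto it is negative) can be sketched as below. This is a simplified illustration only: the name orient_axes is hypothetical, axes are passed as a list of vectors rather than as matrix columns, and the tolerance handling, symmetric-projection cases, and chirality logic described above are deliberately left out.

```python
def orient_axes(data, axes):
    """Flip each axis whose largest-magnitude projection is negative.

    `data` is a list of points (rows); `axes` is a list of axis vectors.
    Returns the oriented axes and the number of sign flips.
    """
    oriented, flips = [], 0
    for axis in axes:
        # Project every point onto the current axis (dot product).
        projections = [sum(x * a for x, a in zip(point, axis)) for point in data]
        extreme = max(projections, key=abs)
        if extreme < 0:
            axis = [-a for a in axis]
            flips += 1
        oriented.append(list(axis))
    return oriented, flips

data = [(-3.0, 0.5), (1.0, -0.2), (2.0, -0.3)]
oriented, flips = orient_axes(data, [[1.0, 0.0], [0.0, 1.0]])
```

Here the first axis is flipped because its extreme projection is -3.0, while the second is kept because its extreme projection is 0.5.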

hsr.pca_transform.compute_pca_using_covariance(original_data, chirality=False, return_axes=False, print_steps=False)

Perform Principal Component Analysis (PCA) using eigendecomposition of the covariance matrix.

This function conducts PCA on a given dataset to produce a consistent reference system, facilitating comparison between different molecules. It emphasizes generating eigenvectors that provide deterministic outcomes and consistent orientations. The function also includes an option to handle chiral molecules by ensuring a positive determinant for the transformation matrix.

Parameters

original_data: numpy.ndarray

An N-dimensional array representing a molecule, where each row is a sample/point. The array should have a shape (n_samples, n_features), where n_samples is the number of samples and n_features is the number of features.

chirality: bool, optional

If set to True, the function ensures that the determinant of the transformation matrix is positive, allowing for the distinction of chiral molecules. Default is False.

return_axes: bool, optional

If True, returns the principal axes (eigenvectors) in addition to the transformed data. Default is False.

print_steps: bool, optional

If True, prints the steps of the PCA process: covariance matrix, eigenvalues, eigenvectors and transformed data. Default is False.

Returns

transformed_data: numpy.ndarray

The dataset after PCA transformation. This data is aligned to the principal components and is of the same shape as the original data.

dimensionality: int

The number of significant dimensions in the transformed data. Only returned if chirality is True.

eigenvectors: numpy.ndarray, optional

Only returned if return_axes is True. The principal axes of the transformation, represented as eigenvectors. Each column corresponds to an eigenvector.
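The positive-determinant condition used for chirality can be illustrated for the 3x3 case with the sketch below. The helper names det3 and make_right_handed are hypothetical, and which column gets flipped is simply a parameter here; how the library selects it is not shown in this reference beyond the best_eigenvector_to_flip value documented above.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def make_right_handed(axes, column_to_flip=0):
    """Return a copy of `axes` with one column negated if the determinant is negative."""
    if det3(axes) >= 0:
        return [list(row) for row in axes]
    return [
        [-value if j == column_to_flip else value for j, value in enumerate(row)]
        for row in axes
    ]

# A mirrored (left-handed) set of axes: determinant is -1.
mirrored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]]
fixed = make_right_handed(mirrored, column_to_flip=2)
```

Negating a single column changes the sign of the determinant, so one flip is enough to restore a positive value.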

hsr.pca_transform.extract_relevant_subspace(eigenvectors, significant_indices, tol=1e-10)

Extracts the subset of eigenvectors that’s relevant for the determinant calculation.

This function prunes eigenvectors by removing rows and columns that have all zeros except for a single entry close to 1 or -1 within a given tolerance (that is, eigenvectors with an eigenvalue equal to 0, together with their corresponding components). Then, it further reduces the matrix using the provided significant indices to give the relevant subset of eigenvectors.

Parameters

eigenvectors: numpy.ndarray

The eigenvectors matrix to prune and reduce.

significant_indices: numpy.ndarray

Indices of significant eigenvectors.

tol: float, optional (default = 1e-10)

Tolerance for determining whether a value is close to 0, 1, or -1.

Returns

numpy.ndarray

The determinant-relevant subset of eigenvectors.

hsr.pre_processing module

hsr.pre_processing.load_molecules_from_sdf(path, removeHs=False, sanitize=False)

Load a list of molecules from an SDF file.

Parameters

path: str

Path to the SDF file.

removeHs: bool, optional

Whether to remove hydrogens. Defaults to False.

sanitize: bool, optional

Whether to sanitize the molecules. Defaults to False.

Returns

list of rdkit.Chem.rdchem.Mol

A list of RDKit molecule objects.

hsr.pre_processing.molecule_to_ndarray(molecule, features={'delta_neutrons': <function extract_neutron_difference_from_common_isotope>, 'formal_charges': <function extract_formal_charge>, 'protons': <function extract_proton_number>}, removeHs=False)

Generate a numpy array representing the given molecule in N dimensions.

This function converts a molecule into an N-dimensional numpy array based on specified features. Each feature is computed using a function defined in the ‘features’ dictionary.

Parameters

molecule: rdkit.Chem.rdchem.Mol

The input RDKit molecule object.

features: dict[str, callable], optional

A dictionary where each key is a feature name (str) and the value is a callable function to compute that feature. The function takes an RDKit atom object as input and returns a feature value (a numeric type). Defaults to DEFAULT_FEATURES.

removeHs: bool, optional

If True, hydrogen atoms will not be included in the array representation. Defaults to False.

Returns

numpy.ndarray

Array with shape (number of atoms, 3 spatial coordinates + number of features), representing the molecule.
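The row layout (three coordinates followed by one value per feature) can be sketched without RDKit as follows. Everything here is a stand-in: the Atom dataclass, the FEATURES dictionary, and atoms_to_rows are hypothetical, and the package's default extractors may transform the raw values rather than return them unchanged.

```python
from dataclasses import dataclass

@dataclass
class Atom:
    """Hypothetical stand-in for an RDKit atom together with its 3D position."""
    xyz: tuple
    atomic_number: int
    formal_charge: int = 0

# Hypothetical feature extractors keyed by feature name.
FEATURES = {
    "protons": lambda atom: atom.atomic_number,
    "formal_charges": lambda atom: atom.formal_charge,
}

def atoms_to_rows(atoms, features=FEATURES, remove_hs=False):
    """One row per atom: x, y, z, then one value per feature."""
    rows = []
    for atom in atoms:
        if remove_hs and atom.atomic_number == 1:
            continue  # skip hydrogens when requested
        rows.append(list(atom.xyz) + [extract(atom) for extract in features.values()])
    return rows

water = [
    Atom((0.0, 0.0, 0.0), 8),
    Atom((0.96, 0.0, 0.0), 1),
    Atom((-0.24, 0.93, 0.0), 1),
]
all_rows = atoms_to_rows(water)
heavy_rows = atoms_to_rows(water, remove_hs=True)
```

With two features, each row has 3 + 2 = 5 entries; removing hydrogens leaves only the oxygen row.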

hsr.pre_processing.read_mol_from_file(path, removeHs=False, sanitize=False)

General reader for molecules from files.

Parameters

path: str

Path to the file.

removeHs: bool, optional

Whether to remove hydrogens. Defaults to False.

sanitize: bool, optional

Whether to sanitize the molecules. Defaults to False.

Returns

rdkit.Chem.rdchem.Mol

An RDKit molecule object.

hsr.similarity module

hsr.similarity.calculate_manhattan_distance(moments1: list, moments2: list)

Calculate the Manhattan distance between two lists.

Parameters

moments1: list

The first list of numerical values.

moments2: list

The second list of numerical values, must be of the same length as moments1.

Returns

float

The Manhattan distance between the two lists, i.e. the sum of the absolute differences of their elements.
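The Manhattan (L1) distance is conventionally the sum of element-wise absolute differences. The sketch below follows that definition; the function name and the length check are assumptions, and whether the library applies any additional normalization at this step is not confirmed here.

```python
def manhattan_distance(moments1, moments2):
    """Sum of absolute element-wise differences between two equal-length lists."""
    if len(moments1) != len(moments2):
        raise ValueError("both lists must have the same length")
    return sum(abs(a - b) for a, b in zip(moments1, moments2))

d = manhattan_distance([1.0, 2.0, 3.0], [2.0, 0.0, 3.0])
```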

hsr.similarity.calculate_similarity_from_distance(distance, n_components)

Calculate similarity score from a distance score.

This function converts a distance score into a similarity score using a reciprocal function. The distance is first normalized by the number of components of the fingerprint. The similarity score approaches 1 as the distance approaches 0, and it approaches 0 as the distance increases.

Parameters

distance: float

The distance score, a non-negative number.

n_components: int

The number of components in the fingerprint.

Returns

float

The similarity score derived from the distance.
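One reciprocal mapping with the stated limits (1 at zero distance, tending to 0 as the distance grows) is 1 / (1 + distance / n_components). The sketch below assumes that exact form, which is not spelled out in this reference; the function name is also hypothetical.

```python
def similarity_from_distance(distance, n_components):
    """Map a non-negative distance to (0, 1] after normalizing by fingerprint length."""
    return 1.0 / (1.0 + distance / n_components)

identical = similarity_from_distance(0.0, 12)
halfway = similarity_from_distance(12.0, 12)
```

Under this form, a distance equal to the number of components gives a similarity of 0.5.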

hsr.similarity.compute_distance(mol1, mol2, features={'delta_neutrons': <function extract_neutron_difference_from_common_isotope>, 'formal_charges': <function extract_formal_charge>, 'protons': <function extract_proton_number>}, scaling='matrix', removeHs=False, chirality=False)

Calculate the distance score between two molecules using their n-dimensional fingerprints.

This function generates fingerprints for two molecules based on their structures and a set of features, and then computes a distance score between these fingerprints.

Parameters

mol1: RDKit Mol

The first RDKit molecule object.

mol2: RDKit Mol

The second RDKit molecule object.

features: dict, optional

Dictionary of features to be considered. Default is DEFAULT_FEATURES.

scaling: str, float, or np.ndarray

Specifies the scaling applied to reference points. If set to ‘matrix’ (default), a scaling matrix is automatically computed based on the PCA-transformed data. If a float is provided, it’s used as a scaling factor. If a numpy.ndarray is provided, it’s used as a scaling matrix.

removeHs: bool, optional

If True, hydrogen atoms are removed from the molecule before generating the fingerprint.

chirality: bool, optional

Consider chirality in the generation of fingerprints if set to True.

Returns

float

The computed distance score between the two molecules.

hsr.similarity.compute_distance_from_ndarray(mol1_nd: array, mol2_nd: array, scaling='matrix', chirality=False)

Calculate the distance score between two molecules represented as N-dimensional arrays.

This function computes fingerprints for two molecules based on their N-dimensional array representations and then calculates a distance score between these fingerprints.

Parameters

mol1_nd: numpy.ndarray

The N-dimensional array representing the first molecule.

mol2_nd: numpy.ndarray

The N-dimensional array representing the second molecule.

scaling: str, float, or np.ndarray

Specifies the scaling applied to reference points. If set to ‘matrix’ (default), a scaling matrix is automatically computed based on the PCA-transformed data. If a float is provided, it’s used as a scaling factor. If a numpy.ndarray is provided, it’s used as a scaling matrix.

chirality: bool, optional

Consider chirality in the generation of fingerprints if set to True.

Returns

float

The computed distance score between the two molecules.

hsr.similarity.compute_similarity(mol1, mol2, features={'delta_neutrons': <function extract_neutron_difference_from_common_isotope>, 'formal_charges': <function extract_formal_charge>, 'protons': <function extract_proton_number>}, scaling='matrix', removeHs=False, chirality=False)

Calculate the similarity score between two molecules using their n-dimensional fingerprints.

This function generates fingerprints for two molecules based on their structures and a set of features, and then computes a similarity score between these fingerprints.

Parameters

mol1: RDKit Mol

The first RDKit molecule object.

mol2: RDKit Mol

The second RDKit molecule object.

features: dict, optional

Dictionary of features to be considered. Default is DEFAULT_FEATURES.

scaling: str, float, or np.ndarray

Specifies the scaling applied to reference points. If set to ‘matrix’ (default), a scaling matrix is automatically computed based on the PCA-transformed data. If a float is provided, it’s used as a scaling factor. If a numpy.ndarray is provided, it’s used as a scaling matrix.

removeHs: bool, optional

If True, hydrogen atoms are removed from the molecule before generating the fingerprint.

chirality: bool, optional

Consider chirality in the generation of fingerprints if set to True.

Returns

float

The computed similarity score between the two molecules.

hsr.similarity.compute_similarity_from_ndarray(mol1_nd: array, mol2_nd: array, scaling='matrix', chirality=False)

Calculate the similarity score between two molecules represented as N-dimensional arrays.

This function computes fingerprints for two molecules based on their N-dimensional array representations and then calculates a similarity score between these fingerprints.

Parameters

mol1_nd: numpy.ndarray

The N-dimensional array representing the first molecule.

mol2_nd: numpy.ndarray

The N-dimensional array representing the second molecule.

scaling: str, float, or np.ndarray

Specifies the scaling applied to reference points. If set to ‘matrix’ (default), a scaling matrix is automatically computed based on the PCA-transformed data. If a float is provided, it’s used as a scaling factor. If a numpy.ndarray is provided, it’s used as a scaling matrix.

chirality: bool, optional

Consider chirality in the generation of fingerprints if set to True.

Returns

float

The computed similarity score between the two molecules.

hsr.similarity.compute_similarity_score(fingerprint_1: list, fingerprint_2: list)

Calculate the similarity score between two fingerprints.

Parameters

fingerprint_1: list

The fingerprint of the first molecule.

fingerprint_2: list

The fingerprint of the second molecule.

Returns

float

The computed similarity score.

hsr.utils module

hsr.utils.compute_scaling_factor(molecule_data)

Computes the largest distance between the centroid and the molecule data points.

hsr.utils.compute_scaling_matrix(molecule_data)

Computes a diagonal scaling matrix with the maximum absolute values for each dimension of the molecule data as its diagonal entries.
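Both scaling quantities can be sketched in a few lines of plain Python. The function names below are hypothetical, the centroid is taken as the arithmetic mean of the points, and nested lists stand in for the NumPy arrays the library uses.

```python
import math

def scaling_factor(points):
    """Largest Euclidean distance between the centroid and any data point."""
    dims = len(points[0])
    centroid = [sum(p[k] for p in points) / len(points) for k in range(dims)]
    return max(math.dist(p, centroid) for p in points)

def scaling_matrix(points):
    """Diagonal matrix whose entries are the maximum absolute value per dimension."""
    dims = len(points[0])
    maxima = [max(abs(p[k]) for p in points) for k in range(dims)]
    return [[maxima[i] if i == j else 0.0 for j in range(dims)] for i in range(dims)]

points = [(-3.0, -4.0), (3.0, 4.0)]
factor = scaling_factor(points)
matrix = scaling_matrix(points)
```

The factor is a single radius for all axes, whereas the matrix stretches each axis independently.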

hsr.utils.extract_formal_charge(atom)
hsr.utils.extract_neutron_difference_from_common_isotope(atom)
hsr.utils.extract_proton_number(atom)
hsr.utils.formal_charge(atom)
hsr.utils.neutron_difference(atom)
hsr.utils.proton_number(atom)

Module contents