sihnpy.spatial_extent

Module Contents

Functions

gmm_estimation(data_to_estimate[, fix])

Function estimating a 1- and a 2-cluster solution Gaussian Mixture Model.

_gmm_avg_sd(gm_obj)

Quick function extracting and returning the average and SD values of the two components

gmm_measures(cleaned_data, gm_objects[, fix])

For all data kept after GMM estimation, this function computes the averages and SDs

gmm_probs(final_data, final_gm_estimations[, fix])

Function extracting the probability to be in the "second" component (high abnormal values).

_gmm_density_histogram(regional_data, ...[, dist_2])

Histogram of the value DENSITIES with overlaid density function for each GMM cluster.

_gmm_raw_histogram(regional_data, col)

Generates a simple histogram of the values in a given region.

gmm_histograms(final_data, gmm_measures, probs_df[, ...])

Optional function plotting histograms from the raw data, with overlayed density functions

gmm_threshold_deriv(final_data, probs_df, prob_threshs[, improb])

Function deriving the actual thresholds based on the probabilities of belonging to the "abnormal" distribution.

export_histograms(hist_dict_fig, output_path, name)

Exporting the histograms to file, if requested by user.

export_threshs(final_data, probs_data, thresh_df, ...)

Wrapper function exporting the final data used and the probability data to files.

apply_clean(data_to_apply, thresh_data[, index_name])

Function doing basic cleaning on the spatial extent and thresholds.

apply_masks(data_to_apply_clean, thresh_data_clean)

Function applying the thresholds to the data, resulting in binary masks.

apply_index(data_to_apply_clean, dict_masks)

Create the spatial extent index, which is the sum of regions that are above the threshold.

apply_ind_mask(data_to_apply_clean, dict_masks)

Another way to leverage the spatial extent is by creating individualized spatial extent masks.

export_spex_metrics(spex_metrics, output_path, name)

Function to export the spatial extent metrics.

export_spex_bin_masks(dict_masks, output_path, name)

Function to export the binary masks.

export_spex_ind_masks(spex_ind_masks, output_path, name)

Function to export the individualized spatial extent masks.

sihnpy.spatial_extent.gmm_estimation(data_to_estimate, fix=False)[source]

Function estimating a 1- and a 2-cluster solution Gaussian Mixture Model (GMM) for each column of the data. The Bayesian Information Criterion (BIC) of each model is output and compared between the two models.

Parameters
  • data_to_estimate (pandas.DataFrame) – Data where each column needs to be fed to the GMM. Any column where a GMM should NOT be estimated should be removed beforehand.

  • fix (bool, optional) – Whether sihnpy should remove regions where a 1-component model fits the data better than a 2-component model (i.e., has the smaller Bayesian Information Criterion), by default False

Returns

Returns a dictionary with the GMM objects from scikit-learn and a pandas.DataFrame where columns were removed if fix is applied.

Return type

dict, pandas.DataFrame
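
The comparison described above can be illustrated with scikit-learn directly. This is a minimal sketch on synthetic data, not sihnpy's own implementation: the region name `region_a`, the simulated values, and the `random_state` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

# Synthetic, clearly bimodal data for one hypothetical region.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(1.0, 0.1, 200), rng.normal(2.0, 0.2, 100)])
data = pd.DataFrame({"region_a": values})

# scikit-learn expects a 2-D array of shape (n_samples, n_features).
X = data[["region_a"]].to_numpy()

bic = {}
for n_components in (1, 2):
    gm = GaussianMixture(n_components=n_components, random_state=0).fit(X)
    bic[n_components] = gm.bic(X)

# A lower BIC indicates the preferred model.
two_component_preferred = bic[2] < bic[1]
print(bic, two_component_preferred)
```

With `fix=True`, a region for which `two_component_preferred` is false would be the kind of column that gets removed.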

sihnpy.spatial_extent._gmm_avg_sd(gm_obj)[source]

Quick function extracting and returning the average and SD values of the two components from the GMM estimation

Parameters

gm_obj (sklearn.mixture.GaussianMixture) – Takes a GMM object as input

Returns

Returns a dictionary for each GMM object, with the mean and SDs of each component.

Return type

dict

sihnpy.spatial_extent.gmm_measures(cleaned_data, gm_objects, fix=False)[source]

For all data kept after GMM estimation, this function computes the averages and SDs for both components. It then checks that the order of the clusters is right. The measures are also used for the histograms in the spex.gmm_histograms function.

Parameters
  • cleaned_data (pandas.DataFrame) – Dataframe output from spex.gmm_estimation.

  • gm_objects (dict) – Dictionary of the sklearn.mixture.GaussianMixture objects to extract measures from.

  • fix (bool, optional) – If the mean of component 2 is lower than the mean of component 1, it suggests that the components are inverted. If fix is True, we remove the region from further calculations, by default False

Returns

Returns a Dataframe with clean data (if columns were removed by the fix), one dictionary with cleaned sklearn.mixture.GaussianMixture objects (if some estimations were removed by the fix) and one dictionary with the averages/SDs of the two components, for regions kept.

Return type

pandas.DataFrame, dict, dict
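
The inversion check can be sketched in plain Python. The dictionary layout and key names (`mean_1`, `sd_1`, `mean_2`, `sd_2`) below are hypothetical and chosen only for illustration; the structure sihnpy actually returns may differ.

```python
# Hypothetical per-region summary; the key names are illustrative only.
measures = {
    "region_a": {"mean_1": 1.05, "sd_1": 0.08, "mean_2": 1.90, "sd_2": 0.30},
    "region_b": {"mean_1": 1.60, "sd_1": 0.25, "mean_2": 1.10, "sd_2": 0.07},
}

# A region is treated as "inverted" when component 2 has the lower mean.
inverted = [region for region, m in measures.items() if m["mean_2"] < m["mean_1"]]

# Mimics the effect described for fix=True: inverted regions are dropped.
kept = {region: m for region, m in measures.items() if region not in inverted}
print(inverted, list(kept))
```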

sihnpy.spatial_extent.gmm_probs(final_data, final_gm_estimations, fix=False)[source]

Function extracting the probability to be in the “second” component (high abnormal values).

Parameters
  • final_data (pandas.DataFrame) – Cleaned dataframe output by spex.gmm_measures

  • final_gm_estimations (dict) – Cleaned dictionary of sklearn.mixture.GaussianMixture objects output by spex.gmm_measures

  • fix (bool, optional) – If inverted distributions are not removed in spex.gmm_measures, they can be manually inverted here by setting to True, by default False

Returns

Dataframe with the same shape, index and columns as final_data. Contains probabilities of belonging to the “abnormal” distribution for each participant, for each region.

Return type

pandas.DataFrame
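
For a two-component mixture, the probability of belonging to the second component follows from Bayes' rule. The sketch below computes it by hand with NumPy; the helper names and the weights, means and SDs are made-up values for illustration, not sihnpy output.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Density of a normal distribution evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def prob_second_component(x, weights, means, sds):
    """Posterior probability that x belongs to component 2 (Bayes' rule)."""
    dens_1 = weights[0] * normal_pdf(x, means[0], sds[0])
    dens_2 = weights[1] * normal_pdf(x, means[1], sds[1])
    return dens_2 / (dens_1 + dens_2)

# Three example values: near the first mean, in between, near the second mean.
x = np.array([1.0, 1.5, 2.0])
probs = prob_second_component(
    x, weights=(0.7, 0.3), means=(1.0, 2.0), sds=(0.1, 0.3)
)
print(probs.round(3))
```

For a fitted sklearn.mixture.GaussianMixture, the predict_proba method returns the same kind of posterior probabilities, one column per component.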

sihnpy.spatial_extent._gmm_density_histogram(regional_data, regional_gmm_measures, col, dist_2=True)[source]

Histogram of the value DENSITIES with overlaid density function for each GMM cluster.

Density is the count of each bin, divided by the total number of counts and the bin width (ref: Matplotlib documentation). This option is necessary to see the density curves.

Parameters
  • regional_data (pandas.Series) – Single column from the final_data object representing the data in one region.

  • regional_gmm_measures (dict) – Dictionary containing the mean and SD of each component.

  • col (str) – String containing the name of the region. Used mostly for labels on the graphs.

  • dist_2 (bool, optional) – Whether we want to plot one or two density functions (True == two), by default True

Returns

Returns matplotlib figure

Return type

matplotlib.pyplot.figure
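
The density normalisation mentioned above can be checked numerically. This sketch uses NumPy's histogram on simulated values (the data and bin count are assumptions) and compares the manual formula with the `density=True` option.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(1.2, 0.2, 500)

counts, edges = np.histogram(values, bins=20)
widths = np.diff(edges)

# Density = count / (total count * bin width).
manual_density = counts / (counts.sum() * widths)

# density=True applies the same normalisation.
numpy_density, _ = np.histogram(values, bins=20, density=True)

same_result = np.allclose(manual_density, numpy_density)
# Bar heights times bin widths sum to one.
integrates_to_one = np.isclose((manual_density * widths).sum(), 1.0)
print(same_result, integrates_to_one)
```

Because the bars integrate to one, they sit on the same scale as a probability density curve, which is why the overlay requires this option.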

sihnpy.spatial_extent._gmm_raw_histogram(regional_data, col)[source]

Generates a simple histogram of the values in a given region. Can plot both the probabilities and the raw values, as needed.

Parameters
  • regional_data (pandas.Series) – Single column of data for a single region (data or probabilities)

  • col (str) – Name of the region of interest

Returns

Returns matplotlib figure

Return type

matplotlib.pyplot.figure

sihnpy.spatial_extent.gmm_histograms(final_data, gmm_measures, probs_df, dist_2=True, type='density')[source]

Optional function plotting histograms from the raw data, with overlaid density functions for both clusters.

Parameters
  • final_data (pandas.DataFrame) – Dataframe from spex.gmm_measures with final columns to plot.

  • gmm_measures (dict) – Nested dictionary containing the mean and SDs of each component, for each region.

  • probs_df (pandas.DataFrame) – Dataframe of the probabilities of belonging to the “abnormal” distribution, from the spex.gmm_probs function.

  • dist_2 (bool, optional) – Whether we want to plot one or two density functions (True == two) if we plot density, by default True

  • type (str, optional) – Type of histogram to plot (“density”, “raw”, “probs”, “all”), by default “density”.

Returns

Returns a dictionary of matplotlib figures.

Return type

dict

sihnpy.spatial_extent.gmm_threshold_deriv(final_data, probs_df, prob_threshs, improb=None)[source]

Function deriving the actual thresholds based on the probabilities of belonging to the “abnormal” distribution.

Depending on the probability thresholds used, the derived thresholds can be inverted (e.g., the 50% probability threshold may have a higher value than the 90% threshold). This usually happens when the second component is very spread out and overlaps with the first component. If that is the case, the use of the improb argument is recommended.

Also note that, to give more flexibility to the user, sihnpy accepts a list of probability thresholds so that multiple thresholds can be derived at once. However, sihnpy doesn’t check whether the order of the thresholds makes sense (e.g., that 50% comes before 90%) and assumes the user put them in the right order. It is up to the user to check this once the thresholds are derived.

Parameters
  • final_data (pandas.DataFrame) – Final data derived from spex.gmm_measures.

  • probs_df (pandas.DataFrame) – Dataframe containing the probabilities of belonging to the “abnormal” distribution, from the spex.gmm_probs function.

  • prob_threshs (list of float) – List of thresholds to apply to the data. Thresholds have to range between 0 and 1.

  • improb (float, optional) – Value below which an “abnormal” value is improbable or impossible. Useful when the second component of the GMM is very spread out, by default None

Returns

Dataframe where rows are the regions and columns are the thresholds derived from the probabilities.

Return type

pandas.DataFrame
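
One plausible way to turn probabilities into thresholds is sketched below. It is an assumption, not a description of sihnpy's internals: the rule used (smallest raw value whose probability reaches the cutoff, optionally ignoring values below `improb`), the function name `derive_thresholds`, and the `thresh_<p>` column names are all hypothetical.

```python
import pandas as pd

def derive_thresholds(data, probs, prob_threshs, improb=None):
    """Smallest raw value per region whose probability reaches each cutoff."""
    out = {}
    for p in prob_threshs:
        # Keep only values whose probability meets the cutoff.
        eligible = data.where(probs >= p)
        if improb is not None:
            # Ignore values considered too low to be abnormal.
            eligible = eligible.where(data >= improb)
        # Column-wise minimum; missing values are skipped.
        out[f"thresh_{p}"] = eligible.min()
    # Rows are regions, columns are thresholds.
    return pd.DataFrame(out)

data = pd.DataFrame(
    {"region_a": [1.0, 1.3, 1.6, 2.1], "region_b": [0.9, 1.2, 1.8, 2.4]}
)
probs = pd.DataFrame(
    {"region_a": [0.01, 0.55, 0.93, 0.99], "region_b": [0.02, 0.40, 0.91, 0.99]}
)
thresholds = derive_thresholds(data, probs, [0.5, 0.9])
print(thresholds)
```

Comparing the columns of the result row by row is a quick way to spot the inverted thresholds discussed above.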

sihnpy.spatial_extent.export_histograms(hist_dict_fig, output_path, name)[source]

Exporting the histograms to file, if requested by user. Will export ALL histograms saved to the dictionary.

Parameters
  • hist_dict_fig (dict) – Dictionary of histogram figures from spex.gmm_histograms

  • output_path (str) – String of the path to where the output should go

  • name (str) – Name that should be tacked at the end of the file name, depending on the user’s conventions.

sihnpy.spatial_extent.export_threshs(final_data, probs_data, thresh_df, output_path, name)[source]

Wrapper function exporting the final data used and the probability data to files.

Parameters
  • final_data (pandas.DataFrame) – Final data derived from spex.gmm_measures.

  • probs_data (pandas.DataFrame) – Dataframe containing the probabilities of belonging to the “abnormal” distribution, from the spex.gmm_probs function.

  • thresh_df (pandas.DataFrame) – Dataframe containing the thresholds we just derived

  • output_path (str) – String of the path to where the output should go

  • name (str) – Name that should be tacked at the end of the file name, depending on the user’s conventions.

sihnpy.spatial_extent.apply_clean(data_to_apply, thresh_data, index_name=None)[source]

Function doing basic cleaning on the spatial extent and thresholds; it sorts the rows and makes sure they match between the thresholds and the data to apply.

Parameters
  • data_to_apply (pandas.DataFrame) – Data on which we want to apply thresholds. Columns should match rows of thresh_data.

  • thresh_data (pandas.DataFrame) – Thresholds to be applied to the data. Rows should match columns of data_to_apply.

  • index_name (str, optional) – String indicating the name of the column that should be considered as the pandas.DataFrame.Index. If None, the index is assumed to be already set; by default None

Returns

Returns a pandas.DataFrame of the data, where the columns of the data share the same order as the rows of the thresholds.

Return type

pandas.DataFrame
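
The alignment this step relies on can be sketched with pandas. The helper `align`, the `ID` column and the region names are hypothetical, and returning both objects is an assumption made for the example.

```python
import pandas as pd

def align(data, thresholds, index_name=None):
    """Sort regions and check that data columns match threshold rows."""
    if index_name is not None:
        data = data.set_index(index_name)
    data = data[sorted(data.columns)]
    thresholds = thresholds.sort_index()
    if list(data.columns) != list(thresholds.index):
        raise ValueError("Regions differ between data and thresholds")
    return data, thresholds

data = pd.DataFrame(
    {"ID": ["s1", "s2"], "region_b": [1.2, 2.0], "region_a": [0.9, 1.7]}
)
thresholds = pd.DataFrame(
    {"thresh_0.5": [1.3, 1.5]}, index=["region_b", "region_a"]
)
data_clean, thresh_clean = align(data, thresholds, index_name="ID")
print(list(data_clean.columns), list(thresh_clean.index))
```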

sihnpy.spatial_extent.apply_masks(data_to_apply_clean, thresh_data_clean)[source]

Function applying the thresholds to the data, resulting in binary masks. The binary masks have the same shape as the original data (rows are participants, columns are regions). The number of masks depends on the number of thresholds (columns) in thresh_data_clean.

Parameters
  • data_to_apply_clean (pandas.DataFrame) – Data to which we want to apply the spatial extent, where columns are regions and rows are participants. From spex.apply_clean.

  • thresh_data_clean (pandas.DataFrame) – Dataframe containing the threshold data, where rows are regions and columns are thresholds. From spex.apply_clean

Returns

Returns a dictionary of pandas.DataFrame objects, where each DataFrame contains binary values for each region, for each participant.

Return type

dict
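
Binarising the data against per-region thresholds can be sketched as follows. The values and the `thresh_<p>` column names are invented, and treating "above the threshold" as greater-than-or-equal is an assumption of this example.

```python
import pandas as pd

data = pd.DataFrame(
    {"region_a": [0.9, 1.7, 1.4], "region_b": [1.2, 2.0, 1.9]},
    index=["s1", "s2", "s3"],
)
thresholds = pd.DataFrame(
    {"thresh_0.5": [1.3, 1.5], "thresh_0.9": [1.6, 1.95]},
    index=["region_a", "region_b"],
)

# One binary mask per threshold column; each column of `data` is compared
# with the threshold of the matching region.
masks = {
    name: data.ge(thresholds[name], axis="columns").astype(int)
    for name in thresholds.columns
}
print(masks["thresh_0.9"])
```

Each mask keeps the participants-by-regions shape of the input, so it can be summed or multiplied against the original data later on.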

sihnpy.spatial_extent.apply_index(data_to_apply_clean, dict_masks)[source]

Create the spatial extent index, which is the sum of regions that are above the threshold. In the case where multiple thresholds are available, we output the sum for each threshold individually, as well as the total sum of all thresholds together.

Parameters
  • data_to_apply_clean (pandas.DataFrame) – Original dataframe cleaned with spex.apply_clean. Only used to get the index to ensure the spatial extent is in the same order.

  • dict_masks (dict) – Dictionary containing all the binary masks from spex.apply_masks

Returns

Dataframe containing the spatial extent index for each threshold.

Return type

pandas.DataFrame
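
The index itself is a row-wise sum over each binary mask. The sketch below assumes two hand-written masks and a hypothetical `total` column for the sum across thresholds; the column names sihnpy uses may differ.

```python
import pandas as pd

index = ["s1", "s2", "s3"]
masks = {
    "thresh_0.5": pd.DataFrame(
        {"region_a": [0, 1, 1], "region_b": [0, 1, 1]}, index=index
    ),
    "thresh_0.9": pd.DataFrame(
        {"region_a": [0, 1, 0], "region_b": [0, 1, 0]}, index=index
    ),
}

# Count the regions above threshold for each participant, per threshold.
spex = pd.DataFrame({name: mask.sum(axis=1) for name, mask in masks.items()})

# Hypothetical column holding the sum over all thresholds.
spex["total"] = spex.sum(axis=1)
print(spex)
```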

sihnpy.spatial_extent.apply_ind_mask(data_to_apply_clean, dict_masks)[source]

Another way to leverage the spatial extent is by creating individualized spatial extent masks. The idea is to simply add weights to the original data, based on the probability of being abnormal in a given region.

For instance, if a participant has a 90% probability of being positive, vs a 50% probability of being positive, we give more weight to the 90% probability value by multiplying it by a different constant.

Parameters
  • data_to_apply_clean (pandas.DataFrame) – Original dataframe cleaned with spex.apply_clean.

  • dict_masks (dict) – Dictionary containing all the binary masks from spex.apply_masks

Returns

Dictionary of individualized spatial extent masks.

Return type

dict

sihnpy.spatial_extent.export_spex_metrics(spex_metrics, output_path, name)[source]

Function to export the spatial extent metrics.

Parameters
  • spex_metrics (pandas.DataFrame) – Dataframe containing the spatial extent indices.

  • output_path (str) – Path where the dataframe should be output.

  • name (str) – String that should be tacked at the end of the file name based on user convention.

sihnpy.spatial_extent.export_spex_bin_masks(dict_masks, output_path, name)[source]

Function to export the binary masks.

Parameters
  • dict_masks (dict) – Dictionary of binary masks where the thresholds were applied.

  • output_path (str) – Path where the dataframe should be output.

  • name (str) – String that should be tacked at the end of the file name based on user convention.

sihnpy.spatial_extent.export_spex_ind_masks(spex_ind_masks, output_path, name)[source]

Function to export the individualized spatial extent masks.

Parameters
  • spex_ind_masks (dict) – Dictionary of individualized spatial extent masks

  • output_path (str) – Path where the dataframe should be output.

  • name (str) – String that should be tacked at the end of the file name based on user convention.