Additional topics

Want to learn more about the literature on deriving thresholds on PET scan data in Alzheimer’s disease? Want to know about how to derive your own simulated data to test the spatial extent module with different distributions? You’ve come to the right place!

Creating Gaussian simulated data

Inspiration from realistic data

As you saw in the tutorial of the spex module, I generated random data to be used in the module. The goal was to mimic as close as possible what one could reasonable expect to see in a dataset with tau-PET data, but at the same time making sure sihnpy could derive thresholds relatively easily AND that I could showcase potential issues with the data.

As I describe briefly in my short introduction to GMM, tau-PET data in the Alzheimer’s disease spectrum can usually be described as a bimodal distribution, where there is a large group of individuals with low PET SUVR and a small group with high PET SUVR. As the disease progress from normal cognition to diagnosed Alzheimer’s disease, individuals will usually progress from low to high. So we know that to generate random data, we will need to generate data from a low and a high distribution.

In Python, we can do this relatively easily if we have the mean and standard deviation for both the low and high distribution. So how do we choose that? At the time of developping sihnpy, and in the paper in which we use this methodology,1 we were working with data from the Alzheimer’s disease neuroimaging initiative (ADNI), one of the largest repository of open data on Alzheimer’s disease, including a lot of amyloid- and tau-PET scans. In this dataset, we have participants that are at very low risk of progressing to Alzheimer’s disease (participants who are cognitively unimpaired, with no discernable amounts of amyloid pathology) and participants who have already progressed to Alzheimer’s disease (participants meeting the diagnostic criteria and with a lot of amyloid pathology). In sihnpy, the simulated data was created by slightly modifying the mean and standard deviation of these two groups of participants from ADNI. This information is actually available to all using sihnpy when using the datasets module.

from sihnpy import datasets
tau_data, regional_thresholds, regional_averages = datasets.pad_spex_input()
regional_averages

	mean_negative	sd_negative	mean_positive	sd_positive
region
CTX_LH_ENTORHINAL_SUVR	1.119	0.110	1.578	0.321
CTX_RH_ENTORHINAL_SUVR	1.122	0.099	1.582	0.300
CTX_LH_AMYGDALA_SUVR	1.185	0.112	1.642	0.299
CTX_RH_AMYGDALA_SUVR	1.187	0.107	1.632	0.272
CTX_LH_FUSIFORM_SUVR	1.185	0.116	1.690	0.506
CTX_RH_FUSIFORM_SUVR	1.175	0.078	1.665	0.485
CTX_LH_PARAHIPPOCAMPAL_SUVR	1.097	0.095	1.450	0.292
CTX_RH_PARAHIPPOCAMPAL_SUVR	1.091	0.079	1.442	0.313
CTX_LH_INFERIORTEMPORAL_SUVR	1.199	0.132	1.799	0.548
CTX_RH_INFERIORTEMPORAL_SUVR	1.190	0.078	1.774	0.554
CTX_LH_MIDDLETEMPORAL_SUVR	1.161	0.130	1.671	0.523
CTX_RH_MIDDLETEMPORAL_SUVR	1.162	0.077	1.674	0.516
CTX_LH_PRECENTRAL_SUVR	0.997	0.070	1.139	0.264
CTX_RH_PRECENTRAL_SUVR	0.995	0.073	1.133	0.256
CTX_LH_POSTCENTRAL_SUVR	0.972	0.074	1.084	0.252
CTX_RH_POSTCENTRAL_SUVR	0.971	0.074	1.091	0.286

These numbers are what was used to generate sihnpy data, with a few exceptions. The goal was simply to get “realistic” data.

Generating randomized data

From there, we have another choice to make. In the PREVENT-AD data, we have 308 participants in the Open Dataset. We need to choose how many are going to be attributed a “normal” and how many are going to be attributed an “abnormal” value. This is really arbitrary, so feel free to modify the numbers as needed. I created a simple function that generates the randomized data:

import numpy as np
from scipy import stats

def gen_random_population(mean1, sd1, size1, mean2, sd2, size2):
    """Generates random data mimicking tau-PET SUVR data, of the size of the PREVENT-AD Open
    dataset, by pulling 308 sample data points from 2 randomly generated populations.

    Parameters
    ----------
    mean1 : float
        Mean of the first distribution
    sd1 : float
        Standard deviation of the first distribution
    size1 : int
        Size of the first population
    mean2 : float
        Mean of the second distribution
    sd2 : float
        Standard deviation of the second distribution
    size2 : int
        Size of the second population

    Returns
    -------
    numpy.array
        Numpy array containing the 308 random data points; 100 AD, 208 CU
    """

    #Create a random population based on CU and on AD. Concatenate
    pop_sim_neg_data = np.random.choice(stats.norm.rvs(mean1, sd1, size1, random_state=667), 208)
    pop_sim_pos_data = np.random.choice(stats.norm.rvs(mean2, sd2, size2, random_state=667), 100)

    return np.concatenate((pop_sim_neg_data, pop_sim_pos_data))

Let’s breakdown how this function is working. The core of the function is scipy’s stats.norm.rvs function, which generates random samples of participants based on a given mean, standard deviation and size. I forced the random_state to be constant here, so the generated data would remain the same throughout testing, but it is up to you to do the same or not.

Then, once the random samples are generated, we use numpy’s random.choice to randomly choose data points of the dimension we want (so for us, we will arbitrarily say there are 100 participants with high SUVR values and 208 participants with low SUVR values). This size choice is a bit different then the one inside of stats.norm.rvs. In stats.norm.rvs, the size is how many random samples it should create, while the size in np.random.choice is how many samples numpy should keep.

From the tests I did before generating the final data, it is usually better to give a very large number of samples to stats.norm.rvs (e.g., 10,000) so that it gives more variety and stability to the final values numpy chooses.

Once you have that function, you can simply generate a pandas.DataFrame or a dict containing the random data. Here is the code I used for the data in sihnpy

import pandas as pd

dict_random_tau_data = {}

dict_random_tau_data["CTX_LH_ENTORHINAL_SUVR"] = gen_random_population(mean1=1.119, sd1=0.110, size1=10000, mean2=1.578, sd2=0.321, size2=10000)
dict_random_tau_data["CTX_RH_ENTORHINAL_SUVR"] = gen_random_population(mean1=1.122, sd1=0.099, size1=10000, mean2=1.582, sd2=0.300, size2=10000)
dict_random_tau_data["CTX_LH_AMYGDALA_SUVR"] = gen_random_population(mean1=1.125, sd1=0.112, size1=10000, mean2=1.642, sd2=0.299, size2=10000)
dict_random_tau_data["CTX_RH_AMYGDALA_SUVR"] = gen_random_population(mean1=1.127, sd1=0.107, size1=10000, mean2=1.632, sd2=0.272, size2=10000)
dict_random_tau_data["CTX_LH_FUSIFORM_SUVR"] = gen_random_population(mean1=1.185, sd1=0.112, size1=10000, mean2=1.690, sd2=0.506, size2=10000)
dict_random_tau_data["CTX_RH_FUSIFORM_SUVR"] = gen_random_population(mean1=1.175, sd1=0.078, size1=10000, mean2=1.665, sd2=0.485, size2=10000)
dict_random_tau_data["CTX_LH_PARAHIPPOCAMPAL_SUVR"] = gen_random_population(mean1=1.097, sd1=0.095, size1=10000, mean2=1.450, sd2=0.292, size2=10000)
dict_random_tau_data["CTX_RH_PARAHIPPOCAMPAL_SUVR"] = gen_random_population(mean1=1.091, sd1=0.079, size1=10000, mean2=1.442, sd2=0.313, size2=10000)
dict_random_tau_data["CTX_LH_INFERIORTEMPORAL_SUVR"] = gen_random_population(mean1=1.199, sd1=0.132, size1=10000, mean2=1.799, sd2=0.548, size2=10000)
dict_random_tau_data["CTX_RH_INFERIORTEMPORAL_SUVR"] = gen_random_population(mean1=1.190, sd1=0.078, size1=10000, mean2=1.774, sd2=0.554, size2=10000)
dict_random_tau_data["CTX_LH_MIDDLETEMPORAL_SUVR"] = gen_random_population(mean1=1.161, sd1=0.130, size1=10000, mean2=1.671, sd2=0.523, size2=10000)
dict_random_tau_data["CTX_RH_MIDDLETEMPORAL_SUVR"] = gen_random_population(mean1=1.162, sd1=0.077, size1=10000, mean2=1.674, sd2=0.516, size2=10000)
dict_random_tau_data["CTX_LH_PRECENTRAL_SUVR"] = gen_random_population(mean1=0.997, sd1=0.070, size1=10000, mean2=1.139, sd2=0.264, size2=10000)
dict_random_tau_data["CTX_RH_PRECENTRAL_SUVR"] = gen_random_population(mean1=1.200, sd1=0.045, size1=10000, mean2=0.995, sd2=0.074, size2=10000) #Mimics inverted distribution
dict_random_tau_data["CTX_LH_POSTCENTRAL_SUVR"] = gen_random_population(mean1=0.972, sd1=0.074, size1=10000, mean2=1.084, sd2=0.252, size2=10000)
dict_random_tau_data["CTX_RH_POSTCENTRAL_SUVR"] = gen_random_population(mean1=1.091, sd1=0.286, size1=10000, mean2=1.091, sd2=0.286, size2=10000) #Mimics single distribution

simulated_data = pd.DataFrame(data=dict_random_tau_data) #Fit in a dataframe
simulated_data

	CTX_LH_ENTORHINAL_SUVR	CTX_RH_ENTORHINAL_SUVR	CTX_LH_AMYGDALA_SUVR	CTX_RH_AMYGDALA_SUVR	CTX_LH_FUSIFORM_SUVR	CTX_RH_FUSIFORM_SUVR	CTX_LH_PARAHIPPOCAMPAL_SUVR	CTX_RH_PARAHIPPOCAMPAL_SUVR	CTX_LH_INFERIORTEMPORAL_SUVR	CTX_RH_INFERIORTEMPORAL_SUVR	CTX_LH_MIDDLETEMPORAL_SUVR	CTX_RH_MIDDLETEMPORAL_SUVR	CTX_LH_PRECENTRAL_SUVR	CTX_RH_PRECENTRAL_SUVR	CTX_LH_POSTCENTRAL_SUVR	CTX_RH_POSTCENTRAL_SUVR
0	1.050977	1.093633	1.248796	0.971204	1.213006	1.168768	1.035867	1.168648	1.274241	1.131510	1.516734	1.241257	1.003841	1.239982	0.929840	1.507824
1	1.127510	1.017760	0.966256	1.085412	1.366134	1.169626	1.160098	1.095804	1.251569	1.270666	1.175667	1.199665	1.080894	1.136258	0.940067	1.174303
2	1.013789	1.137911	1.019332	0.956065	1.202347	1.322439	1.011055	1.131832	1.122195	1.147575	1.303427	1.246672	1.078963	1.149971	0.995944	0.478125
3	0.972404	1.060044	1.092585	1.291303	1.206834	1.247639	1.117978	1.068688	1.162280	1.091293	1.415403	1.114634	0.975131	1.240279	1.016704	1.010808
4	1.015437	1.121641	1.182642	1.092710	1.265941	1.085077	1.049168	1.104917	1.356587	1.016231	1.307198	1.104369	0.947937	1.197298	1.072664	0.651698
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
303	2.070946	1.331910	1.517221	1.580206	1.037356	1.613854	1.395437	0.915751	1.482730	1.943130	2.005232	2.572297	0.910657	0.934197	1.775944	1.405848
304	1.532058	1.484764	1.173318	1.834541	1.494277	1.177888	1.732385	1.277307	1.295207	2.213865	1.370325	1.144288	1.653183	1.076456	0.767494	0.944836
305	1.883837	1.628466	1.357189	1.354437	1.015956	1.539395	1.153276	1.357335	0.989533	2.211486	2.141548	2.314284	1.431442	1.026745	0.952673	0.754707
306	1.373594	1.747094	1.661831	1.826777	1.124029	1.265773	1.330975	1.594431	1.689485	2.361422	1.892912	1.943493	0.941921	0.890483	0.862833	1.137686
307	1.821488	1.804308	1.513617	1.821368	2.266935	3.260466	1.752214	1.410401	2.132434	1.783828	1.466281	1.579780	1.141350	0.966034	0.835235	0.421032

308 rows × 16 columns

Note that I purposefully changed two distributions in the data to simulate potential situations: first I inverted the low and high distributions for the CTX_RH_PRECENTRAL_SUVR, and I also forced a single distribution for the CTX_RH_POSTCENTRAL_SUVR. It is simulated data: it’s up to you to try and simulate a problem you want to check using the spatial extent.

From there you can simply merge this data to the IDs of the PREVENT-AD (or your own IDs) and you have a fully simulated dataset!

Advanced topics: Going further than two components

Once you understand the concepts behind the spatial extent, you can run with it and create your own. For instance, what if your data had not 1, not 2, but 3 distributions? You could think of deriving multiple sets of thresholds (low to medium, medium to high for instance).

Accordingly, you can use the examples on this page to build new simulated data fitting your needs, on which you can test the functions.

Literature - Abnormality in PET data

sihnpy regroups tools meant to serve the neuroimaging field as broadly as possible. As such, while I talk a bit about the literature behind each module, I don’t really go in depth behind the reasoning of some of the module but rather provide ressources and references meant to be used to go further.

That said, I feel like learning a bit more about the Alzheimer’s literature on the topic may help guide you in terms of choosing the right methodology and thresholds for your own data, whatever population you apply it to, so I created this extra section, discussing a bit on the literature of Alzheimer’s and also discusses how to choose your thresholds.

Studies of pathology in Alzheimer’s disease is usually focused on the two main pathological hallmarks: amyloid and tau pathology. Some studies have used Gaussian mixture modelling before to generate thresholds to determine what is “normal” or “abnormal”. Using amyloid pathology for instance, some authors have opted for values above 90th percentile of belonging to the “normal” distribution.2 3 This is a somewhat more liberal threshold, but goes with the rationale that once the probability that a person belongs to the “normal” distribution drops, we become uncertain of whether the person can still be considered “normal”.

To use this type of threshold in sihnpy, you simply have to set the “abnormality” threshold to 0.1 (i.e., the opposite of 90% probability of belonging to the “normal” distribution is 10% of belonging to the “abnormal” distribution).

A different philosophy to set these thresholds is to directly go with the probability of belonging to the “abnormal” distribution (i.e., the philosophy behind sihnpy’s spatial_extent module). 4 5 6 This usually yields slightly more conservative thresholds, as probability of abnormality increases with PET uptake.

So what threshold should you choose?

That really depends on whether you need a more liberal/sensitive threshold or a more conservative/specific threshold. You also have to be mindful of the overlap between the two distribution. In some cases, the two distributions will overlap quite a lot, meaning that the probabilities may yield thresholds of very low values, as we saw in the tutorial.

sihnpy is already set so that any probability thresholds given reflect the probability of belonging to the “abnormal” distribution. Previous work used a 50% probability in amyloid-PET data. 6 In our recent work, we opted for 50% probability of belonging to the “abnormal” distribution as well, but we found similar results using a conservative threshold at 90%. 1

My recommendation is to choose 1 main threshold and 2-3 other thresholds to replicate your results. That way you can make sure that the thresholds set make sense with your data. My other recommendation is to always look at the data to ensure that the thresholds you derive are biologically plausible.

References

Need for info? Make sure to go read these sources:

1(1,2): St-Onge et al. (2023). In preparation.
2: Ozlen H, Pichet Binette A, Köbe T, Meyer PF, Gonneaud J, St-Onge F, Provost K, Soucy JP, Rosa-Neto P, Breitner J, Poirier J, Villeneuve S; Alzheimer’s Disease Neuroimaging Initiative, the Harvard Aging Brain Study, the Presymptomatic Evaluation of Experimental or Novel Treatments for Alzheimer Disease Research Group. Spatial Extent of Amyloid-β Levels and Associations With Tau-PET and Cognition. JAMA Neurol. 2022 Oct 1;79(10):1025-1035. doi: 10.1001/jamaneurol.2022.2442. PMID: 35994280; PMCID: PMC9396472.
3: Villeneuve S, Rabinovici GD, Cohn-Sheehy BI, Madison C, Ayakta N, Ghosh PM, La Joie R, Arthur-Bentil SK, Vogel JW, Marks SM, Lehmann M, Rosen HJ, Reed B, Olichney J, Boxer AL, Miller BL, Borys E, Jin LW, Huang EJ, Grinberg LT, DeCarli C, Seeley WW, Jagust W. Existing Pittsburgh Compound-B positron emission tomography thresholds are too high: statistical and pathological evaluation. Brain. 2015 Jul;138(Pt 7):2020-33. doi: 10.1093/brain/awv112. Epub 2015 May 6. PMID: 25953778; PMCID: PMC4806716.
4: Vogel JW, Iturria-Medina Y, Strandberg OT, Smith R, Levitis E, Evans AC, Hansson O; Alzheimer’s Disease Neuroimaging Initiative; Swedish BioFinder Study. Spread of pathological tau proteins through communicating neurons in human Alzheimer’s disease. Nat Commun. 2020 May 26;11(1):2612. doi: 10.1038/s41467-020-15701-2. Erratum in: Nat Commun. 2021 Aug 5;12(1):4862. PMID: 32457389; PMCID: PMC7251068.
5: Franzmeier N, Dewenter A, Frontzkowski L, Dichgans M, Rubinski A, Neitzel J, Smith R, Strandberg O, Ossenkoppele R, Buerger K, Duering M, Hansson O, Ewers M. Patient-centered connectivity-based prediction of tau pathology spread in Alzheimer’s disease. Sci Adv. 2020 Nov 27;6(48):eabd1327. doi: 10.1126/sciadv.abd1327. PMID: 33246962; PMCID: PMC7695466.
6(1,2): Mormino EC, Betensky RA, Hedden T, Schultz AP, Ward A, Huijbers W, Rentz DM, Johnson KA, Sperling RA; Alzheimer’s Disease Neuroimaging Initiative; Australian Imaging Biomarkers and Lifestyle Flagship Study of Ageing; Harvard Aging Brain Study. Amyloid and APOE ε4 interact to influence short-term decline in preclinical Alzheimer disease. Neurology. 2014 May 20;82(20):1760-7. doi: 10.1212/WNL.0000000000000431. Epub 2014 Apr 18. PMID: 24748674; PMCID: PMC4035706.