Fingerprinting analysis - Tabular data: Step-by-step

Ok, it’s all great to know about the rationale and everything behind fingerprinting, but now let’s get to the fun part of it: how do we actually use the module?

Here, I demonstrate, step-by-step, how to run the fingerprinting analysis. You can follow along once you have installed sihnpy and opened a Jupyter Notebook.

Note

Note that the functions described on this page will only work with tabular data (i.e., spreadsheets). If you want to use fingerprinting with matrix-like data (one matrix per session, per participant), sihnpy offers a different set of functions for that purpose.

Specifically, the functions here only accept pandas.DataFrame.

Already read the tutorial and just want the code (a.k.a. too long; didn’t read)? Head on over to the tl;dr section.

Preparing the data

As always, sihnpy comes shipped with data for you to practice with. In this case, we use T1-weighted structural magnetic resonance imaging data processed using FreeSurfer. To import it, you simply need the following code:

from sihnpy import datasets

volume_data, thickness_data, aseg_data = datasets.pad_fptab_input()
volume_data

	session	run	ctx_lh_bankssts_volume	ctx_lh_caudalanteriorcingulate_volume	ctx_lh_caudalmiddlefrontal_volume	ctx_lh_cuneus_volume	ctx_lh_entorhinal_volume	ctx_lh_fusiform_volume	ctx_lh_inferiorparietal_volume	ctx_lh_inferiortemporal_volume	...	ctx_rh_rostralanteriorcingulate_volume	ctx_rh_rostralmiddlefrontal_volume	ctx_rh_superiorfrontal_volume	ctx_rh_superiorparietal_volume	ctx_rh_superiortemporal_volume	ctx_rh_supramarginal_volume	ctx_rh_frontalpole_volume	ctx_rh_temporalpole_volume	ctx_rh_transversetemporal_volume	ctx_rh_insula_volume
participant_id
sub-1000173	ses-FU12	run-001	2046.0	1446.0	6174.0	2971.0	2560.0	8953.0	10505.0	8728.0	...	1536.0	14428.0	21615.0	11647.0	10390.0	9711.0	1075.0	3316.0	830.0	6757.0
sub-1002928	ses-BL00	run-001	2572.0	2188.0	7326.0	2961.0	1915.0	9140.0	11575.0	9313.0	...	1649.0	13692.0	20785.0	10793.0	10916.0	9025.0	950.0	2167.0	937.0	7021.0
sub-1004359	ses-BL00	run-001	1768.0	1333.0	4424.0	3324.0	1695.0	8514.0	10465.0	10928.0	...	2033.0	10784.0	17373.0	10712.0	10921.0	8554.0	1329.0	2508.0	718.0	5668.0
sub-1004359	ses-FU12	run-001	1845.0	1302.0	4471.0	3303.0	1758.0	8381.0	10413.0	10702.0	...	2235.0	11402.0	17398.0	10559.0	11107.0	8597.0	1164.0	2480.0	735.0	5704.0
sub-1016072	ses-BL00	run-001	2234.0	1721.0	6435.0	2831.0	1421.0	8454.0	11209.0	9532.0	...	1758.0	14514.0	16783.0	11713.0	11079.0	9948.0	1138.0	2562.0	998.0	6466.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
sub-9930257	ses-BL00	run-001	1692.0	1586.0	5212.0	3144.0	1641.0	8405.0	9919.0	9672.0	...	1813.0	14536.0	18152.0	13127.0	10875.0	8554.0	1271.0	2705.0	1066.0	6363.0
sub-9931234	ses-BL00	run-001	1886.0	1191.0	5521.0	2587.0	1628.0	9728.0	10930.0	11052.0	...	2118.0	16833.0	18207.0	13311.0	10490.0	8442.0	1135.0	2398.0	942.0	6929.0
sub-9931234	ses-FU12	run-001	1779.0	1166.0	5493.0	2310.0	1672.0	9822.0	10589.0	10550.0	...	2003.0	16640.0	18026.0	13163.0	10340.0	8363.0	1313.0	2484.0	932.0	6859.0
sub-9939055	ses-BL00	run-001	1785.0	2550.0	4754.0	1808.0	1604.0	8510.0	8750.0	8258.0	...	1975.0	13286.0	17322.0	10076.0	9996.0	7155.0	981.0	2212.0	651.0	5746.0
sub-9939055	ses-FU12	run-001	1795.0	2467.0	4621.0	1861.0	1731.0	8383.0	8613.0	8176.0	...	1898.0	12819.0	17319.0	9843.0	9922.0	7258.0	1097.0	2117.0	645.0	5298.0

541 rows × 70 columns

Let’s take, for example, the volumetric data printed above. This dataset is in long format, i.e., each visit for each participant has its own row. You can distinguish the different visits by the session variable (either BL00 or FU12, representing Baseline and Follow-up at 12 months). The run variable is not particularly useful for you; for some PREVENT-AD participants who didn’t have a good first run of imaging, a second run was taken. In sihnpy’s data, we kept run-002 when it was available.

The rest of the columns (all starting with ctx) are the actual data in each cortical region, where lh is the region in the left hemisphere, and rh is the region in the right hemisphere.

Note that you can use the fingerprinting with the volume, thickness and even the aseg data. However, aseg doesn’t only hold volume data: it also includes variables like total intracranial volume. Be careful if using this data.

Let’s move on to the fingerprinting!

Fingerprinting steps

1. Importing and cleaning the data

As you may have seen in the matrix data fingerprinting, cleaning the data is a very important and very… long step. Thankfully, this is much shorter in this fingerprinting version, but there are a couple of critical steps that need to be done:

Your dataframe needs to have an index, where the index is the participant IDs (sihnpy relies on the index on multiple occasions)
Each participant must have at least 2 visits (i.e., two rows) in the dataframe
Your dataframe needs to be in long format, with one variable specifying which “visit” we are talking about
The columns you want to use in your dataframe for fingerprinting must all start with the same prefix

Thankfully, the data sihnpy provides is already formatted for these requirements (it would not be super fair if I gave you data that wasn’t ready to use, right?).
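If you are bringing your own data, it can help to check these requirements before going further. Below is a minimal sketch of such a check using pandas only. Note that check_fingerprint_input is a hypothetical helper written for this tutorial (it is not part of sihnpy), and the toy dataframe is made up for illustration.

```python
import pandas as pd


def check_fingerprint_input(df, visit_col, prefix):
    """Hypothetical helper (not part of sihnpy): list the problems found in `df`.

    An empty list means all the checks passed.
    """
    problems = []
    # The dataframe needs one variable specifying the visit
    if visit_col not in df.columns:
        problems.append(f"missing visit column '{visit_col}'")
    # Every participant (index value) should appear at least twice
    visits_per_participant = df.index.value_counts()
    n_single = int((visits_per_participant < 2).sum())
    if n_single > 0:
        problems.append(f"{n_single} participant(s) with fewer than 2 visits")
    # At least one column should carry the shared prefix
    feature_cols = [c for c in df.columns if str(c).startswith(prefix)]
    if not feature_cols:
        problems.append(f"no columns start with '{prefix}'")
    return problems


# Made-up toy data: sub-02 only has a single visit
toy = pd.DataFrame(
    {
        "session": ["ses-BL00", "ses-FU12", "ses-BL00"],
        "ctx_lh_cuneus_volume": [2971.0, 2950.0, 3324.0],
        "ctx_rh_cuneus_volume": [3010.0, 2998.0, 3301.0],
    },
    index=pd.Index(["sub-01", "sub-01", "sub-02"], name="participant_id"),
)

problems = check_fingerprint_input(toy, visit_col="session", prefix="ctx")
print(problems)
```

Here the check flags the participant with a single row. That is not a blocker in itself (participants with only one row are removed at the import step), but it is good to know how many participants you are about to lose.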

Let’s clean the data:

from sihnpy import fingerprinting as fp 

data_bl, data_fu = fp.import_fingerprint_data(volume_data, var='session')
data_bl

	session	run	ctx_lh_bankssts_volume	ctx_lh_caudalanteriorcingulate_volume	ctx_lh_caudalmiddlefrontal_volume	ctx_lh_cuneus_volume	ctx_lh_entorhinal_volume	ctx_lh_fusiform_volume	ctx_lh_inferiorparietal_volume	ctx_lh_inferiortemporal_volume	...	ctx_rh_rostralanteriorcingulate_volume	ctx_rh_rostralmiddlefrontal_volume	ctx_rh_superiorfrontal_volume	ctx_rh_superiorparietal_volume	ctx_rh_superiortemporal_volume	ctx_rh_supramarginal_volume	ctx_rh_frontalpole_volume	ctx_rh_temporalpole_volume	ctx_rh_transversetemporal_volume	ctx_rh_insula_volume
participant_id
sub-1004359	ses-BL00	run-001	1768.0	1333.0	4424.0	3324.0	1695.0	8514.0	10465.0	10928.0	...	2033.0	10784.0	17373.0	10712.0	10921.0	8554.0	1329.0	2508.0	718.0	5668.0
sub-1016072	ses-BL00	run-001	2234.0	1721.0	6435.0	2831.0	1421.0	8454.0	11209.0	9532.0	...	1758.0	14514.0	16783.0	11713.0	11079.0	9948.0	1138.0	2562.0	998.0	6466.0
sub-1072774	ses-BL00	run-001	2366.0	1753.0	5689.0	2570.0	1681.0	8522.0	11084.0	8635.0	...	1744.0	14151.0	16346.0	11987.0	9106.0	9915.0	943.0	2725.0	643.0	5986.0
sub-1076159	ses-BL00	run-001	2403.0	795.0	5211.0	2595.0	1918.0	8779.0	11293.0	11561.0	...	2211.0	13838.0	20560.0	13559.0	12788.0	11333.0	1467.0	2525.0	847.0	7030.0
sub-1154932	ses-BL00	run-001	3354.0	2513.0	5665.0	2870.0	1804.0	9756.0	12505.0	10587.0	...	2302.0	14159.0	19629.0	10676.0	10926.0	8423.0	1079.0	3148.0	712.0	7060.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
sub-9865768	ses-BL00	run-001	1859.0	1703.0	4267.0	2353.0	1420.0	7183.0	8749.0	6648.0	...	1694.0	12451.0	16895.0	9819.0	9702.0	7613.0	1117.0	2409.0	759.0	6087.0
sub-9889544	ses-BL00	run-001	1826.0	2437.0	7097.0	3055.0	1948.0	8378.0	10641.0	11911.0	...	2196.0	15937.0	20839.0	12166.0	11910.0	10482.0	1001.0	2479.0	954.0	6049.0
sub-9909448	ses-BL00	run-002	2813.0	1368.0	6195.0	2467.0	1930.0	9139.0	10930.0	11865.0	...	1952.0	16274.0	16658.0	11004.0	10621.0	10026.0	1302.0	3313.0	809.0	6644.0
sub-9931234	ses-BL00	run-001	1886.0	1191.0	5521.0	2587.0	1628.0	9728.0	10930.0	11052.0	...	2118.0	16833.0	18207.0	13311.0	10490.0	8442.0	1135.0	2398.0	942.0	6929.0
sub-9939055	ses-BL00	run-001	1785.0	2550.0	4754.0	1808.0	1604.0	8510.0	8750.0	8258.0	...	1975.0	13286.0	17322.0	10076.0	9996.0	7155.0	981.0	2212.0	651.0	5746.0

234 rows × 70 columns

Ok, let’s deconstruct what just happened. The function import_fingerprint_data does two things: 1) it removes participants who only have one row, as we need two visits to compute self-identifiability, and 2) it splits the first and second visits into two dataframes (it simplifies the code and calculations). We are left with two dataframes: one with the baseline data and one with the follow-up data.
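To make these two operations more concrete, here is a rough pandas sketch of the same idea on made-up data. This is not sihnpy’s actual implementation: split_first_last is a hypothetical function, and it assumes that sorting the session labels alphabetically puts the visits in chronological order (true for BL00 and FU12, but not guaranteed for every naming scheme).

```python
import pandas as pd


def split_first_last(df, visit_col):
    """Hypothetical sketch: drop single-visit participants, split first/last visit."""
    # Keep only participants (index values) that appear at least twice
    counts = df.index.value_counts()
    df = df[df.index.isin(counts[counts >= 2].index)]
    # Order rows by visit label (mergesort is a stable sort)
    df = df.sort_values(visit_col, kind="mergesort")
    # First and last occurrence of each participant, realigned on the index
    first = df[~df.index.duplicated(keep="first")].sort_index()
    last = df[~df.index.duplicated(keep="last")].sort_index()
    return first, last


# Made-up toy data: sub-02 has one visit, sub-03 has its rows out of order
toy = pd.DataFrame(
    {
        "session": ["ses-BL00", "ses-FU12", "ses-BL00", "ses-FU12", "ses-BL00"],
        "ctx_lh_cuneus_volume": [2971.0, 2950.0, 3324.0, 3144.0, 3160.0],
    },
    index=pd.Index(
        ["sub-01", "sub-01", "sub-02", "sub-03", "sub-03"], name="participant_id"
    ),
)

first_visit, last_visit = split_first_last(toy, visit_col="session")
print(first_visit)
print(last_visit)
```

Both outputs share the same participants in the same order, which is what the next step relies on.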

We are now ready for fingerprinting.

Fix: what if I have more than two visits per participant?

In cohort studies (PREVENT-AD included), participants may be followed for much more than two years. This complicates matters as it forces the user of the fingerprint methodology to choose which visits to use for fingerprinting. In sihnpy, the option currently implemented is to fingerprint the first visit against the last visit available.

If you have more than two visits in your study, there are usually two types of analyses you might want to do: fingerprint two specific visits (e.g., first and last) or calculate the change over time. In the first case, before using sihnpy you need to make sure that the session variable is ordered correctly for each participant. This is usually (but not always) the case, so it needs to be checked before use. In the second case, you will have to create individual dataframes for each pair of visits you would like to fingerprint.

Depending on the interest shown, I could modify the code to remove this step before sihnpy, but it is currently not in the plans.
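In the meantime, building one dataframe per pair of visits is fairly quick with pandas. The sketch below is one possible way to do it on made-up data with three visits; visit_pair is a hypothetical helper (not part of sihnpy), and the ses-FU24 label is invented for the example.

```python
import pandas as pd

# Made-up long-format data: sub-01 has three visits, sub-02 is missing FU12
long_df = pd.DataFrame(
    {
        "session": ["ses-BL00", "ses-FU12", "ses-FU24", "ses-BL00", "ses-FU24"],
        "ctx_lh_cuneus_volume": [2971.0, 2950.0, 2931.0, 3324.0, 3290.0],
    },
    index=pd.Index(
        ["sub-01", "sub-01", "sub-01", "sub-02", "sub-02"], name="participant_id"
    ),
)


def visit_pair(df, visit_col, first_label, second_label):
    """Hypothetical helper: keep only the rows belonging to the two requested visits."""
    subset = df[df[visit_col].isin([first_label, second_label])]
    # Stable sorts: order by visit label first, then regroup rows by participant
    subset = subset.sort_values(visit_col, kind="mergesort")
    return subset.sort_index(kind="mergesort")


bl_fu12 = visit_pair(long_df, "session", "ses-BL00", "ses-FU12")
bl_fu24 = visit_pair(long_df, "session", "ses-BL00", "ses-FU24")
print(bl_fu12)
print(bl_fu24)
```

Each of these dataframes can then go through the import step on its own. In the first pair, sub-02 only has one row left, so it would be removed at import like any participant with a single visit.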

Advanced users: Unequal number of participants between visits.

In many fingerprinting papers, authors use an unequal number of participants between the first and second visit. This can be useful when adding more participants in the second visit for instance, as it adds more potential noise in the analysis, and reinforces that when identification works, it really works.

However, for simplicity, the code currently only offers fingerprinting for participants with both visits. This could change in the future depending on the interest.

2. Fingerprinting tabular data

And we’re already ready to fingerprint the data!

The next function only requires the two dataframes we cleaned with the first function and the prefix used to mark the columns we want to keep for fingerprinting.

similarity_matrix = fp.fingerprint_tabs(data1=data_bl, data2=data_fu, pref='ctx') #This should take around 20 seconds on sihnpy's data

Participant 1 / 234
Participant 2 / 234
Participant 3 / 234
...
Participant 233 / 234
Participant 234 / 234

similarity_matrix

array([[0.99939915, 0.9668774 , 0.97540227, ..., 0.95741285, 0.96756766,
        0.96571496],
       [0.9668774 , 0.99961408, 0.98778711, ..., 0.9787953 , 0.97243421,
        0.98021875],
       [0.97540227, 0.98778711, 0.99907875, ..., 0.97767822, 0.98302374,
        0.9825117 ],
       ...,
       [0.95741285, 0.9787953 , 0.97767822, ..., 0.99862623, 0.96710599,
        0.97997632],
       [0.96756766, 0.97243421, 0.98302374, ..., 0.96710599, 0.9996869 ,
        0.97899449],
       [0.96571496, 0.98021875, 0.9825117 , ..., 0.97997632, 0.97899449,
        0.99953386]])

Ok, so what did we do? Just like in the matrix-like fingerprinting, we created a similarity matrix where the diagonal is the self-identifiability (i.e., within-individual correlation) and the off-diagonal elements are all of the others-identifiabilities (i.e., between-individual correlation).
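If you are curious about what such a similarity matrix looks like under the hood, here is a small NumPy sketch on made-up data. It is only an illustration of the idea (Pearson correlation between each row of the first visit and each row of the second visit), not sihnpy’s code; similarity_sketch and the column names are invented for the example.

```python
import numpy as np
import pandas as pd


def similarity_sketch(data1, data2, prefix):
    """Hypothetical sketch: Pearson correlation between rows of two visits."""
    # Keep only the columns marked with the shared prefix
    cols = [c for c in data1.columns if c.startswith(prefix)]
    x = data1[cols].to_numpy(dtype=float)
    y = data2[cols].to_numpy(dtype=float)
    n = x.shape[0]
    # np.corrcoef treats each row as a variable and stacks x on top of y,
    # giving a (2n, 2n) matrix; the top-right block is visit 1 vs. visit 2
    full = np.corrcoef(x, y)
    return full[:n, n:]


ids = pd.Index(["sub-01", "sub-02", "sub-03"], name="participant_id")
cols = ["ctx_a", "ctx_b", "ctx_c", "ctx_d"]
# Made-up values: visit 2 is visit 1 with small changes
visit1 = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [1.0, 3.0, 2.0, 4.0]],
    index=ids, columns=cols,
)
visit2 = pd.DataFrame(
    [[1.1, 2.0, 3.0, 4.2], [4.0, 3.1, 2.0, 0.9], [1.0, 3.2, 2.0, 4.1]],
    index=ids, columns=cols,
)

sim = similarity_sketch(visit1, visit2, prefix="ctx")
print(np.round(sim, 3))
```

Row i, column j holds the correlation between participant i at the first visit and participant j at the second visit, so the diagonal is each participant compared with themselves.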

I discuss the different aspects of this method in more detail in the section on fingerprinting matrix-like data. Note that contrary to the fingerprinting on matrix-like data, the script for tabular data is more limited at the moment: it doesn’t offer normalization of the data, it doesn’t offer the selection of specific columns (like selecting nodes in the other version) and it doesn’t offer anything other than Pearson correlations. The reason for this is simply that, by design, the tabular version of the fingerprinting should really be used with smaller data (e.g., structural data in a small set of parcels). That said, should you be interested in these additions, let me know by opening an issue on GitHub.

And that’s it! We’re already ready to compute the measures.

Fix: Possible errors

The script can raise two possible errors:

When the participants in the first dataframe (i.e., the index) are different from the participants in the second dataframe
When the columns of the first dataframe are different from the columns of the second dataframe

In both cases, this is really dependent on the data input into sihnpy, so you have to be careful. If the data is not working, then you need to reformat the data before the first step.
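If you want to catch these two situations yourself before fingerprinting, a quick check along the following lines can help. This is a sketch on made-up data; check_alignment is a hypothetical helper, and the error messages are mine, not the ones sihnpy prints.

```python
import pandas as pd


def check_alignment(data1, data2):
    """Hypothetical helper: raise a ValueError if the two visits are not aligned."""
    # Index.equals also returns False when the order differs
    if not data1.index.equals(data2.index):
        only_1 = data1.index.difference(data2.index).tolist()
        only_2 = data2.index.difference(data1.index).tolist()
        raise ValueError(
            "participants differ or are ordered differently: "
            f"only in data1 {only_1}, only in data2 {only_2}"
        )
    if not data1.columns.equals(data2.columns):
        raise ValueError("columns differ between the two dataframes")


# Made-up data: the second visit holds sub-03 instead of sub-02
bl = pd.DataFrame({"ctx_a": [1.0, 2.0]}, index=["sub-01", "sub-02"])
fu = pd.DataFrame({"ctx_a": [1.1, 2.1]}, index=["sub-01", "sub-03"])

try:
    check_alignment(bl, fu)
    message = ""
except ValueError as err:
    message = str(err)

print(message)
```

The message lists which participants are missing on each side, which makes it easier to find the rows to fix.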

3. Computing fingerprinting metrics

The last step needed is to compute the fingerprint metrics (accuracy, self-identifiability, others-identifiability and differential identifiability). This is also quite simple, as you just need to provide a dataframe (sihnpy will grab the index of it) and the similarity matrix we computed in the previous step.

fp_metrics = fp.tab_metrics_calc(data=data_bl, similar_matrix=similarity_matrix, name='tutorial')
fp_metrics

	si_tutorial	oi_tutorial	fia_tutorial	di_tutorial
participant_id
sub-1004359	0.999399	0.973846	1.0	0.025554
sub-1016072	0.999614	0.977725	1.0	0.021889
sub-1072774	0.999079	0.980443	1.0	0.018636
sub-1076159	0.997419	0.980987	1.0	0.016433
sub-1154932	0.999634	0.977076	1.0	0.022558
...	...	...	...	...
sub-9865768	0.999506	0.977502	1.0	0.022003
sub-9889544	0.999228	0.978876	1.0	0.020352
sub-9909448	0.998626	0.971501	1.0	0.027125
sub-9931234	0.999687	0.974556	1.0	0.025131
sub-9939055	0.999534	0.981042	1.0	0.018492

234 rows × 4 columns

fp_metrics.iloc[233,3]

0.018491580529617635

In order, the script outputs the self-identifiability (si), the others-identifiability (oi), the fingerprint identification accuracy (fia) and the differential identifiability (di).

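si and oi are Pearson correlation coefficients, so in theory they can range between -1 and 1 (in practice, with data like this, they tend to be close to 1). di is the difference between si and oi, so it is usually a small positive value; a value at or below 0 means the participant looks no more similar to themselves than to the rest of the cohort. fia will be either 0 or 1, where 1 represents an accurate identification of the participant. oi is the average of all between-individual correlations for a specific individual (i.e., on average, how similar is a participant to the rest of the cohort).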

The name argument in the function is for the user to add a name at the end. This is particularly useful if you run multiple fingerprinting analyses so you can distinguish them.
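To see how these four metrics relate to the similarity matrix, here is a small sketch on a made-up 3 × 3 matrix. It is not sihnpy’s implementation: metrics_sketch is a hypothetical function, and I am assuming that oi is averaged over each row and that a participant counts as identified when their diagonal value is the largest in their row.

```python
import numpy as np
import pandas as pd


def metrics_sketch(sim, index, name):
    """Hypothetical sketch: derive si, oi, fia and di from a similarity matrix."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    # Self-identifiability: the diagonal
    si = np.diag(sim)
    # Others-identifiability: mean of the off-diagonal values in each row (assumption)
    oi = np.where(off_diag, sim, 0.0).sum(axis=1) / (n - 1)
    # Identification is accurate if si beats every other value in the row (assumption)
    best_other = np.where(off_diag, sim, -np.inf).max(axis=1)
    fia = (si > best_other).astype(float)
    # Differential identifiability: si minus oi
    di = si - oi
    return pd.DataFrame(
        {f"si_{name}": si, f"oi_{name}": oi, f"fia_{name}": fia, f"di_{name}": di},
        index=index,
    )


# Made-up similarity matrix: sub-02 is more similar to sub-03 than to themselves
toy_sim = [
    [0.99, 0.97, 0.96],
    [0.97, 0.95, 0.98],
    [0.96, 0.98, 0.99],
]
toy_metrics = metrics_sketch(toy_sim, index=["sub-01", "sub-02", "sub-03"], name="toy")
print(toy_metrics)
print(f"Accuracy: {round(toy_metrics['fia_toy'].mean() * 100, 2)}%")
```

In this toy example, sub-02 is not identified (fia of 0) and ends up with a negative di, while the other two participants are identified.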

Let’s see what we got in our sample!

print(f"Total fingerprinting accuracy is: {round((fp_metrics['fia_tutorial'].sum() / len(fp_metrics)) * 100, 2)}%")

Total fingerprinting accuracy is: 98.29%

Wow, we did pretty well!

4. Exporting the results

We’re already at the end! Time flies by in good company (I hope).

The last step is simply to export the data. This is done by calling the following function:

fp.tab_export("/path/to/output", data1=data_bl, data2=data_fu, similar_matrix=similarity_matrix, fp_metrics=fp_metrics, name='test')

It requires: 1) the path to where the files should be stored, 2) the first dataset (with the first visit), 3) the second dataset (with the second visit), 4) the similarity matrix and 5) the fingerprint metrics. Every one of these elements is output to file at the location specified by the user.

Conclusion

You made it through the whole fingerprinting tutorial! (or well, you skipped ahead to here) I hope I was able to make the steps clear for you and that you enjoyed following along. If things weren’t clear in the documentation, please submit an issue on GitHub.

Don’t forget to cite the package and the paper describing this method if you end up using it in one of your papers!

tl;dr

Got bored during the tutorial? You already finished the tutorial and just want a quick reminder of the main functions you need? Or you just want to bash ahead with the code without reading? I got you. Here’s a condensed form of the code:

from sihnpy import datasets
from sihnpy import fingerprinting as fp

volume_data = datasets.pad_fptab_input()[0]

data_bl, data_fu = fp.import_fingerprint_data(data=volume_data, var='session')

similarity_matrix = fp.fingerprint_tabs(data1=data_bl, data2=data_fu, pref='ctx')

fp_metrics = fp.tab_metrics_calc(data=data_bl, similar_matrix=similarity_matrix, name='test')

fp.tab_export("/path/to/output", data1=data_bl, data2=data_fu, similar_matrix=similarity_matrix, fp_metrics=fp_metrics, name='test')

References

You can find references for this topic in the main introduction on fingerprinting.