Fingerprinting analysis - Matrix-like data: Step-by-step

Ok, it’s all great to know about the rationale and everything behind fingerprinting, but now let’s get to the fun part of it: how do we actually use the module?

Here, I demonstrate, step-by-step, how to run the fingerprinting analysis. You can follow along once you have installed sihnpy and opened a Jupyter Notebook.

Note

Note that the functions described on this page will only work with matrix-like data, as is common for functional or structural connectivity. If you want to use fingerprinting with tabular data (one row of data per participant), sihnpy offers a different set of functions for that purpose.

Specifically, the functions here only accept two paths containing matrices that you want to fingerprint together.

Already read the tutorial before and you just want the code (a.k.a. too long; didn’t read)? Head on out to the tl;dr section.

1. Preparing the data

To run a fingerprinting module for matrix-like data, we need three things:

  • The path to a list of participants to analyze

  • The path to the folder containing the matrices of the first session of brain imaging

  • The path to the folder containing the matrices of the second session of brain imaging

If you already have the above for your data, you can skip ahead to the next section. Otherwise, sihnpy also offers a small subset of data from the Prevent-AD cohort. Remember that by using the Prevent-AD data to practice using the code, you agree to the terms of use. You can access the data using the code below:

from sihnpy.datasets import pad_fp_input

id_list, path_participant_list, path_data_fp = pad_fp_input()

As we discuss in the section on using Prevent-AD data, the id_list variable (or whatever you want to call it) contains the IDs and the basic information on participants included in the dataset.

id_list #Or use print(id_list) if you are not using a Jupyter Notebook
participant_id sex test_language handedness_score handedness_interpretation
0 sub-1000173 Male French 100 Right-handed
1 sub-1002928 Female French 100 Right-handed
2 sub-1004359 Female French 90 Right-handed
3 sub-1016072 Female French -100 Left-handed
4 sub-1031654 Male French 100 Right-handed
5 sub-1072774 Female French 100 Right-handed
6 sub-1076159 Female French 100 Right-handed
7 sub-1121981 Female French 100 Right-handed
8 sub-1154932 Male French 30 Ambidextrous
9 sub-1176949 Female French 80 Right-handed
10 sub-1177880 Female French 100 Right-handed
11 sub-1263509 Female French 100 Right-handed
12 sub-1283278 Female English 80 Right-handed
13 sub-1284264 Female French 80 Right-handed
14 sub-1322140 Female French 100 Right-handed

For the fingerprinting analysis, the rest of the variables in the dataframe above are not really important for now. The only important variable is the column participant_id, which should always be the first column in the dataset (though the header of the column can be named whatever you feel like). We will also need the variable path_participant_list for the fingerprinting. The id_list given by pad_fp_input() is mostly there so you can quickly check which participants are included.

Next, we’ll take a look at the variable path_data_fp. This one is specific to the fingerprinting module.

path_data_fp
{'BL00': {'rest_run1': '/home/docs/checkouts/readthedocs.org/user_builds/sihnpy/envs/latest/lib/python3.9/site-packages/sihnpy/data/pad_conp_minimal/BL00/rest_run1',
  'rest_run2': '/home/docs/checkouts/readthedocs.org/user_builds/sihnpy/envs/latest/lib/python3.9/site-packages/sihnpy/data/pad_conp_minimal/BL00/rest_run2',
  'encoding': '/home/docs/checkouts/readthedocs.org/user_builds/sihnpy/envs/latest/lib/python3.9/site-packages/sihnpy/data/pad_conp_minimal/BL00/encoding',
  'retrieval': '/home/docs/checkouts/readthedocs.org/user_builds/sihnpy/envs/latest/lib/python3.9/site-packages/sihnpy/data/pad_conp_minimal/BL00/retrieval'},
 'FU12': {'rest_run1': '/home/docs/checkouts/readthedocs.org/user_builds/sihnpy/envs/latest/lib/python3.9/site-packages/sihnpy/data/pad_conp_minimal/FU12/rest_run1',
  'rest_run2': '/home/docs/checkouts/readthedocs.org/user_builds/sihnpy/envs/latest/lib/python3.9/site-packages/sihnpy/data/pad_conp_minimal/FU12/rest_run2',
  'encoding': '/home/docs/checkouts/readthedocs.org/user_builds/sihnpy/envs/latest/lib/python3.9/site-packages/sihnpy/data/pad_conp_minimal/FU12/encoding',
  'retrieval': '/home/docs/checkouts/readthedocs.org/user_builds/sihnpy/envs/latest/lib/python3.9/site-packages/sihnpy/data/pad_conp_minimal/FU12/retrieval'}}

For this specific module, we get a nested dictionary with paths to the functional connectivity matrices, which we will use to launch the fingerprinting.

Let’s unpack this a bit. For those not familiar with dictionaries you might feel like “Woah! What’s all that weird output??”. I won’t go into too much detail, but the idea is that we create “dictionary entries”, where each key (before the colon) has a value (after the colon). Here, the innermost values are the paths to the connectivity matrices. These entries tell us which path we should use for the fingerprinting, depending on what interests us.

For this fingerprinting practice, we will use the functional connectivity matrices derived from the resting state (run 1) at baseline and the functional connectivity matrices derived from the resting state (run 1) 12 months later. Looking at our dictionary, we see two higher level entries BL00 and FU12, respectively meaning baseline and 12-months follow-up. We see that both BL00 and FU12 have nested dictionaries with a key named rest_run1, and two paths pointing to different directories.

Tip

Feel free to use any of the paths provided while you are practicing. Do you see any differences in the results?

Once we understand the dictionary, it becomes really easy to get the path we are interested in. The syntax is simply dictionary_name[key_level1][key_level2]:

print(path_data_fp['BL00']['rest_run1'])
print(path_data_fp['FU12']['rest_run1'])
/home/docs/checkouts/readthedocs.org/user_builds/sihnpy/envs/latest/lib/python3.9/site-packages/sihnpy/data/pad_conp_minimal/BL00/rest_run1
/home/docs/checkouts/readthedocs.org/user_builds/sihnpy/envs/latest/lib/python3.9/site-packages/sihnpy/data/pad_conp_minimal/FU12/rest_run1

Easy as pie! We now have the paths we need to launch the fingerprinting.
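If you want to see everything that is available at a glance, you can loop over both levels of the dictionary. The snippet below is only a sketch: toy_paths is a made-up stand-in with the same two-level shape as path_data_fp, and the paths in it do not exist.

```python
# Toy stand-in for the dictionary returned by pad_fp_input(); the paths are made up.
toy_paths = {
    "BL00": {"rest_run1": "/data/BL00/rest_run1", "encoding": "/data/BL00/encoding"},
    "FU12": {"rest_run1": "/data/FU12/rest_run1", "encoding": "/data/FU12/encoding"},
}

# First-level keys are sessions, second-level keys are tasks.
sessions = list(toy_paths.keys())

# List every available (session, task, path) combination.
available = [
    (session, task, path)
    for session, tasks in toy_paths.items()
    for task, path in tasks.items()
]

for session, task, path in available:
    print(f"{session:>5} | {task:<10} | {path}")
```

Swapping toy_paths for path_data_fp should list the eight session and task combinations shown earlier.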

Tip

The paths you get on your own computer might look a lot weirder than the ones you see above, depending on where sihnpy is installed in your Python installation. However, they should be easily accessible on your own computer. You can test it out by doing the following in your terminal:

# Replace the path below by the path sihnpy is telling you the data is in
$ ls ~/Desktop/sihnpy/src/sihnpy/data/pad_conp_minimal/BL00/rest_run1

Consequently, since they are accessible, it also means you can take a look at the individual matrices should you wish to do so. I do love to look at a good connectivity matrix.

2. Importing the data for fingerprinting

Warning

If you skipped section 1 (or if you did the tutorial and are now using your own data), you need to make sure of a couple of very important things before you start:

  1. sihnpy will accept the file with participant IDs if it is in .csv, .tsv or is a .txt file (with or without the .txt extension). However, it expects the file to be formatted in columns, and to have a column header (that can be named whatever you want). Otherwise, it might drop the first participant in the list.

  2. The participant IDs column MUST be the first column. Otherwise, whatever is first will be selected as IDs

  3. The full participant ID, exactly as written in the file, MUST be present in the file name corresponding to that participant. For instance, if in the list of participants you have sub-666, then sub-666 must also be in the filename (e.g., sub-666_whatever_modality).

  4. There should not be exact duplicates within each modality. For instance, there should not be two files for participants sub-666 in the folder for the first modality. Otherwise, sihnpy will throw an error.
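Points 3 and 4 can be checked by hand before launching anything. The helper below is a hypothetical sketch (check_ids_against_files is not part of sihnpy) that assumes the same simple rule as above: an ID matches a file when the ID appears somewhere in the file name.

```python
def check_ids_against_files(ids, filenames):
    """Return (missing, duplicated): IDs with no matching file, and IDs matching several files."""
    missing, duplicated = [], []
    for participant_id in ids:
        matches = [name for name in filenames if participant_id in name]
        if not matches:
            missing.append(participant_id)
        elif len(matches) > 1:
            duplicated.append(participant_id)
    return missing, duplicated


# Made-up IDs and file names for illustration.
ids = ["sub-666", "sub-667", "sub-668"]
filenames = [
    "sub-666_ses-01_rest.tsv",
    "sub-667_ses-01_rest.tsv",
    "sub-667_ses-01_rest_copy.tsv",
]

missing, duplicated = check_ids_against_files(ids, filenames)
print("No file found for:", missing)
print("More than one file for:", duplicated)
```

A missing file is not necessarily a problem (participants without data are simply left out later on), but a duplicated one is worth fixing before you continue. Note that a plain substring match also means an ID like sub-66 would match a file for sub-666, so IDs that are prefixes of other IDs deserve a second look.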

The first step in running the fingerprinting module is to import the package we need.

from sihnpy import fingerprinting #That's the only module we need here

Next, we need to import the participant ids. At this stage, we just need the path to the file containing the IDs.

list_of_ids = fingerprinting.import_fingerprint_ids(path_participant_list) 
print(list_of_ids)
['sub-1000173', 'sub-1002928', 'sub-1004359', 'sub-1016072', 'sub-1031654', 'sub-1072774', 'sub-1076159', 'sub-1121981', 'sub-1154932', 'sub-1176949', 'sub-1177880', 'sub-1263509', 'sub-1283278', 'sub-1284264', 'sub-1322140']

What happened here is that:

  1. sihnpy imported the .tsv containing the IDs of the participants (using pandas)

  2. It then extracted the first column (participant_id)

  3. And converted the values into a simple list

We see that we have our 15 participants, as expected.
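To make these three steps concrete, here is a rough equivalent written with Python's built-in csv module rather than pandas. It is a sketch of the idea, not sihnpy's actual code, and the two-participant file is a made-up stand-in.

```python
import csv
import io

# Stand-in for a small .tsv file; in practice you would open your own file instead.
toy_file = io.StringIO(
    "participant_id\tsex\n"
    "sub-1000173\tMale\n"
    "sub-1002928\tFemale\n"
)

reader = csv.reader(toy_file, delimiter="\t")
header = next(reader)                    # the first row is treated as the column header
ids = [row[0] for row in reader if row]  # keep the first column only

print(header[0])
print(ids)
```

This also shows why the header matters: the first row is always consumed as a header, so a file without one would lose its first participant.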

3. Create a “fingerprinting object”

For the next step, sihnpy needs the list of IDs we just created and the paths to the functional connectivity matrices. To facilitate some internal processing (or well… because at the time I thought it was the best way to code this), sihnpy creates a python class (i.e., object-oriented programming). You don’t really need to understand object-oriented programming here, but it’s more to give you context.

The code here is quite straightforward:

fp_mats = fingerprinting.FingerprintMats(list_of_ids, path_data_fp['BL00']['rest_run1'], path_data_fp['FU12']['rest_run1'])

What we get is an object called fp_mats (the name doesn’t matter much, you can call it what you want as long as you keep using the same name going forward). The object contains the original variables we gave it to start:

print(fp_mats.id_ls)
print(fp_mats.path_m1)
print(fp_mats.path_m2)
['sub-1000173', 'sub-1002928', 'sub-1004359', 'sub-1016072', 'sub-1031654', 'sub-1072774', 'sub-1076159', 'sub-1121981', 'sub-1154932', 'sub-1176949', 'sub-1177880', 'sub-1263509', 'sub-1283278', 'sub-1284264', 'sub-1322140']
/home/docs/checkouts/readthedocs.org/user_builds/sihnpy/envs/latest/lib/python3.9/site-packages/sihnpy/data/pad_conp_minimal/BL00/rest_run1
/home/docs/checkouts/readthedocs.org/user_builds/sihnpy/envs/latest/lib/python3.9/site-packages/sihnpy/data/pad_conp_minimal/FU12/rest_run1

As you can see, if we look at variables in the object, we find the variables we fed to the function. We’re ready to move to the next step.

Hint

As you probably noticed here, I used the dictionary from the Prevent-AD data that we saw in the first section of this tutorial, which refers to a specific string.

When running the function on your own data, simply replace path_data_fp['BL00']['rest_run1'] and path_data_fp['FU12']['rest_run1'] by the paths to the functional connectivity matrices on your computer.

4. File and subject selection

We are all set up to start actually getting the functional connectivity matrices from the files. In this step, we are going to import all the names of the matrices we have, and match them to our list of participant IDs so that when we run the fingerprinting we can import individual matrices to be correlated.

So first, import the names of the matrices. This function does not require any argument.

files_m1, files_m2 = fp_mats.fetch_matrix_file_names()
print(files_m1) #Print the file names of the first modality
print(files_m2) #Print the file names of the second modality
['sub-1002928_ses-BL00_task-rest_run-1.tsv', 'sub-1016072_ses-BL00_task-rest_run-1.tsv', 'sub-1154932_ses-BL00_task-rest_run-1.tsv', 'sub-1000173_ses-BL00_task-rest_run-1.tsv', 'sub-1176949_ses-BL00_task-rest_run-1.tsv', 'sub-1004359_ses-BL00_task-rest_run-1.tsv', 'sub-1121981_ses-BL00_task-rest_run-1.tsv', 'sub-1284264_ses-BL00_task-rest_run-1.tsv', 'sub-1076159_ses-BL00_task-rest_run-1.tsv', 'sub-1322140_ses-BL00_task-rest_run-1.tsv', 'sub-1031654_ses-BL00_task-rest_run-1.tsv', 'sub-1263509_ses-BL00_task-rest_run-1.tsv', 'sub-1283278_ses-BL00_task-rest_run-1.tsv', 'sub-1177880_ses-BL00_task-rest_run-1.tsv', 'sub-1072774_ses-BL00_task-rest_run-1.tsv']
['sub-1177880_ses-FU12_task-rest_run-1.tsv', 'sub-1076159_ses-FU12_task-rest_run-1.tsv', 'sub-1154932_ses-FU12_task-rest_run-1.tsv', 'sub-1072774_ses-FU12_task-rest_run-1.tsv', 'sub-1176949_ses-FU12_task-rest_run-1.tsv', 'sub-1004359_ses-FU12_task-rest_run-1.tsv', 'sub-1000173_ses-FU12_task-rest_run-1.tsv', 'sub-1322140_ses-FU12_task-rest_run-1.tsv', 'sub-1284264_ses-FU12_task-rest_run-1.tsv', 'sub-1016072_ses-FU12_task-rest_run-1.tsv']

Ok great, we have the names of the files.

Hint

Do you notice something is different between the two lists? I’ll get back to that in a second.

Once we have these lists, we intersect them with our subject list. This will confirm how many participants we will keep in the end (i.e., how many people have both modalities). This function requires the two variables files_m1 and files_m2 that we created earlier (or however you named them).

sub_final, final_m1, final_m2 = fp_mats.subject_selection(files_m1=files_m1, files_m2=files_m2, verbose=True)
We have 15 subjects in the list.
We have in total 15 participants in modality 1 & 10 participants in modality 2.
A total of 10 have both modalities. Only these are used.
['sub-1000173', 'sub-1004359', 'sub-1016072', 'sub-1072774', 'sub-1076159', 'sub-1154932', 'sub-1176949', 'sub-1177880', 'sub-1284264', 'sub-1322140']

Now do you see the difference between the first and second list of files? There are participants missing! This can happen for a bunch of reasons, but in the case of the Prevent-AD data it can happen because participants simply didn’t come back for a follow-up, or because at the time of the data freeze they had not yet had their 12-month follow-up.

Thankfully, sihnpy is ready. Under the hood, it checks what is the intersection between the list of IDs we give it, the files of the first modality and the files of the second modality, and it finds the participants that come back across both modalities.
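The logic of that intersection can be sketched in a few lines of plain Python. This is an illustration under simple assumptions (an ID counts as present when it appears in a file name), not the code sihnpy runs, and the IDs and file names are made up.

```python
ids = ["sub-01", "sub-02", "sub-03", "sub-04"]
files_m1 = ["sub-01_ses-BL00.tsv", "sub-02_ses-BL00.tsv", "sub-03_ses-BL00.tsv"]
files_m2 = ["sub-03_ses-FU12.tsv", "sub-01_ses-FU12.tsv"]


def ids_with_file(ids, filenames):
    """Return the set of IDs that appear in at least one file name."""
    return {pid for pid in ids if any(pid in name for name in filenames)}


in_m1 = ids_with_file(ids, files_m1)
in_m2 = ids_with_file(ids, files_m2)
in_both = sorted(in_m1 & in_m2)  # set intersection, sorted for a stable order

print(f"{len(in_m1)} in modality 1, {len(in_m2)} in modality 2, {len(in_both)} in both")
print(in_both)
```

Here sub-02 is dropped because it has no follow-up file, and sub-04 is dropped because it has no file at all.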

Note that if you don’t want sihnpy to tell you all that information, you can turn it off by setting verbose to False.

Important

Also note that if you did not heed the warnings I made earlier when we prepared the data, it is possible that the number of participants sihnpy gives you will be incorrect, that sihnpy will not find any files, or that sihnpy will throw an error.

5. Fingerprinting

We are finally there! The best part of the fingerprinting module is… well… the fingerprinting function. The function is quite straightforward to run, but there is an important detail that needs to be added, which is, what nodes you want to select to run the fingerprinting.

As I mention in the fingerprinting introduction, an advantage of this script is that you can directly specify which part of the functional connectivity matrices you want to use. It’s a bit unpolished, but the general idea is that sihnpy accepts a list of integer values, where 0 is the first node and n-1, where n is the total number of nodes, is the last node (remember here that Python is 0-indexed, which is why the first node is 0).

In the Prevent-AD data, the connectivity matrices are based on the Schaefer atlas (400 nodes). More info on how the data was preprocessed here. For a first pass, let’s simply use all the nodes to compute the fingerprinting. To generate the list we simply need to write list(range(0,400)), which will generate a list of the 400 integers from 0 to 399 (the stop value, 400, is not included):

[0, 1, 2, 3... 397, 398, 399]

Let’s run the code!

similarity_matrix = fp_mats.fingerprint_mats(nodes_index_within=list(range(0,400)), norm=True, corr_type="Pearson", verbose=True)
Working on participant 1: sub-1000173
Working on participant 2: sub-1004359
Working on participant 3: sub-1016072
Working on participant 4: sub-1072774
Working on participant 5: sub-1076159
Working on participant 6: sub-1154932
Working on participant 7: sub-1176949
Working on participant 8: sub-1177880
Working on participant 9: sub-1284264
Working on participant 10: sub-1322140

The output of this function is a similarity matrix, where each cell represents a correlation between the functional connectivity patterns of two participants. The diagonal of the matrix holds the correlations within each participant (so between the two matrices of the same participant over time in this case) and the off-diagonal elements are the correlations between different participants. See below the matrix we get in our case.

similarity_matrix
array([[0.25927082, 0.14739691, 0.13083276, 0.13045085, 0.13452714,
        0.15191495, 0.15016591, 0.13866468, 0.13834814, 0.13457094],
       [0.14739691, 0.20955892, 0.12416142, 0.12060363, 0.1355009 ,
        0.13762913, 0.14931803, 0.13266136, 0.13901744, 0.13040703],
       [0.13083276, 0.12416142, 0.20563803, 0.13469094, 0.13910196,
        0.15017463, 0.16069675, 0.14068546, 0.136542  , 0.13224686],
       [0.13045085, 0.12060363, 0.13469094, 0.22875025, 0.13224248,
        0.14234101, 0.15750932, 0.13240777, 0.12790694, 0.12975919],
       [0.13452714, 0.1355009 , 0.13910196, 0.13224248, 0.22381991,
        0.14654018, 0.15156584, 0.13428944, 0.13406215, 0.13890551],
       [0.15191495, 0.13762913, 0.15017463, 0.14234101, 0.14654018,
        0.27093165, 0.16421014, 0.13795648, 0.13711563, 0.13150307],
       [0.15016591, 0.14931803, 0.16069675, 0.15750932, 0.15156584,
        0.16421014, 0.28130816, 0.14959845, 0.15347509, 0.15238234],
       [0.13866468, 0.13266136, 0.14068546, 0.13240777, 0.13428944,
        0.13795648, 0.14959845, 0.23916755, 0.15987529, 0.14608581],
       [0.13834814, 0.13901744, 0.136542  , 0.12790694, 0.13406215,
        0.13711563, 0.15347509, 0.15987529, 0.24795559, 0.14822253],
       [0.13457094, 0.13040703, 0.13224686, 0.12975919, 0.13890551,
        0.13150307, 0.15238234, 0.14608581, 0.14822253, 0.24015393]])

Ok so it’s not the most straightforward output to understand, so let me put it differently:

../_images/similarity_matrix_example.png

In the image above, you see the same matrix as above, but this time, with the participant IDs and some color. The diagonal (in green) represents the correlations within the same participants while the off-diagonal elements (in orange) represent the correlations between participants.

For example, the first green cell at the top left, cell B2, is the correlation between the functional connectivity matrix at baseline during rest and the functional connectivity matrix at the 12-month follow-up during rest for participant sub-1000173. The orange cell C2 is the correlation between the functional connectivity matrix of participant sub-1000173 at baseline during rest and the functional connectivity matrix of participant sub-1004359 at the 12-month follow-up during rest.
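To give a feel for what a single cell contains, here is a toy version of the computation in plain Python. It is a sketch under simplifying assumptions (tiny made-up 3-node matrices, no normalization, edges taken from the upper triangle) and not sihnpy's implementation.

```python
import math


def upper_triangle(matrix):
    """Flatten the edges above the diagonal of a square matrix into a list."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up connectivity matrices: participant A at both sessions, participant B at follow-up.
a_baseline = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.5], [0.1, 0.5, 1.0]]
a_followup = [[1.0, 0.7, 0.2], [0.7, 1.0, 0.6], [0.2, 0.6, 1.0]]
b_followup = [[1.0, 0.2, 0.6], [0.2, 1.0, 0.3], [0.6, 0.3, 1.0]]

# A "green" cell: the same participant across sessions.
within = pearson(upper_triangle(a_baseline), upper_triangle(a_followup))
# An "orange" cell: participant A at baseline against participant B at follow-up.
between = pearson(upper_triangle(a_baseline), upper_triangle(b_followup))

print(f"within: {within:.3f}, between: {between:.3f}")
```

In this toy data the within-participant correlation is much higher than the between-participant one, which is the pattern fingerprinting relies on.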

One thing you will probably notice is that the matrix is symmetric, meaning that the top part is a mirror of the bottom part. For instance, cell B3 is the same as cell C2.

I’ll get more into the specifics of how we compute the different metrics I mention in the fingerprinting introduction in the next section. But, in a nutshell, the hard step is done and you’ve effectively fingerprinted functional connectivity! Congrats!

Intermediate topic: Normalization and correlation options

Normalization

You probably noticed above that we have an argument called norm in the fingerprint_mats function. The idea is to normalize the functional connectivity matrices so that any issues due to the data can be resolved. In the original literature by Finn et al. [1], this is done by using the Fisher normalization, which was adapted in this script.

To date however, there is no consensus (to my knowledge) on whether this normalization should be done. In the Prevent-AD data included in sihnpy, there were only marginal differences between using normalization or not. By default, normalization is done. If you find out better ways to normalize the data, or if you have an answer for whether a consensus exists for this type of normalization, feel free to let me know by opening an issue!
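For reference, the textbook Fisher r-to-z transformation is the inverse hyperbolic tangent of the correlation. The snippet below shows only that textbook formula, on the assumption that this is what "Fisher normalization" refers to here; how sihnpy adapts it internally may differ.

```python
import math


def fisher_z(r):
    """Textbook Fisher r-to-z transformation of a correlation coefficient (-1 < r < 1)."""
    return math.atanh(r)


for r in (0.1, 0.5, 0.9):
    print(f"r = {r:.1f} -> z = {fisher_z(r):.3f}")
```

The transformation barely changes small correlations and stretches large ones, which is one reason normalization may make little difference when most values are modest.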

Correlation

In most of the literature to date, the correlation between functional connectivity matrices to get the fingerprint measures is done using Pearson correlations (or product-moment correlation). However, others have used different methods such as Spearman correlations [2]. Currently, Pearson correlation is the only implemented method in sihnpy, but other methods can be implemented quite easily as needed. Request it on GitHub by opening an issue!

Advanced topic: Selecting specific nodes (within- and between-network)

Selecting nodes - Within-network

An interesting point from multiple papers on fingerprinting is that the number of nodes, the functional network used and the localization of the nodes all seem to play a role in how well fingerprinting works [1], [3], [^St_Onge_2023].

As a first pass, I definitely encourage you to take all the nodes available like in the example. It simplifies the analyses a lot and gives you a good idea of whether fingerprinting works in your sample or not. Most of the time, using all nodes available should result in high fingerprinting capacity [1], [^St_Onge_2023].

In sihnpy, you can specify which nodes you want to use by using a list of integers to identify their position on a connectivity matrix. When extracting functional connectivity with nilearn using the Schaefer atlas, nilearn outputs labels for every node of the functional connectivity matrix. For instance, the visual network spans nodes 0 to 30 and 200 to 229 inclusively (again, Python is 0-indexed, so everything starts at 0 instead of 1, and the stop value given to range() is not included in the result; a bit confusing I know). If you wanted to use these nodes only, you could create a variable in advance, and use that variable in the fingerprint_mats method:

>>> visual_net_nodes = list(range(0, 31)) + list(range(200, 230))
>>> similarity_matrix = fp_mats.fingerprint_mats(nodes_index_within=visual_net_nodes, norm=True, corr_type="Pearson", verbose=True)

The code above will restrict the matrices to the 61 nodes in the visual network (31 from the first range and 30 from the second) by selecting specifically the nodes we identify. A fun thing with this is that it is really flexible, and should work across parcellations. One thing we did in the paper[^St_Onge_2023] was check whether a random collection of nodes also led to good fingerprinting. You can do something similar with just a few lines:

>>> import numpy as np
>>> np.random.seed(667) #Set a seed for reproducibility of the random array
>>> random_net_array = sorted(np.random.choice(400, size=91, replace=False).tolist()) #Give 91 distinct random node indices between 0 and 399 (sampling without replacement avoids duplicated nodes)
>>> similarity_matrix = fp_mats.fingerprint_mats(nodes_index_within=random_net_array, norm=True, corr_type="Pearson", verbose=True)

However, this method does come with the drawback that you need to make sure the nodes in the matrix are indeed organized in the order you are expecting and you need to make sure that the nodes you are selecting are the right ones. The script won’t be able to tell you whether you selected the nodes correctly or not.
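Some of these mistakes can at least be caught mechanically. The helper below is a hypothetical sketch (check_node_indices is not part of sihnpy) that checks a node list for duplicates and out-of-range values. It cannot tell you whether the nodes belong to the network you had in mind, so checking the atlas labels is still on you.

```python
import random


def check_node_indices(nodes, n_nodes):
    """Raise a ValueError if the node list has duplicates or out-of-range indices."""
    if len(set(nodes)) != len(nodes):
        raise ValueError("Some node indices are duplicated.")
    out_of_range = [node for node in nodes if not 0 <= node < n_nodes]
    if out_of_range:
        raise ValueError(f"Indices outside 0..{n_nodes - 1}: {out_of_range}")
    return True


# The visual network example from above.
visual_net_nodes = list(range(0, 31)) + list(range(200, 230))
check_node_indices(visual_net_nodes, n_nodes=400)
print(len(visual_net_nodes))

# A random set of 91 distinct nodes, drawn without replacement with the standard library.
rng = random.Random(667)
random_nodes = sorted(rng.sample(range(400), 91))
check_node_indices(random_nodes, n_nodes=400)
```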

Selecting between-network edges

Most of the literature before our paper[^St_Onge_2023] focused on using within-network edges in the matrices to do fingerprinting. But within-network edges only represent a small portion of edges; between-network edges (i.e., connections between different networks) are more numerous.

In our results, we found that fingerprinting using these edges works even better than within-network edges. And you can test that out for yourself too using the nodes_between_index argument in fingerprint_mats! But how do we select between-network edges? First, a quick, very simplified illustration of within- and between-network edges.

../_images/between_network.png

In this example, we imagine three brain networks: the visual, the default mode (DMN) and the frontoparietal (FP) networks, each containing a single node (i.e., 1 row per network). This is a very much simplified example, but really just to illustrate the idea.

Each square in the matrix is representing an edge (i.e., a link between two nodes). When I write about within-network edges, I refer to the blue squares. For example, the first blue square at the top left corner represents the link within the visual network nodes. However, if I want to talk about between-network edges, I mean all the other colored squares that aren’t blue. For instance, the between-network edges between the visual network and the default-mode and frontoparietal networks are illustrated in orange, in the top row. It represents the link between the visual and the two other networks.

Once you understand that, it becomes easy to adapt it in sihnpy for your own needs: sihnpy simply needs to know the indices of the rows of the matrix we want (nodes_index_within, as in the code examples above) and the indices of the columns we need (nodes_between_index).

When we give sihnpy only the nodes_index_within argument, it assumes that we want the same rows and columns (i.e., within-network), and that the sub-matrix we are selecting is symmetric. In numpy notation, this is equivalent to matrix[nodes_index_within][:,nodes_index_within]. In our toy example, it would grab only the blue squares. However, in a real-life example, with connectivity data, it will discard the diagonal and the lower triangle before computing the fingerprinting.

On the other hand, if we give the nodes_between_index argument, sihnpy will grab the specific sub-matrix we specify. In numpy notation, the two arguments I mention roughly translate to matrix[nodes_index_within][:,nodes_between_index]. In our toy example, this should only grab the orange squares if we are interested in the visual network.

Overall, this application of the method allows for a lot of flexibility, because you can select virtually any between-network edges you want. For instance, you could specifically request edges between the visual and default-mode network only to do the fingerprinting.
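The difference between the two selections is easier to see on a tiny matrix. The sketch below uses plain Python lists and made-up values; select() mimics the row-then-column indexing described above and is not a sihnpy function.

```python
# Toy 4-node connectivity matrix: nodes 0-1 are "visual", nodes 2-3 are "DMN" (made-up values).
matrix = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.4],
    [0.2, 0.3, 1.0, 0.7],
    [0.1, 0.4, 0.7, 1.0],
]
visual, dmn = [0, 1], [2, 3]


def select(matrix, rows, cols):
    """Keep the given rows, then the given columns (like matrix[rows][:, cols] in numpy)."""
    return [[matrix[r][c] for c in cols] for r in rows]


# Within-network: same rows and columns, then keep only the upper triangle (no diagonal).
within_block = select(matrix, visual, visual)
within_edges = [
    within_block[i][j]
    for i in range(len(visual))
    for j in range(i + 1, len(visual))
]

# Between-network: rows from one network, columns from the other; every cell is a distinct edge.
between_block = select(matrix, visual, dmn)
between_edges = [value for row in between_block for value in row]

print(within_edges)
print(between_edges)
```

With two nodes per network there is a single within-network edge for the visual network but four edges between the visual network and the DMN, which illustrates why between-network edges are more numerous.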

Warning

With great flexibility comes great danger.

sihnpy won’t be able to tell you whether the nodes you selected are the right ones. So if you forget a node or include a node that shouldn’t have been included, sihnpy will still proceed ahead and it will affect your results. This is further complicated by the 0-indexed Python notation (i.e., the “5th” column is actually index 4), which means that human error is also likely, particularly for those not used to Python.

When using between-network edges, sihnpy also doesn’t remove duplicated edges. This is because, technically, there shouldn’t be any duplicated edges since we aren’t looking at within-network edges. If your indices for rows (nodes_index_within) and for columns (nodes_between_index) overlap, you will include duplicate edges in your analysis, which will affect your results.
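A quick way to guard against that last point is to check the two index lists for shared values before running anything. The helper below is a hypothetical sketch and not part of sihnpy.

```python
def overlapping_nodes(rows, cols):
    """Return the sorted node indices that appear in both the row and column lists."""
    return sorted(set(rows) & set(cols))


visual = list(range(0, 31)) + list(range(200, 230))
other = list(range(30, 60))  # deliberately overlaps with the visual list at index 30

overlap = overlapping_nodes(visual, other)
print(overlap)  # prints [30]
```

An empty list means the rows and columns do not share any node, so no edge is counted twice.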

6. Fingerprinting metrics

Almost all done! The next step is to extract the fingerprinting metrics. Thankfully, sihnpy comes with a function that does it for you: no need to do anything! Isn’t it great?

The function only takes two arguments: the similarity matrix you just computed and a “name”, and it returns a pandas.DataFrame with the relevant information computed. The “name” argument is simply a string that will be added to the name of the columns and can be anything you like: it’s really just for you to keep track of what you fingerprinted.

coef_data = fp_mats.fp_metrics_calc(similar_matrix=similarity_matrix, name='tutorial')
coef_data
si_tutorial oi_tutorial fia_tutorial di_tutorial
ID
sub-1000173 0.259271 0.139652 1.0 0.119618
sub-1004359 0.209559 0.135188 1.0 0.074370
sub-1016072 0.205638 0.138793 1.0 0.066845
sub-1072774 0.228750 0.134212 1.0 0.094538
sub-1076159 0.223820 0.138526 1.0 0.085294
sub-1154932 0.270932 0.144376 1.0 0.126556
sub-1176949 0.281308 0.154325 1.0 0.126984
sub-1177880 0.239168 0.141358 1.0 0.097809
sub-1284264 0.247956 0.141618 1.0 0.106337
sub-1322140 0.240154 0.138231 1.0 0.101922

In order, the script outputs the self-identifiability (si), the others-identifiability (oi), the fingerprint identification accuracy (fia) and the differential identifiability (di).

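si and oi are correlation coefficients, so they can in principle range between -1 and 1 (in the example above they are all positive). di is the difference between si and oi, so it is positive when a participant is more similar to themselves than to the others, and it can be negative otherwise. fia will be either 0 or 1, where 1 represents an accurate identification of the participant. oi is the average of all between-individual correlations for a specific individual (i.e., on average, how similar is a participant to the rest of the cohort).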
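The numbers in the table above are consistent with si being the diagonal value, oi the mean of the off-diagonal values, and di their difference. Here is a small sketch of that arithmetic on a made-up 3-participant matrix. Two points are assumptions on my part rather than something confirmed here: that oi is averaged over the participant's row, and that fia is 1 when the diagonal value is the largest value in that row.

```python
# Made-up similarity matrix for three participants.
similarity = [
    [0.90, 0.20, 0.40],
    [0.20, 0.80, 0.30],
    [0.40, 0.30, 0.25],
]


def fingerprint_metrics(similarity):
    """Compute si, oi, fia and di for each row of a square similarity matrix."""
    rows = []
    for i, row in enumerate(similarity):
        si = row[i]  # within-participant correlation (diagonal)
        others = [value for j, value in enumerate(row) if j != i]
        oi = sum(others) / len(others)  # mean between-participant correlation
        fia = 1 if si > max(others) else 0  # assumed rule: the diagonal must be the row maximum
        rows.append({"si": si, "oi": oi, "fia": fia, "di": si - oi})
    return rows


metrics = fingerprint_metrics(similarity)
for participant, values in enumerate(metrics):
    print(participant, values)
```

The third participant is a deliberate failure case: they look more like the first participant than like themselves, so fia is 0 and di is negative.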

Tip

For the “name” variable, you can use your own naming scheme. What I found most effective during the writing of the package for the paper was to use the following (here only used as an illustration if it can help you out):

mod1_mod2_edges_parcellation_network_correlationtype

  • mod1: name of the first modality/session used for fingerprinting

  • mod2: name of the second modality/session used for fingerprinting

  • edges: within or between network edges

  • parcellation: name of the atlas used for fingerprinting (e.g., Schaefer)

  • network: name of the network used (e.g., default-mode)

  • correlationtype: whether I used partial correlation (pcor) or Pearson correlation (corr) to generate the functional connectivity matrices
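If you run many combinations, building the name from its parts avoids typos. This is only a convenience sketch: build_name is a hypothetical helper, not a sihnpy function, and the values passed to it are examples.

```python
def build_name(mod1, mod2, edges, parcellation, network, correlationtype):
    """Join the pieces of the naming scheme with underscores (spaces become hyphens)."""
    parts = [mod1, mod2, edges, parcellation, network, correlationtype]
    return "_".join(part.replace(" ", "-") for part in parts)


name = build_name("BL00rest", "FU12rest", "within", "Schaefer", "default mode", "corr")
print(name)  # prints BL00rest_FU12rest_within_Schaefer_default-mode_corr
```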

7. Exporting the results

Ok so now we have all the results we need, but the results still only live within Python: we need to export them so we can use them for our awesome paper. Again, sihnpy does this very simply with a dedicated function. It requires you to provide 1) the full path where the folder will be created, 2) the name of the variable with the computed coefficients, 3) the name of the variable with the similarity matrix and 4) the “name” you want the folders and files to bear.

fp_mats.fp_mat_export("~/Desktop/test_output_fp", coef_data=coef_data, similar_matrix=similarity_matrix, name='tutorial', out_full=True, dir_struct=True)

Doing this, you will now have a directory within test_output_fp called tutorial (i.e., the name you are giving) looking like so:

tutorial
|___fp_metrics_tutorial.csv
|___similarity_matrices
    |___similarity_matrix_tutorial.csv
|___subject_list
    |___subject_list_tutorial.csv

The fp_metrics_tutorial.csv file is the .csv document containing the fingerprint metrics. In the subfolder similarity_matrices you will see the similarity matrix (i.e., the correlations between each pair of participants) and the subfolder subject_list contains the list of participants included in the fingerprinting run.

Most of the time, you will likely only need the .csv file with the computed metrics. However, if you want to check the correlation between participants (e.g., if you have twins or siblings), you can use the similarity matrix to zoom in on specific correlations between people. To facilitate manipulations on the similarity matrix, you will notice that it bears no subject IDs. This is where the subject_list file comes into play: it holds the subject IDs, with no column header, in the same order as the similarity matrix.
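Looking up one specific pair then comes down to finding each ID's position in the subject list. The sketch below uses in-memory stand-ins for the two exported files and assumes both are comma-separated with no header row and no index column; check your own files before relying on that.

```python
import csv
import io

# Stand-ins for similarity_matrix_tutorial.csv and subject_list_tutorial.csv (made-up values).
matrix_file = io.StringIO("0.26,0.15,0.13\n0.15,0.21,0.12\n0.13,0.12,0.21\n")
subjects_file = io.StringIO("sub-01\nsub-02\nsub-03\n")

subjects = [row[0] for row in csv.reader(subjects_file) if row]
matrix = [[float(value) for value in row] for row in csv.reader(matrix_file) if row]


def correlation_between(id_a, id_b):
    """Look up the similarity between two participants by their IDs."""
    return matrix[subjects.index(id_a)][subjects.index(id_b)]


print(correlation_between("sub-01", "sub-03"))  # prints 0.13
```

With real exports, replace the two io.StringIO objects with open() calls on the files created in the previous step.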

Tip

By default, sihnpy will create sub-folders for the similarity matrices and subject_list. This is somewhat an artefact from how the code was built before, but also I do like the organization in separate folders. That said, sihnpy is also flexible. You can set the argument dir_struct to False to output the similarity_matrix and the subject_list in the same directory.

If you decide you don’t want the similarity_matrix or the subject_list, you can ask sihnpy to not output it by setting out_full to False.

Conclusion

You made it through the whole fingerprinting tutorial! (or well, you skipped ahead to here) I hope I was able to make the steps clear for you and that you enjoyed following along. If things weren’t clear in the documentation, please submit an issue on GitHub.

Don’t forget to cite the package and the paper describing this method if you end up using it in one of your papers!

If you want to learn more, I also discuss more advanced topics ahead.

tl;dr

Got bored during the tutorial? You already finished the tutorial and just want a quick reminder of the main functions you need? Or you just want to bash ahead with the code without reading? I got you. Here’s a condensed form of the code:

from sihnpy.datasets import pad_fp_input #Import data for fingerprinting
from sihnpy import fingerprinting

#Preparation
id_list, path_participant_list, path_data_fp = pad_fp_input() #Basic info, path to the basic info and path to the matrices. Start here if you want to use the Prevent-AD data
list_of_ids = fingerprinting.import_fingerprint_ids(path_participant_list) #Import the list of IDs in `sihnpy`. If you have your own data, you start here and replace `path_participant_list` by the path to your file with the IDs as the first column

#Fingerprinting initialization
fp_mats = fingerprinting.FingerprintMats(list_of_ids, path_data_fp['BL00']['rest_run1'], path_data_fp['FU12']['rest_run1']) #Replace second and third argument with your own paths if using your own data
files_m1, files_m2 = fp_mats.fetch_matrix_file_names() #Get name of files for both modalities
sub_final, final_m1, final_m2 = fp_mats.subject_selection(files_m1=files_m1, files_m2=files_m2, verbose=True) #Extract participant IDs and figure out who has data in both modalities

#Running the fingerprinting and computing the metrics
similarity_matrix = fp_mats.fingerprint_mats(nodes_index_within=list(range(0,400)), norm=True, corr_type="Pearson", verbose=True) #Takes a long time if running a lot of participants
coef_data = fp_mats.fp_metrics_calc(similar_matrix=similarity_matrix, name='tutorial') #Replace the name by whatever you prefer

#Export the data
fp_mats.fp_mat_export("~/Desktop/test_output_fp", coef_data=coef_data, similar_matrix=similarity_matrix, name='tutorial', out_full=True, dir_struct=True) #Replace the path by whatever path you prefer on your local computer.

Advanced topic: Command line-based script for fingerprinting and high performance computing

The script presented above is great when you have a few networks or modalities to run. But what if you have a lot more? How do you organize it? In the coming weeks I will prepare a tutorial on how to write a command line script to be able to run the fingerprinting for many different modalities and options.

References

List of relevant references for this script. More info and key papers on fingerprinting are available in the intro.


1. Finn et al. (2015). Nat Neuro. 10.1038/nn.4135

2. Ousdal et al. (2020). Hum Brain Mapp. 10.1002/hbm.24833

3. Amico et al. (2018). Sci Reports. 10.1038/s41598-018-25089-1