Sliding-window analysis

Let’s get down to it! This module is probably the easiest to use in sihnpy so you will be on your way very quickly.

Already read the tutorial before and you just want the code (a.k.a. too long; didn’t read)? Head on out to the tl;dr section.

Practice data

In other sihnpy modules, real data from a subset of the PREVENT-AD Open Dataset is used. While we could probably use the continuous handedness scale to create windows, there is not a lot of diversity in the numbers (so a lot of participants that are fully right-handed would be clumped together in most windows).

A variable that usually lends itself well to the sliding-window approach is age (and well… this is the variable that was originally used for this methodology). Age is not available in the PREVENT-AD Open Dataset as it is restricted information. Instead, I opted to simulate age data for the PREVENT-AD participants. This is similar to the data available for the Spatial Extent module. I will refer you to that section for more detailed information on how the age data was simulated, as the data for the sliding-window was created in a very similar manner (though for the sliding-window we create a single Gaussian distribution).

Just know that the simulated age data matches the mean, standard deviation and inclusion criteria from the PREVENT-AD: mean age of 65 years, standard deviation of 5 years and minimum age to be included is 55 years.
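
If you are curious what simulating such a variable could look like, here is a minimal sketch using NumPy. The mean, standard deviation and minimum age come from the text above; the seed, the sample size of 308 and the clipping step are my assumptions for illustration, and this is not necessarily how the sihnpy practice data was generated.

```python
import numpy as np

# Values from the text: mean age 65, SD 5, minimum age 55.
# Assumptions: arbitrary seed, 308 draws, and clipping to enforce the minimum.
rng = np.random.default_rng(seed=42)
simulated_age = rng.normal(loc=65, scale=5, size=308)  # a single Gaussian
simulated_age = np.clip(simulated_age, 55, None)  # nobody younger than 55

print(simulated_age.min(), simulated_age.mean().round(1))
```

Note that clipping pulls the lowest values up to exactly 55, so the resulting mean sits slightly above the mean requested from the Gaussian.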

Warning

Just like for the spatial extent module, sihnpy provides practice data to use the sliding-window module. While PREVENT-AD participants are used, the data is simulated. As a general rule for sihnpy, and especially for this module, only use the data provided to help you practice using the module, not to conduct or publish actual research.

Deriving the sliding windows

1. Preparing the data

To run the sliding-window module, you will need two things:

  • A spreadsheet with the data, with the index set as the participants’ IDs

  • The name of the variable we want to slide along

If you already have your data ready, you can skip ahead to the next section.

As mentioned before, sihnpy has data available for you to use to practice:

from sihnpy.datasets import pad_sw_input

pad_age_data = pad_sw_input()
pad_age_data
sex test_language handedness_score handedness_interpretation age
participant_id
sub-5458966 Male French 80.00 Right-handed 65.892657
sub-2424540 Female French 100.00 Right-handed 65.543026
sub-7855613 Female French 90.00 Right-handed 59.054610
sub-3137570 Male French 90.00 Right-handed 65.653705
sub-9650197 Female French 100.00 Right-handed 68.059713
... ... ... ... ... ...
sub-5336241 Female French -30.00 Ambidextrous 70.342373
sub-1002928 Female French 100.00 Right-handed 68.658707
sub-1283278 Female English 80.00 Right-handed 61.154117
sub-9101699 Male French 57.89 Right-handed 62.495973
sub-6261459 Male French 100.00 Right-handed 64.487385

308 rows × 5 columns

We have our dataset, which is basically the demographic information for our participants included in the PREVENT-AD Open Dataset with the simulated age. We’re ready to go!

Fix: Dataframe index

While importing the data from sihnpy is easy and already in the right format, it is critical that the data used in the rest of the functions have the final index you want to use. Contrary to R, pandas uses an Index method, where each row is referred to by a label. sihnpy will sort participants and output the results of each window with these labels.

In sihnpy in particular, we output the list of participants in each window (i.e., we extract and output the index of the dataframe) and we output the data of each window (i.e., we extract and output the data of each window based on their index). You can easily set an index by doing the following procedure when importing your data in Python:

import pandas as pd

data = pd.read_csv("/path/to/file.csv", index_col=0) #The number is the position (integer) of the column to be used as index

Otherwise, if you already have the data imported in Python, you can manually force the index to the variable you want:

import pandas as pd

data = pd.read_csv("/path/to/file.csv")
data_indexed = data.set_index('named_column') #Where `named_column` is the name of the column to use as index
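
To convince yourself that both approaches end up in the same place, here is a small self-contained sketch. The two-participant CSV is made up and held in memory (so no file path is needed), and the uniqueness check at the end is a sanity check I would suggest, not something sihnpy is documented to require.

```python
import io

import pandas as pd

# Hypothetical two-participant file, kept in memory instead of on disk
csv_text = "participant_id,age\nsub-01,61.2\nsub-02,58.7\n"

# Option 1: set the index while reading the file
data_a = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: read first, then set the index using the column name
data_b = pd.read_csv(io.StringIO(csv_text)).set_index("participant_id")

same_result = data_a.equals(data_b)  # do both routes give the same dataframe?
unique_ids = data_a.index.is_unique  # does each participant appear only once?
print(same_result, unique_ids)
```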

2. Calculating the number of windows

The first step is to estimate how many windows we want to create. To ensure we don’t have empty windows, sihnpy will use your desired window and step size to compute the ideal number of windows. For a refresher on what the window size and the step size are, you can refer to the introduction to the sliding-window.

For example, let’s say I would like a window size of 100 and step size of 20, then I would simply need to tell sihnpy:

from sihnpy import sliding_window as sw

n_windows_sane = sw.bins(data=pad_age_data, var='age', w_size=100, s_size=20)
Collapse is False: the last window may have a smaller number of participants
Number of windows: 12

You aren’t limited to how you want to divide the windows. So you can even use odd numbers if you like (not my preference cos I don’t like odd numbers… but if it fits your research design, great!):

n_windows_insane = sw.bins(data=pad_age_data, var='age', w_size=66, s_size=13)
Collapse is False: the last window may have a smaller number of participants
Number of windows: 20

The crucial information is really only to tell sihnpy which pandas.DataFrame it should be using (pad_age_data in our case) and the name of the variable it should use for sorting and figuring out the windows.

Warning

Missing values are not currently tolerated in the sliding-window module and will throw errors. Make sure that there are no missing values on your sorting variable. Future versions will allow users to choose whether to throw errors, put missing values first or last.
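
A quick way to check for missing values before calling the module is plain pandas. The sketch below uses a made-up three-participant dataframe; dropping incomplete participants is only one possible way to handle them, so adapt it to your own study.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age
data = pd.DataFrame(
    {"age": [61.2, np.nan, 58.7]},
    index=pd.Index(["sub-01", "sub-02", "sub-03"], name="participant_id"),
)

# Count missing values on the sorting variable
n_missing = data["age"].isna().sum()
print(f"Missing values on 'age': {n_missing}")

# Keep only participants with a value on the sorting variable
data_complete = data.dropna(subset=["age"])
```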

Advanced topic: Collapse argument

You might have noticed above that there is a message indicating that “Collapse is False”. What is that?

First, we need to understand how sihnpy computed the number of windows. The formula is as follows:

nwindows = ceiling((nsub - wsize) / stssize)

Where nwindows is the resulting number of windows, nsub is the number of participants in the whole sample, wsize is the size of the windows we want and stssize is the step size we want. In other words, the formula subtracts the window size from the total number of participants, and divides the result by the step size. Because the numbers we choose can result in divisions with remainders, we force a ceiling rounding (rounding up) as we can’t have fractions of windows. This is also where the collapse argument comes into play. As the parameters we choose will almost never divide the sample so that every window has the same number of participants, sihnpy lets the user choose how to deal with the leftover participants.

In the default scenario (collapse=False), sihnpy will assume that we prefer having more windows, but the last window will have fewer participants. In this case, sihnpy will automatically add one more window to the count, in which the remaining participants will be placed. In the other scenario (collapse=True), sihnpy will assume that we prefer having fewer windows, but the last window will have more participants. In this case, sihnpy will not add any window to the count, and the remaining participants will be added to the last window.

It is possible that you choose a step size and window size that would ensure that there are no extra participants at the end. In such a case, please set collapse=True, as otherwise sihnpy will create an extraneous empty window.

In the end, choosing to collapse or not falls to the user, and I don’t think there is a good or a bad choice. In our recent publication [1], we opted for collapse=False.
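
To make the arithmetic concrete, here is a minimal sketch of the formula and the collapse rule as they are described in this section. The function name n_windows_sketch is hypothetical, and this is my reading of the description, not the actual sihnpy source code.

```python
import math


def n_windows_sketch(n_sub, w_size, s_size, collapse=False):
    """Hypothetical helper reproducing the formula described in the text."""
    # Ceiling of (participants - window size) / step size
    n_windows = math.ceil((n_sub - w_size) / s_size)
    # Without collapsing, one extra (smaller) window holds the leftovers
    if not collapse:
        n_windows += 1
    return n_windows


print(n_windows_sketch(308, 100, 20))  # 12
print(n_windows_sketch(308, 66, 13))  # 20
print(n_windows_sketch(308, 100, 20, collapse=True))  # 11
```

With the 308 practice participants, the first two calls give the same numbers of windows as the sihnpy output shown earlier (12 and 20).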

3. Building the windows

Once the number of windows has been determined, we need to “build” the windows. In other words, we need to split the participants into their respective windows. In sihnpy, this is the step where the sliding-window is applied to the data. Specifically, we use pandas’s iloc to grab participants while accounting for our window and step sizes. For all the windows except the last one, this is determined by the following equations:

Starting Index: stssize * (current window - 1)
Ending Index: wsize + stssize * (current window - 1)

Let’s imagine a window size of 100 and a step size of 20. The first window would be:

Starting Index: = 20 * (1 - 1) = 0
Ending Index: = 100 + 20 * (1 - 1) = 100

So the starting index for the first window would be 0 and the ending index would be 100. Let’s repeat this with window 5 just to demonstrate.

Starting Index: = 20 * (5 - 1) = 80
Ending Index: = 100 + 20 * (5 - 1) = 180

As you can see, we moved up our sliding-window so that it now starts at index 80, and ends at index 180.
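
The two equations above translate almost directly into Python. The sketch below uses a hypothetical helper (window_bounds) and a made-up 10-participant dataframe; sorting in ascending order before slicing is my assumption, based on the youngest participants landing in the first window.

```python
import pandas as pd


def window_bounds(current_window, w_size, s_size):
    """Hypothetical helper: start and end index of a (non-final) window."""
    start = s_size * (current_window - 1)
    end = w_size + s_size * (current_window - 1)
    return start, end


print(window_bounds(1, w_size=100, s_size=20))  # (0, 100)
print(window_bounds(5, w_size=100, s_size=20))  # (80, 180)

# Toy example: 10 made-up participants, window size 4, step size 2
toy = pd.DataFrame(
    {"age": [70, 55, 62, 58, 66, 60, 64, 57, 68, 59]},
    index=[f"sub-{i:02d}" for i in range(10)],
)
toy_sorted = toy.sort_values("age")  # sort along the sliding variable
start, end = window_bounds(2, w_size=4, s_size=2)
window_2 = toy_sorted.iloc[start:end]  # iloc excludes the ending index
print(window_2)
```

Because iloc excludes the ending index, a window from 0 to 100 contains exactly 100 participants (positions 0 to 99).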

Thankfully, you don’t need to compute any of that: sihnpy will do it for you!

w_store = sw.build_windows(data=pad_age_data, var='age', w_size=100, s_size=20, n_bin=n_windows_sane)
Creating bin 1
Creating bin 2
Creating bin 3
Creating bin 4
Creating bin 5
Creating bin 6
Creating bin 7
Creating bin 8
Creating bin 9
Creating bin 10
Creating bin 11
Creating bin 12

And like that, it’s done! sihnpy stored all of our windows in a dictionary. You can access any of the windows with the following naming convention:

ww{w_size}_sts{s_size}_w{n_window}

Note that windows 1-9 have an extra 0 in their name (e.g., ww100_sts20_w08), so that all file names have the same number of characters once they are output.
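
If you want to build these keys programmatically (say, to loop over the windows), an f-string with zero-padding does the trick. The helper name window_key is hypothetical, and padding to exactly two digits is an assumption based on the example names shown here.

```python
def window_key(w_size, s_size, n_window):
    """Hypothetical helper building a dictionary key for a given window."""
    # :02d pads single-digit window numbers with a leading zero
    return f"ww{w_size}_sts{s_size}_w{n_window:02d}"


print(window_key(100, 20, 8))  # ww100_sts20_w08
print(window_key(100, 20, 12))  # ww100_sts20_w12
```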

Let’s take a look at the first window to see what it contains:

w_store['ww100_sts20_w01']
participant_id
sub-3165520
sub-4396879
sub-9249727
sub-9327302
sub-4498598
...
sub-2757160
sub-9865768
sub-6967785
sub-1176949
sub-7755697

100 rows × 0 columns

Great! We see a pandas.DataFrame with our participants and with 0 columns. This is normal: the columns are removed to simplify merging data later on and to easily output the list of participants as needed. More on that in the section on data export.

For fun, let’s check that the windows are “sliding” properly. Let’s take the last participant in our window: sub-7755697.

w_store['ww100_sts20_w01'].index.get_loc('sub-7755697')
99

In the first window, they are at the last position, i.e., position 100. It shows up as 99 but it is actually the 100th participant; this is normal because Python is 0-indexed (meaning that the count starts at 0, not 1). If the sliding window worked properly, the position of the participant will slide by 20 indices (so they should be at position 80; 79 in Python 0-index) in the next window. This should be the case for all subsequent windows until the participant is no longer included (which should happen in window 6):

print(w_store['ww100_sts20_w02'].index.get_loc('sub-7755697'))
print(w_store['ww100_sts20_w03'].index.get_loc('sub-7755697'))
print(w_store['ww100_sts20_w04'].index.get_loc('sub-7755697'))
print(w_store['ww100_sts20_w05'].index.get_loc('sub-7755697'))
try:
    w_store['ww100_sts20_w06'].index.get_loc('sub-7755697')
except KeyError:
    print("Participant not in this window")
79
59
39
19
Participant not in this window

That’s right on the money! The algorithm is working properly.
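
You can also predict these positions by hand: the position inside a window is the participant's position in the whole sorted sample minus the starting index of that window. Here is a small sketch of that arithmetic, assuming the participant sits at overall position 99 (0-indexed), as we saw in the first window.

```python
global_position = 99  # 0-indexed position of sub-7755697 in the sorted sample
w_size, s_size = 100, 20

expected_positions = []
for window in range(1, 7):
    start = s_size * (window - 1)  # starting index of this window
    local_position = global_position - start
    if 0 <= local_position < w_size:
        expected_positions.append(local_position)
    else:
        expected_positions.append(None)  # participant is not in this window

print(expected_positions)  # [99, 79, 59, 39, 19, None]
```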

4. Reconstructing data in each window

As we saw in the previous step, the build_windows function returns only an index with the participant IDs. Now we need to associate the data of each participant in each window, so each window has its own spreadsheet with its own data. sihnpy only needs the dictionary in which we stored the IDs to create the windows and the original dataset.

w_data = sw.data_by_window(w_store=w_store, data=pad_age_data)
Reconstructing data for window ww100_sts20_w01
Reconstructing data for window ww100_sts20_w02
Reconstructing data for window ww100_sts20_w03
Reconstructing data for window ww100_sts20_w04
Reconstructing data for window ww100_sts20_w05
Reconstructing data for window ww100_sts20_w06
Reconstructing data for window ww100_sts20_w07
Reconstructing data for window ww100_sts20_w08
Reconstructing data for window ww100_sts20_w09
Reconstructing data for window ww100_sts20_w10
Reconstructing data for window ww100_sts20_w11
Reconstructing data for window ww100_sts20_w12

This creates a new dictionary, where each entry is a dataframe. For instance, let’s look again at the first window:

w_data['ww100_sts20_w01']
sex test_language handedness_score handedness_interpretation age
participant_id
sub-3165520 Male English 80.0 Right-handed 55.000000
sub-4396879 Male French 100.0 Right-handed 55.000000
sub-9249727 Female French 100.0 Right-handed 55.000000
sub-9327302 Female French 100.0 Right-handed 55.000000
sub-4498598 Female French 100.0 Right-handed 55.000000
... ... ... ... ... ...
sub-2757160 Female French 90.0 Right-handed 62.998585
sub-9865768 Female French 50.0 Right-handed 63.005971
sub-6967785 Male French 90.0 Right-handed 63.092224
sub-1176949 Female French 80.0 Right-handed 63.194265
sub-7755697 Female French 100.0 Right-handed 63.285955

100 rows × 5 columns

We get the full data for the 100 participants included in this window. We also see that sub-7755697 is still the last participant, keeping the same order we saw before. We can verify this if we are paranoid:

print(w_data['ww100_sts20_w02'].index.get_loc('sub-7755697'))
print(w_data['ww100_sts20_w03'].index.get_loc('sub-7755697'))
print(w_data['ww100_sts20_w04'].index.get_loc('sub-7755697'))
print(w_data['ww100_sts20_w05'].index.get_loc('sub-7755697'))
try:
    w_data['ww100_sts20_w06'].index.get_loc('sub-7755697')
except KeyError:
    print("Participant not in this window")
79
59
39
19
Participant not in this window

This gives the same result as before, so we’re all good.
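
For intuition, the reconstruction in this step can be mimicked with plain pandas by selecting rows of the original dataset with the index stored for each window. The sketch below uses made-up data and hypothetical names (w_store_toy, w_data_toy); it illustrates the idea and is not the sihnpy implementation.

```python
import pandas as pd

# Made-up dataset with four participants
data = pd.DataFrame(
    {"age": [55.0, 58.0, 61.0, 64.0]},
    index=pd.Index(["sub-01", "sub-02", "sub-03", "sub-04"], name="participant_id"),
)

# Windows stored as dataframes with an index but no columns
w_store_toy = {
    "w01": data.iloc[0:3][[]],
    "w02": data.iloc[1:4][[]],
}

# Use each stored index to pull the matching rows from the full dataset
w_data_toy = {name: data.loc[ids.index] for name, ids in w_store_toy.items()}
print(w_data_toy["w02"])
```

Selecting with .loc keeps the participants in the order of the stored index, which is why the positions stay the same after reconstruction.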

5. Summary statistics for each window

Ok so sihnpy split our participants into windows and slid across the age variable. That’s great. But you might wonder: what is the actual age in each of the windows? You could easily compute this for each dataframe individually, but it is kind of a pain. Thankfully, I am quite a lazy programmer and I didn’t want to have to do that every time, so I integrated a function that does this in sihnpy. You just need to feed it the dictionary we just computed as well as the name of the variable you want statistics on.

w_summary = sw.sum_by_window(w_data=w_data, var='age')
w_summary
mean_age median_age sd_age min_age max_age
window
ww100_sts20_w01 59.457366 59.790299 2.556229 55.000000 63.285955
ww100_sts20_w02 61.072791 61.320336 2.009204 57.227661 63.805879
ww100_sts20_w03 62.307265 62.682785 1.726616 58.661927 64.730185
ww100_sts20_w04 63.415577 63.598469 1.403649 60.456187 65.634214
ww100_sts20_w05 64.369545 64.364170 1.226137 62.049926 66.424965
ww100_sts20_w06 65.219789 65.186617 1.220798 63.318288 67.369632
ww100_sts20_w07 66.092823 66.022348 1.289132 63.870281 68.415997
ww100_sts20_w08 66.984937 66.885311 1.323495 64.744182 69.223976
ww100_sts20_w09 67.928772 67.910107 1.402626 65.653705 70.425195
ww100_sts20_w10 68.989369 68.841317 1.595768 66.443800 72.391602
ww100_sts20_w11 70.330097 70.028192 2.061643 67.493341 74.592965
ww100_sts20_w12 71.461680 70.743220 2.510278 68.420610 79.522548

sihnpy will output this dataframe, where each row is a window, and each column is a descriptive statistic. Easy-peasy.
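
If you ever need to reproduce this kind of table yourself, a loop over the dictionary with a few pandas methods is enough. The sketch below uses two made-up windows and mirrors the column names shown above; note that pandas computes the sample standard deviation by default, and I am only assuming that this matches what sihnpy reports.

```python
import pandas as pd

# Two made-up windows with three participants each
w_data_toy = {
    "w01": pd.DataFrame({"age": [55.0, 57.0, 59.0]}),
    "w02": pd.DataFrame({"age": [57.0, 59.0, 64.0]}),
}
var = "age"

# One dictionary of statistics per window
rows = {
    name: {
        f"mean_{var}": df[var].mean(),
        f"median_{var}": df[var].median(),
        f"sd_{var}": df[var].std(),  # sample SD (ddof=1), the pandas default
        f"min_{var}": df[var].min(),
        f"max_{var}": df[var].max(),
    }
    for name, df in w_data_toy.items()
}

# One row per window, one column per statistic
summary_toy = pd.DataFrame.from_dict(rows, orient="index")
summary_toy.index.name = "window"
print(summary_toy)
```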

Advanced topic: Summary data on variables not used to create the windows

Depending on your preferences, research question and topic, you might be interested in finding out more information for each window (e.g., report all demographics for each window in your demographics table). You can do this directly in sihnpy as long as the variables are continuous.

w_summary_hand = sw.sum_by_window(w_data=w_data, var='handedness_score')

The goal of the function originally was really to get the information on the variable used to sort people into their windows, as it is most often what people reading your research will want to know. This is why sihnpy doesn’t offer support for binary variables (e.g., how many males/females in each window). That said, the function sum_by_window is really only a simple loop wrapping pandas functions. You could still extract this information by doing the following:

import pandas as pd

sex_counts = {} #Create empty dictionary to store the counts of each window

for window_label, window_data in w_data.items(): #Iterate over each entry in the python dictionary
    sex_counts[window_label] = window_data['sex'].value_counts() #Get the count of each sex within the current window

w_summary_bin = pd.DataFrame(sex_counts).T.fillna(0).astype(int) #One row per window, one column per type of value
w_summary_bin.index.name = 'window' #Name the index like in the summary from sihnpy

It’s a little bit more work, but you should be able to get the values you need. Note that whether you used sihnpy or created a manual function, if you want to export these files, you should do it manually rather than using export_data discussed in the next section. That function exports all window information (not just the summary). Instead you should manually export extraneous dataframes like so:

w_summary_bin.to_csv("/path/to/the/output/w_summary_bin.csv")

6. Exporting data

You made it all the way to the end. The last step is simply to export the data to file. sihnpy outputs a lot of files (2 per window + the summary statistics file) so be ready. It outputs both the full dataframe for each window (what we generated in step 4) and a text file for each window containing only the IDs (1 ID per line). Here is the code to export the data:

sw.export_data(w_data=w_data, w_summary=w_summary, var='age', path='/path/to/output', name='suffix_to_add')
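
To give you an idea of what this kind of export involves, here is a rough sketch that writes one CSV and one text file of IDs per window with pandas and pathlib. The file names, the temporary output folder and the toy window are all made up for illustration; the names and layout that sihnpy actually produces may differ.

```python
import tempfile
from pathlib import Path

import pandas as pd

# One made-up window with two participants
w_data_toy = {
    "ww2_sts1_w01": pd.DataFrame(
        {"age": [55.0, 57.0]},
        index=pd.Index(["sub-01", "sub-02"], name="participant_id"),
    )
}

out_dir = Path(tempfile.mkdtemp())  # stand-in for '/path/to/output'

for name, df in w_data_toy.items():
    df.to_csv(out_dir / f"{name}.csv")  # full data for the window
    ids_text = "\n".join(df.index) + "\n"  # one participant ID per line
    (out_dir / f"{name}_ids.txt").write_text(ids_text)

written = sorted(p.name for p in out_dir.iterdir())
print(written)
```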

And you are done with the sliding-window analysis!

tl;dr

Too lazy to read everything? Or read everything and need a quick refresher? Here is the code in the order you need to make it work.

from sihnpy.datasets import pad_sw_input #For practice data
from sihnpy import sliding_window as sw #Sliding-window functions

pad_age_data = pad_sw_input() #Import practice data

n_windows = sw.bins(data=pad_age_data, var='age', w_size=100, s_size=20, collapse=False) #Computes the number of windows to create

w_store = sw.build_windows(data=pad_age_data, var='age', w_size=100, s_size=20, n_bin=n_windows) #Build the windows

w_data = sw.data_by_window(w_store=w_store, data=pad_age_data) #Reconstructs the data for each window

w_summary = sw.sum_by_window(w_data=w_data, var='age') #Computes summary statistics for each window

sw.export_data(w_data=w_data, w_summary=w_summary, var='age', path='/path/to/output', name='suffix_to_add') #Export the sliding-window data

References

Here are the references for this section:


1

St-Onge F, Javanray M, Pichet Binette A, Strikwerda-Brown C, Remz J, Spreng RN, Shafiei G, Misic B, Vachon-Presseau E, Villeneuve S. (In press). Functional connectome fingerprinting across the lifespan. Network Neuroscience. doi: 10.1162/netn_a_00320