Sliding-window analysis

Let’s get down to it! This module is probably the easiest to use in sihnpy so you will be on your way very quickly.

Already read the tutorial before and you just want the code (a.k.a. too long; didn’t read)? Head on out to the tl;dr section.

Practice data

In other sihnpy modules, real data from a subset of the PREVENT-AD Open Dataset is used. While we could probably use the continuous handedness scale to create windows, there is not a lot of diversity in the numbers (so a lot of participants that are fully right-handed would be clumped together in most windows).

A variable that usually lends itself well to the sliding-window approach is age (and well… this is the variable that was originally used for this methodology). Age is not available in the PREVENT-AD Open Dataset as it is restricted information. Instead, I opted to simulate age data for the PREVENT-AD participants. This is similar to the data available for the Spatial Extent module. I will refer you to that section for more detailed information on how the age data was simulated, as the data for the sliding-window module was created in a very similar manner (though for the sliding-window we create a single Gaussian distribution).

Just know that the simulated age data matches the mean, standard deviation and inclusion criteria from the PREVENT-AD: mean age of 65 years, standard deviation of 5 years and minimum age to be included is 55 years.
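To give a feel for what such data might look like, here is a minimal sketch using NumPy. It is not the procedure used to generate the sihnpy practice data (see the Spatial Extent section for that): the seed and the choice to clip values below 55 up to the minimum age are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # Arbitrary seed, for reproducibility only

# Draw ages from a single Gaussian distribution (mean 65, SD 5)
simulated_age = rng.normal(loc=65, scale=5, size=308)

# Assumption: enforce the minimum inclusion age by clipping at 55
simulated_age = np.clip(simulated_age, 55, None)

print(simulated_age.min() >= 55)
```

Note that clipping nudges the mean up and the standard deviation down a little, so a real simulation would probably need to handle the lower bound with more care.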

Warning

Just like for the spatial extent module, sihnpy provides practice data to use the sliding-window module. While PREVENT-AD participants are used, the data is simulated. As a general rule for sihnpy, and especially for this module, only use the data provided to help you practice using the module, not to conduct or publish actual research.

Deriving the sliding windows

1. Preparing the data

To run the sliding-window module, you will need two things:

A spreadsheet with the data, with the index set as the participants’ IDs
The name of the variable we want to slide along

If you already have your data ready, you can skip ahead to the next section.

As mentioned before, sihnpy has data available for you to use to practice:

from sihnpy.datasets import pad_sw_input

pad_age_data = pad_sw_input()
pad_age_data

	sex	test_language	handedness_score	handedness_interpretation	age
participant_id
sub-5458966	Male	French	80.00	Right-handed	65.892657
sub-2424540	Female	French	100.00	Right-handed	65.543026
sub-7855613	Female	French	90.00	Right-handed	59.054610
sub-3137570	Male	French	90.00	Right-handed	65.653705
sub-9650197	Female	French	100.00	Right-handed	68.059713
...	...	...	...	...	...
sub-5336241	Female	French	-30.00	Ambidextrous	70.342373
sub-1002928	Female	French	100.00	Right-handed	68.658707
sub-1283278	Female	English	80.00	Right-handed	61.154117
sub-9101699	Male	French	57.89	Right-handed	62.495973
sub-6261459	Male	French	100.00	Right-handed	64.487385

308 rows × 5 columns

We have our dataset, which is basically the demographic information for our participants included in the PREVENT-AD Open Dataset with the simulated age. We’re ready to go!

Fix: Dataframe index

While importing the data from sihnpy is easy and already in the right format, it is critical that the data used in the rest of the functions have the final index you want to use. Contrary to R, pandas uses an Index, where each row is referred to by a label. sihnpy will sort participants and output the results of each window with these labels.

In sihnpy in particular, we output the list of participants in each window (i.e., we extract and output the index of the dataframe) and we output the data of each window (i.e., we extract and output the data of each window based on their index). You can easily set an index by doing the following procedure when importing your data in Python:

import pandas as pd

data = pd.read_csv("/path/to/file.csv", index_col=0) #The number is the position (integer) of the column to be used as index

Otherwise, if you already have the data imported in Python, you can manually force the index to the variable you want:

import pandas as pd

data = pd.read_csv("/path/to/file.csv")
data_indexed = data.set_index('named_column') #Where `named_column` is the name of the column to use as index
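Since the windows are tracked through the index labels, it can be worth confirming that the labels are unique before going further. The sketch below uses a small made-up dataframe (the IDs and ages are hypothetical) and standard pandas attributes.

```python
import pandas as pd

# Hypothetical toy data standing in for your own spreadsheet
data = pd.DataFrame({
    "participant_id": ["sub-01", "sub-02", "sub-03"],
    "age": [61.2, 58.9, 70.4],
})
data_indexed = data.set_index("participant_id")

# Each participant should appear only once in the index
print(data_indexed.index.is_unique)

# Rows can now be retrieved by label rather than by position
print(data_indexed.loc["sub-02", "age"])
```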

2. Calculating the number of windows

The first step is to estimate how many windows we want to create. To ensure we don’t have empty windows, sihnpy will use your desired window and step size to compute the ideal number of windows. For a refresher on what the window size and the step size are, you can refer to the introduction to the sliding-window module.

For example, let’s say I would like a window size of 100 and step size of 20, then I would simply need to tell sihnpy:

from sihnpy import sliding_window as sw

n_windows_sane = sw.bins(data=pad_age_data, var='age', w_size=100, s_size=20)

Collapse is False: the last window may have a smaller number of participants
Number of windows: 12

You aren’t limited to how you want to divide the windows. So you can even use odd numbers if you like (not my preference cos I don’t like odd numbers… but if it fits your research design, great!):

n_windows_insane = sw.bins(data=pad_age_data, var='age', w_size=66, s_size=13)

Collapse is False: the last window may have a smaller number of participants
Number of windows: 20

The crucial information is really only to tell sihnpy which pandas.DataFrame it should be using (pad_age_data in our case) and the name of the variable it should use for sorting and figuring out the windows.

Warning

Missing values are not currently tolerated in the sliding-window module and will throw errors. Make sure that there are no missing values on your sorting variable. Future versions will allow users to choose whether to throw errors, put missing values first or last.
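A quick way to check for missing values before calling the module is to count them with pandas. This is a minimal sketch on made-up data (the IDs and ages are hypothetical); dropping the incomplete rows is only one possible way to handle them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age
data = pd.DataFrame(
    {"age": [61.2, np.nan, 70.4]},
    index=pd.Index(["sub-01", "sub-02", "sub-03"], name="participant_id"),
)

n_missing = data["age"].isna().sum()
print(f"Missing values on 'age': {n_missing}")

# One option: drop participants with a missing sorting variable
data_complete = data.dropna(subset=["age"])
```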

Advanced topic: Collapse argument

You might have noticed above that there is a message indicating that “Collapse is False”. What is that?

First, we need to understand how sihnpy computed the number of windows. The formula is as follows:

n_windows = ceiling((n_sub - w_size) / s_size)

Where n_windows is the resulting number of windows, n_sub is the number of participants in the whole sample, w_size is the size of the windows we want and s_size is the step size we want. In other words, the formula subtracts the window size from the total number of participants, and divides the result by the step size. Because the numbers we choose can result in divisions with remainders, we force a ceiling rounding (rounding up) as we can’t have fractions of windows. This is also where the collapse argument comes into play. The parameters we choose will almost never divide the sample so that every window has exactly the same number of participants, so sihnpy lets the user choose how to deal with the leftover participants.

In the default scenario (collapse=False), sihnpy will assume that we prefer having more windows, with the last window having fewer participants. In this case, sihnpy will automatically add one more window to the count, in which the remaining participants will be placed. In the other scenario (collapse=True), sihnpy will assume that we prefer having fewer windows, with the last window having more participants. In this case, sihnpy will not add any window to the count, but the last window will have more participants.
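The counting logic described here can be sketched in a few lines of plain Python. The function name count_windows is hypothetical and this is not sihnpy's own code, only a reproduction of the formula and of the collapse behaviour as explained in this section.

```python
import math

def count_windows(n_sub, w_size, s_size, collapse=False):
    """Hypothetical helper reproducing the formula described above."""
    n_windows = math.ceil((n_sub - w_size) / s_size)
    if not collapse:
        # Extra window for the remaining participants
        n_windows += 1
    return n_windows

print(count_windows(308, 100, 20))  # 12
print(count_windows(308, 66, 13))   # 20
print(count_windows(308, 100, 20, collapse=True))  # 11
```

With the 308 practice participants, this gives the same counts that sihnpy printed earlier for both sets of parameters.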

It is possible that you choose a step size and window size that would ensure that there are no extra participants at the end. In such a case, please set collapse=True, as otherwise sihnpy will create an extraneous empty window.

In the end, choosing to collapse or not falls to the user, and I don’t think there is a good or a bad choice. In our recent publication [1], we opted for collapse=False.

3. Building the windows

Once the number of windows has been determined, we need to “build” the windows. In other words, we need to split the participants into their respective windows. In sihnpy, this is the step where the sliding-window is applied to the data. Specifically, we use pandas’s iloc to grab participants while accounting for our window and step sizes. For all the windows except the last one, this is determined by the following equations:

Starting Index: s_size * (current window - 1)
Ending Index: w_size + s_size * (current window - 1)

Let’s imagine a window size of 100 and a step size of 20. The first window would be:

Starting Index: 20 * (1 - 1) = 0
Ending Index: 100 + 20 * (1 - 1) = 100

So the starting index for the first window would be 0 and the ending index would be 100. Let’s repeat this with window 5 just to demonstrate.

Starting Index: 20 * (5 - 1) = 80
Ending Index: 100 + 20 * (5 - 1) = 180

As you can see, we moved up our sliding-window so that it now starts at index 80, and ends at index 180.
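The same arithmetic can be written as a small function. The name window_bounds is hypothetical; it only reproduces the two equations for the windows other than the last one.

```python
def window_bounds(window, w_size, s_size):
    """Hypothetical helper: start and end positions for a window (1-based number)."""
    start = s_size * (window - 1)
    end = w_size + s_size * (window - 1)
    return start, end

for window in (1, 5):
    print(window, window_bounds(window, w_size=100, s_size=20))
```

Because Python slices exclude the end position, a slice such as data.iloc[start:end] contains exactly w_size rows.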

Thankfully, you don’t need to compute any of that: sihnpy will do it for you!

w_store = sw.build_windows(data=pad_age_data, var='age', w_size=100, s_size=20, n_bin=n_windows_sane)

Creating bin 1
Creating bin 2
Creating bin 3
Creating bin 4
Creating bin 5
Creating bin 6
Creating bin 7
Creating bin 8
Creating bin 9
Creating bin 10
Creating bin 11
Creating bin 12

And like that, it’s done! sihnpy stored all of our windows in a dictionary. You can access any of the windows with the following naming convention:

ww{w_size}_sts{s_size}_w{n_window}

Note that, so that the file names all have the same number of characters once output, windows 1-9 have an extra 0 in their name (e.g., ww100_sts20_w08).
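If you ever need to generate these names yourself (for example, to loop over the windows in order), an f-string with zero-padding does the job. This sketch assumes two-digit padding, which matches the example above; how sihnpy pads when there are 100 windows or more is not covered here.

```python
w_size, s_size, n_windows = 100, 20, 12

# Assumption: window numbers are zero-padded to two digits, as in the example above
window_names = [f"ww{w_size}_sts{s_size}_w{n:02d}" for n in range(1, n_windows + 1)]

print(window_names[0])   # ww100_sts20_w01
print(window_names[-1])  # ww100_sts20_w12
```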

Let’s take a look at the first window to see what it contains:

w_store['ww100_sts20_w01']


participant_id
sub-3165520
sub-4396879
sub-9249727
sub-9327302
sub-4498598
...
sub-2757160
sub-9865768
sub-6967785
sub-1176949
sub-7755697

100 rows × 0 columns

Great! We see a pandas.DataFrame with our participants and with 0 columns. This is normal: the columns are removed to simplify merging data later on and to easily output the list of participants as needed. More on that in the section on data export.

For fun, let’s check that the windows are “sliding” properly. Let’s take the last participant in our window: sub-7755697.

w_store['ww100_sts20_w01'].index.get_loc('sub-7755697')

In the first window, this participant is at the last position, i.e., position 100. It shows up as 99, but it is actually the 100th participant; this is normal because Python is 0-indexed (meaning that the count starts at 0, not 1). If the sliding window worked properly, the participant’s position should shift down by 20 indices in the next window (so to position 80; 79 with Python’s 0-indexing). This should be the case for all subsequent windows until the participant is no longer included (which should happen in window 6):

print(w_store['ww100_sts20_w02'].index.get_loc('sub-7755697'))
print(w_store['ww100_sts20_w03'].index.get_loc('sub-7755697'))
print(w_store['ww100_sts20_w04'].index.get_loc('sub-7755697'))
print(w_store['ww100_sts20_w05'].index.get_loc('sub-7755697'))
try:
    w_store['ww100_sts20_w06'].index.get_loc('sub-7755697')
except KeyError:
    print("Participant not in this window")

79
59
39
19
Participant not in this window

That’s right on the money! The algorithm is working properly.
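If you would like to see the same behaviour on something small enough to check by hand, here is a minimal sketch on made-up data. The toy windows are built directly with iloc using a window size of 10 and a step size of 2; this illustrates the principle and is not sihnpy's own code.

```python
import pandas as pd

# Hypothetical toy data: 30 participants already sorted by the sliding variable
ids = pd.Index([f"sub-{i:02d}" for i in range(30)], name="participant_id")
toy = pd.DataFrame(index=ids)

w_size, s_size = 10, 2
# Build three toy windows with iloc (an illustration, not sihnpy's own code)
toy_windows = {n: toy.iloc[s_size * (n - 1): w_size + s_size * (n - 1)] for n in (1, 2, 3)}

target = toy_windows[1].index[-1]  # Last participant of the first window
positions = [toy_windows[n].index.get_loc(target) for n in (1, 2, 3)]
print(positions)  # [9, 7, 5]
```

The position drops by the step size (2) from one window to the next, just like the drop of 20 we observed with the practice data.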

4. Reconstructing data in each window

As we saw in the previous step, the build_windows function returns only an index with the participant IDs. Now we need to associate the data of each participant in each window, so each window has its own spreadsheet with its own data. sihnpy only needs the dictionary in which we stored the IDs to create the windows and the original dataset.

w_data = sw.data_by_window(w_store=w_store, data=pad_age_data)

Reconstructing data for window ww100_sts20_w01
Reconstructing data for window ww100_sts20_w02
Reconstructing data for window ww100_sts20_w03
Reconstructing data for window ww100_sts20_w04
Reconstructing data for window ww100_sts20_w05
Reconstructing data for window ww100_sts20_w06
Reconstructing data for window ww100_sts20_w07
Reconstructing data for window ww100_sts20_w08
Reconstructing data for window ww100_sts20_w09
Reconstructing data for window ww100_sts20_w10
Reconstructing data for window ww100_sts20_w11
Reconstructing data for window ww100_sts20_w12

This creates a new dictionary, where each entry is a dataframe. For instance, let’s look again at the first window:

w_data['ww100_sts20_w01']

	sex	test_language	handedness_score	handedness_interpretation	age
participant_id
sub-3165520	Male	English	80.0	Right-handed	55.000000
sub-4396879	Male	French	100.0	Right-handed	55.000000
sub-9249727	Female	French	100.0	Right-handed	55.000000
sub-9327302	Female	French	100.0	Right-handed	55.000000
sub-4498598	Female	French	100.0	Right-handed	55.000000
...	...	...	...	...	...
sub-2757160	Female	French	90.0	Right-handed	62.998585
sub-9865768	Female	French	50.0	Right-handed	63.005971
sub-6967785	Male	French	90.0	Right-handed	63.092224
sub-1176949	Female	French	80.0	Right-handed	63.194265
sub-7755697	Female	French	100.0	Right-handed	63.285955

100 rows × 5 columns

We get the full data for the 100 participants included in this window. We also see that sub-7755697 is still the last participant, keeping the same order we saw before. We can verify this if we are paranoid:

print(w_data['ww100_sts20_w02'].index.get_loc('sub-7755697'))
print(w_data['ww100_sts20_w03'].index.get_loc('sub-7755697'))
print(w_data['ww100_sts20_w04'].index.get_loc('sub-7755697'))
print(w_data['ww100_sts20_w05'].index.get_loc('sub-7755697'))
try:
    w_data['ww100_sts20_w06'].index.get_loc('sub-7755697')
except KeyError:
    print("Participant not in this window")

79
59
39
19
Participant not in this window

This gives the same result as before, so we’re all good.
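Conceptually, going from a list of IDs back to the full data is a label-based lookup. Here is a minimal sketch with plain pandas on made-up data; it shows the idea only, and sihnpy's data_by_window may well proceed differently (for example, with a merge).

```python
import pandas as pd

# Hypothetical toy dataset indexed by participant ID
data = pd.DataFrame(
    {"age": [55.0, 58.3, 61.7, 66.2]},
    index=pd.Index(["sub-01", "sub-02", "sub-03", "sub-04"], name="participant_id"),
)

# An index-only window, similar in shape to what we saw in w_store
window_ids = pd.DataFrame(index=data.index[:2])

# Look up the full rows by label, keeping the order of the window
window_data = data.loc[window_ids.index]
print(window_data)
```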

5. Summary statistics for each window

Ok, so sihnpy split our participants into windows and slid across the age variable. That’s great. But something you might wonder is: what is the actual age in each of the windows? You could easily compute this for each dataframe individually, but it is kind of a pain. Thankfully, I am quite a lazy programmer and I didn’t want to have to do that every time, so I integrated a function that does this in sihnpy. You just need to feed it the dictionary we just computed as well as the name of the variable you want statistics on.

w_summary = sw.sum_by_window(w_data=w_data, var='age')
w_summary

	mean_age	median_age	sd_age	min_age	max_age
window
ww100_sts20_w01	59.457366	59.790299	2.556229	55.000000	63.285955
ww100_sts20_w02	61.072791	61.320336	2.009204	57.227661	63.805879
ww100_sts20_w03	62.307265	62.682785	1.726616	58.661927	64.730185
ww100_sts20_w04	63.415577	63.598469	1.403649	60.456187	65.634214
ww100_sts20_w05	64.369545	64.364170	1.226137	62.049926	66.424965
ww100_sts20_w06	65.219789	65.186617	1.220798	63.318288	67.369632
ww100_sts20_w07	66.092823	66.022348	1.289132	63.870281	68.415997
ww100_sts20_w08	66.984937	66.885311	1.323495	64.744182	69.223976
ww100_sts20_w09	67.928772	67.910107	1.402626	65.653705	70.425195
ww100_sts20_w10	68.989369	68.841317	1.595768	66.443800	72.391602
ww100_sts20_w11	70.330097	70.028192	2.061643	67.493341	74.592965
ww100_sts20_w12	71.461680	70.743220	2.510278	68.420610	79.522548

sihnpy will output this dataframe, where each row is a window, and each column is a descriptive statistic. Easy-peasy.
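For the curious, a table of this shape can be produced with a short loop over the windows. The sketch below uses two made-up windows and hypothetical names (toy_w_data, toy_summary); it mirrors the column names shown above but is not sihnpy's implementation, and it relies on the pandas default for the standard deviation (sample SD), which is an assumption.

```python
import pandas as pd

# Hypothetical toy windows (two small dataframes with an 'age' column)
toy_w_data = {
    "w01": pd.DataFrame({"age": [55.0, 57.0, 59.0]}),
    "w02": pd.DataFrame({"age": [57.0, 59.0, 64.0]}),
}

var = "age"
rows = {}
for label, window in toy_w_data.items():
    rows[label] = {
        f"mean_{var}": window[var].mean(),
        f"median_{var}": window[var].median(),
        f"sd_{var}": window[var].std(),  # pandas default: sample SD (ddof=1)
        f"min_{var}": window[var].min(),
        f"max_{var}": window[var].max(),
    }

# One row per window, one column per statistic
toy_summary = pd.DataFrame.from_dict(rows, orient="index")
toy_summary.index.name = "window"
print(toy_summary)
```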

Advanced topic: Summary data on variables not used to create the windows

Depending on your preferences, research question and topic, you might be interested in finding out more information for each window (e.g., report all demographics for each window in your demographics table). You can do this directly in sihnpy as long as the variables are continuous.

w_summary_hand = sw.sum_by_window(w_data=w_data, var='handedness_score')

The original goal of the function was really to get information on the variable used to sort people into their windows, as this is most often what readers of your research will want to know. This is why sihnpy doesn’t offer support for binary variables (e.g., how many males/females in each window). That said, the function sum_by_window is really only a simple loop wrapping pandas functions. You could still extract this information by doing the following:

import pandas as pd

counts = {} #Create an empty dictionary to store the counts of each window

for window_label, window_data in w_data.items(): #Iterate over each entry in the Python dictionary
    counts[window_label] = window_data['sex'].value_counts() #Get the count of each sex within the current window

w_summary_bin = pd.DataFrame(counts).T.fillna(0).astype(int) #One row per window, one column per value; absent categories count as 0
w_summary_bin.index.name = 'window' #Name the index 'window'

It’s a little bit more work, but you should be able to get the values you need. Note that whether you used sihnpy or created a manual function, if you want to export these files, you should do it manually rather than using export_data discussed in the next section. That function exports all window information (not just the summary). Instead you should manually export extraneous dataframes like so:

w_summary_bin.to_csv("/path/to/the/output/w_summary_bin.csv")

6. Exporting data

You made it all the way to the end. The last step is simply to export the data to file. sihnpy outputs a lot of files (2 per window + the summary statistics file) so be ready. It outputs both the full dataframe for each window (what we generated in step 4) and a text file for each window containing only the IDs (1 ID per line). Here is the code to export the data:

sw.export_data(w_data=w_data, w_summary=w_summary, var='age', path='/path/to/output', name='suffix_to_add')
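To picture what the two files per window contain, here is a minimal sketch that writes one made-up window by hand with pandas and pathlib. The file names and the temporary output folder are hypothetical; the names sihnpy actually produces are not reproduced here.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy window
toy_window = pd.DataFrame(
    {"age": [55.0, 58.3]},
    index=pd.Index(["sub-01", "sub-02"], name="participant_id"),
)

out_dir = Path(tempfile.mkdtemp())  # Temporary folder, for the example only

# Full data for the window (hypothetical file name)
toy_window.to_csv(out_dir / "toy_window_data.csv")

# Participant IDs only, one per line (hypothetical file name)
(out_dir / "toy_window_ids.txt").write_text("\n".join(toy_window.index) + "\n")

print(sorted(p.name for p in out_dir.iterdir()))
```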

And you are done with the sliding-window analysis!

tl;dr

Too lazy to read everything? Or read everything and need a quick refresher? Here is the code in the order you need to make it work.

from sihnpy.datasets import pad_sw_input #For practice data
from sihnpy import sliding_window as sw #Sliding-window functions

pad_age_data = pad_sw_input() #Import practice data

n_windows = sw.bins(data=pad_age_data, var='age', w_size=100, s_size=20, collapse=False) #Computes the number of windows to create

w_store = sw.build_windows(data=pad_age_data, var='age', w_size=100, s_size=20, n_bin=n_windows) #Build the windows

w_data = sw.data_by_window(w_store=w_store, data=pad_age_data) #Reconstructs the data for each window

w_summary = sw.sum_by_window(w_data=w_data, var='age') #Computes summary statistics for each window

sw.export_data(w_data=w_data, w_summary=w_summary, var='age', path='/path/to/output', name='suffix_to_add') #Export the sliding-window data

References

Here are the references for this section:

1: St-Onge F, Javanray M, Pichet Binette A, Strikwerda-Brown C, Remz J, Spreng RN, Shafiei G, Misic B, Vachon-Presseau E, Villeneuve S. (In press). Functional connectome fingerprinting across the lifespan. Network Neuroscience. doi: 10.1162/netn_a_00320