Sliding-window analysis
Let’s get down to it! This module is probably the easiest to use in sihnpy so you will be on your way very quickly.
Already read the tutorial before and you just want the code (a.k.a. too long; didn’t read)? Head on out to the tl;dr section.
Practice data
In other sihnpy modules, real data from a subset of the PREVENT-AD Open Dataset is used. While we could probably use the continuous handedness scale to create windows, there is not a lot of diversity in the numbers (so a lot of participants that are fully right-handed would be clumped together in most windows).
A variable that usually lends itself well to the sliding-window approach is age (and well… this is the variable that was originally used for this methodology). Age is not available in the PREVENT-AD Open Dataset as it is restricted information. Instead, I opted to simulate age data for the PREVENT-AD participants. This is similar to the data available for the Spatial Extent module. I will refer you to that section for more detailed information on how the age data was simulated, as the data for the sliding-window was created in a very similar manner (though for the sliding-window we create a single Gaussian distribution).
Just know that the simulated age data matches the mean, standard deviation and inclusion criteria from the PREVENT-AD: mean age of 65 years, standard deviation of 5 years and minimum age to be included is 55 years.
Warning
Just like for the spatial extent module, sihnpy provides practice data to use the sliding-window module. While PREVENT-AD participants are used, the data is simulated. As a general rule for sihnpy, and especially for this module, only use the data provided to help you practice using the module, not to conduct or publish actual research.
Deriving the sliding windows
1. Preparing the data
To run the sliding-window module, you will need two things:
A spreadsheet with the data, with the index set as the participants’ IDs
The name of the variable we want to slide along
If you already have your data ready, you can skip ahead to the next section.
As mentioned before, sihnpy has data available for you to use to practice:
from sihnpy.datasets import pad_sw_input
pad_age_data = pad_sw_input()
pad_age_data
| participant_id | sex | test_language | handedness_score | handedness_interpretation | age |
|---|---|---|---|---|---|
| sub-5458966 | Male | French | 80.00 | Right-handed | 65.892657 |
| sub-2424540 | Female | French | 100.00 | Right-handed | 65.543026 |
| sub-7855613 | Female | French | 90.00 | Right-handed | 59.054610 |
| sub-3137570 | Male | French | 90.00 | Right-handed | 65.653705 |
| sub-9650197 | Female | French | 100.00 | Right-handed | 68.059713 |
| ... | ... | ... | ... | ... | ... |
| sub-5336241 | Female | French | -30.00 | Ambidextrous | 70.342373 |
| sub-1002928 | Female | French | 100.00 | Right-handed | 68.658707 |
| sub-1283278 | Female | English | 80.00 | Right-handed | 61.154117 |
| sub-9101699 | Male | French | 57.89 | Right-handed | 62.495973 |
| sub-6261459 | Male | French | 100.00 | Right-handed | 64.487385 |
308 rows × 5 columns
We have our dataset, which is basically the demographic information for our participants included in the PREVENT-AD Open Dataset with the simulated age. We’re ready to go!
Fix: Dataframe index
While importing the data from sihnpy is easy and already in the right format, it is critical that the data used in the rest of the functions has the final index you want to use. Contrary to R, pandas uses an Index, where each row is referred to by a label. sihnpy will sort participants and output the results of each window with these labels.
In sihnpy in particular, we output the list of participants in each window (i.e., we extract and output the index of the dataframe) and we output the data of each window (i.e., we extract and output the data of each window based on their index). You can easily set an index by doing the following procedure when importing your data in Python:
import pandas as pd
data = pd.read_csv("/path/to/file.csv", index_col=0) #The number is the position (integer) of the column to be used as index
Otherwise, if you already have the data imported in Python, you can manually force the index to the variable you want:
import pandas as pd
data = pd.read_csv("/path/to/file.csv")
data_indexed = data.set_index('named_column') #Where `named_column` is the name of the column to use as index
2. Calculating the number of windows
The first step is to estimate how many windows we want to create. To ensure we don’t have empty windows, sihnpy will use your desired window and step size to compute the ideal number of windows. For a refresher on what the window size and the step size are, you can refer to the introduction to the sliding-window.
For example, let’s say I would like a window size of 100 and step size of 20, then I would simply need to tell sihnpy:
from sihnpy import sliding_window as sw
n_windows_sane = sw.bins(data=pad_age_data, var='age', w_size=100, s_size=20)
Collapse is False: the last window may have a smaller number of participants
Number of windows: 12
You aren’t limited in how you divide the windows. You can even use odd numbers if you like (not my preference cos I don’t like odd numbers… but if it fits your research design, great!):
n_windows_insane = sw.bins(data=pad_age_data, var='age', w_size=66, s_size=13)
Collapse is False: the last window may have a smaller number of participants
Number of windows: 20
The crucial information is really only to tell sihnpy which pandas.DataFrame it should be using (pad_age_data in our case) and the name of the variable it should use for sorting and figuring out the windows.
Warning
Missing values are not currently tolerated in the sliding-window module and will throw errors. Make sure that there are no missing values on your sorting variable. Future versions will allow users to choose whether to throw errors or to put missing values first or last.
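Before computing the windows, it can help to check the sorting variable for missing values yourself. The snippet below is a minimal sketch using plain pandas on a made-up dataframe (the IDs and ages are invented for illustration); it is not part of sihnpy, and dropping participants is only one possible way to handle missing values.

```python
import numpy as np
import pandas as pd

# Toy data standing in for your own spreadsheet (IDs and ages are made up)
toy = pd.DataFrame(
    {"age": [61.2, np.nan, 70.5, 58.9]},
    index=pd.Index(["sub-01", "sub-02", "sub-03", "sub-04"], name="participant_id"),
)

# Count missing values on the variable we want to slide along
n_missing = int(toy["age"].isna().sum())
print(f"Missing values on 'age': {n_missing}")

# One option (an assumption about your design): keep only participants with a value
toy_complete = toy.dropna(subset=["age"])
print(toy_complete)
```

Whether dropping participants is appropriate depends on your research design; the point here is only to catch the problem before the sliding-window functions do.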
Advanced topic: Collapse argument
You might have noticed above that there is a message indicating that “Collapse is False”. What is that?
First, we need to understand how sihnpy computed the number of windows. The formula is as follows:
nwindows = ceiling((nsub - wsize) / stssize)
Where nwindows is the resulting number of windows, nsub is the number of participants in the whole sample, wsize is the size of the windows we want and stssize is the step size we want. In other words, the formula subtracts the window size from the total number of participants, and divides the result by the step size. Because the numbers we choose can result in divisions with remainders, we force a ceiling rounding (rounding up) as we can’t have fractions of windows. This is also where the collapse argument comes into play. As the parameters we choose will almost never divide the sample into windows that all contain the same number of participants, sihnpy lets the user choose how to deal with this.
In the default scenario (collapse=False), sihnpy will assume that we prefer having more windows, but the last window will have fewer participants. In this case, sihnpy will automatically add one more window to the count, in which the remaining participants will be put. In the other scenario (collapse=True), sihnpy will assume that we prefer having fewer windows, but the last window will have more participants. In this case, sihnpy will not add any window to the count, but the last window will have more participants.
It is possible that you choose a step size and window size that would ensure that there are no extra participants at the end. In such a case, please set collapse=True, as otherwise sihnpy will create an extraneous empty window.
In the end, choosing whether or not to collapse falls to the user, and I don’t think there is a good or a bad choice. In our recent publication [1], we opted for collapse=False.
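To make the arithmetic concrete, here is a small sketch of the formula as it is described in this section. The function `count_windows` is a hypothetical helper written for illustration; it is not sihnpy's code, and it simply assumes that collapse=False adds one window to the rounded-up result while collapse=True does not.

```python
import math


def count_windows(n_sub, w_size, s_size, collapse=False):
    """Hypothetical helper reproducing the arithmetic described in the text."""
    # Ceiling of (participants - window size) / step size
    n_windows = math.ceil((n_sub - w_size) / s_size)
    # Assumption from the text: without collapsing, one extra (smaller) window is added
    if not collapse:
        n_windows += 1
    return n_windows


# The practice data has 308 participants
print(count_windows(308, 100, 20))                  # 12, as in the first example
print(count_windows(308, 66, 13))                   # 20, as in the second example
print(count_windows(308, 100, 20, collapse=True))   # 11 under this reading
```

The first two values match the numbers of windows reported by sw.bins above; the collapse=True value follows from the description only and was not checked against sihnpy's output.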
3. Building the windows
Once the number of windows has been determined, we need to “build” the windows. In other words, we need to split the participants into their respective windows. In sihnpy, this is the step where the sliding-window is applied to the data. Specifically, we use pandas’s iloc to grab participants while accounting for our window and step sizes. For all the windows except the last one, this is determined by the following equations:
Starting Index: stssize * (current window - 1)
Ending Index: wsize + stssize * (current window - 1)
Let’s imagine a window size of 100 and a step size of 20. The first window would be:
Starting Index: 20 * (1 - 1) = 0
Ending Index: 100 + 20 * (1 - 1) = 100
So the starting index for the first window would be 0 and the ending index would be 100. Let’s repeat this with window 5 just to demonstrate.
Starting Index: 20 * (5 - 1) = 80
Ending Index: 100 + 20 * (5 - 1) = 180
As you can see, we moved up our sliding-window so that it now starts at index 80, and ends at index 180.
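The two equations can be sketched in a few lines of Python. The helper `window_bounds` and the toy dataframe below are made up for illustration (they are not sihnpy functions or the practice data), and the sketch only covers windows other than the last one, as in the equations above.

```python
import pandas as pd


def window_bounds(window, w_size, s_size):
    """Hypothetical helper: start and end positions of a (non-final) window."""
    start = s_size * (window - 1)
    end = w_size + s_size * (window - 1)
    return start, end


# Toy data: 308 made-up participants with made-up ages, sorted by age
toy = pd.DataFrame(
    {"age": [55 + 0.08 * i for i in range(308)]},
    index=pd.Index([f"sub-{i:03d}" for i in range(308)], name="participant_id"),
).sort_values("age")

# Window 5 with a window size of 100 and a step size of 20
start, end = window_bounds(5, w_size=100, s_size=20)
window_5 = toy.iloc[start:end]  # iloc excludes the end position

print(start, end, len(window_5))  # 80 180 100
```

Because iloc excludes the end position, positions 80 to 179 are selected, which gives exactly 100 participants.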
Thankfully, you don’t need to compute any of that: sihnpy will do it for you!
w_store = sw.build_windows(data=pad_age_data, var='age', w_size=100, s_size=20, n_bin=n_windows_sane)
Creating bin 1
Creating bin 2
Creating bin 3
Creating bin 4
Creating bin 5
Creating bin 6
Creating bin 7
Creating bin 8
Creating bin 9
Creating bin 10
Creating bin 11
Creating bin 12
And like that, it’s done! sihnpy stored all of our windows in a dictionary. You can access any of the windows with the following naming convention:
ww{w_size}_sts{s_size}_w{n_window}
Note that, so that the file names all have the same number of characters once output, windows 1-9 have an extra 0 in their name (e.g., ww100_sts20_w08).
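If you want to loop over the windows, you can rebuild the keys from the naming convention. The helper `window_key` below is hypothetical and assumes the window number is zero-padded to two digits, as in the example above; how sihnpy names windows when there are 100 or more of them is not covered here.

```python
def window_key(w_size, s_size, n_window):
    """Hypothetical helper building a dictionary key from the naming convention."""
    # :02d pads single-digit window numbers with a leading zero
    return f"ww{w_size}_sts{s_size}_w{n_window:02d}"


print(window_key(100, 20, 8))   # ww100_sts20_w08
print(window_key(100, 20, 12))  # ww100_sts20_w12

# All keys for 12 windows, in order
keys = [window_key(100, 20, n) for n in range(1, 13)]
```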
Let’s take a look at the first window to see what it contains:
w_store['ww100_sts20_w01']
| participant_id |
|---|
| sub-3165520 |
| sub-4396879 |
| sub-9249727 |
| sub-9327302 |
| sub-4498598 |
| ... |
| sub-2757160 |
| sub-9865768 |
| sub-6967785 |
| sub-1176949 |
| sub-7755697 |
100 rows × 0 columns
Great! We see a pandas.DataFrame with our participants and with 0 columns. This is normal: the columns are removed to simplify merging data later on and to easily output the list of participants as needed. More on that in the section on data export.
For fun, let’s check that the windows are “sliding” properly. Let’s take the last participant in our window: sub-7755697.
w_store['ww100_sts20_w01'].index.get_loc('sub-7755697')
99
In the first window, this participant is at the last position, i.e., position 100. It shows up as 99 but it is actually the 100th participant; this is normal because Python is 0-indexed (meaning that the count starts at 0, not 1). If the sliding window worked properly, the position of the participant will slide by 20 indices in the next window (so they should be at position 80; 79 in Python 0-index). This should be the case for all subsequent windows until the participant is no longer included (which should happen in window 6):
print(w_store['ww100_sts20_w02'].index.get_loc('sub-7755697'))
print(w_store['ww100_sts20_w03'].index.get_loc('sub-7755697'))
print(w_store['ww100_sts20_w04'].index.get_loc('sub-7755697'))
print(w_store['ww100_sts20_w05'].index.get_loc('sub-7755697'))
try:
    w_store['ww100_sts20_w06'].index.get_loc('sub-7755697')
except KeyError:
    print("Participant not in this window")
79
59
39
19
Participant not in this window
That’s right on the money! The algorithm is working properly.
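Another way to look at the same property: with a window size of 100 and a step size of 20, two consecutive full windows should share 100 - 20 = 80 participants. The sketch below illustrates this with made-up IDs and plain pandas slicing; it does not use the practice data or any sihnpy function.

```python
import pandas as pd

# Made-up, already sorted participant IDs
ids = pd.Index([f"sub-{i:03d}" for i in range(308)], name="participant_id")

# First two windows for a window size of 100 and a step size of 20
window_1 = ids[0:100]
window_2 = ids[20:120]

# Participants present in both windows
shared = window_1.intersection(window_2)
print(len(shared))  # 80

# The last participant of window 1 moves down by 20 positions in window 2
print(window_1.get_loc("sub-099"), window_2.get_loc("sub-099"))  # 99 79
```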
4. Reconstructing data in each window
As we saw in the previous step, the build_windows function returns only an index with the participant IDs. Now we need to associate the data of each participant in each window, so each window has its own spreadsheet with its own data. sihnpy only needs the dictionary in which we stored the IDs to create the windows and the original dataset.
w_data = sw.data_by_window(w_store=w_store, data=pad_age_data)
Reconstructing data for window ww100_sts20_w01
Reconstructing data for window ww100_sts20_w02
Reconstructing data for window ww100_sts20_w03
Reconstructing data for window ww100_sts20_w04
Reconstructing data for window ww100_sts20_w05
Reconstructing data for window ww100_sts20_w06
Reconstructing data for window ww100_sts20_w07
Reconstructing data for window ww100_sts20_w08
Reconstructing data for window ww100_sts20_w09
Reconstructing data for window ww100_sts20_w10
Reconstructing data for window ww100_sts20_w11
Reconstructing data for window ww100_sts20_w12
This creates a new dictionary, where each entry is a dataframe. For instance, let’s look again at the first window:
w_data['ww100_sts20_w01']
| participant_id | sex | test_language | handedness_score | handedness_interpretation | age |
|---|---|---|---|---|---|
| sub-3165520 | Male | English | 80.0 | Right-handed | 55.000000 |
| sub-4396879 | Male | French | 100.0 | Right-handed | 55.000000 |
| sub-9249727 | Female | French | 100.0 | Right-handed | 55.000000 |
| sub-9327302 | Female | French | 100.0 | Right-handed | 55.000000 |
| sub-4498598 | Female | French | 100.0 | Right-handed | 55.000000 |
| ... | ... | ... | ... | ... | ... |
| sub-2757160 | Female | French | 90.0 | Right-handed | 62.998585 |
| sub-9865768 | Female | French | 50.0 | Right-handed | 63.005971 |
| sub-6967785 | Male | French | 90.0 | Right-handed | 63.092224 |
| sub-1176949 | Female | French | 80.0 | Right-handed | 63.194265 |
| sub-7755697 | Female | French | 100.0 | Right-handed | 63.285955 |
100 rows × 5 columns
We get the full data for the 100 participants included in this window. We also see that sub-7755697 is still the last participant, keeping the same order we saw before. We can verify this if we are paranoid:
print(w_data['ww100_sts20_w02'].index.get_loc('sub-7755697'))
print(w_data['ww100_sts20_w03'].index.get_loc('sub-7755697'))
print(w_data['ww100_sts20_w04'].index.get_loc('sub-7755697'))
print(w_data['ww100_sts20_w05'].index.get_loc('sub-7755697'))
try:
    w_data['ww100_sts20_w06'].index.get_loc('sub-7755697')
except KeyError:
    print("Participant not in this window")
79
59
39
19
Participant not in this window
This gives the same result as before, so we’re all good.
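For intuition, attaching data to an ID-only window can be sketched with a plain pandas join on the index. The example below uses made-up data and window keys; it is one possible way to do this and not necessarily how data_by_window is implemented.

```python
import pandas as pd

# Made-up dataset indexed by participant ID
toy = pd.DataFrame(
    {"sex": ["Female", "Male", "Female", "Male"], "age": [58.1, 61.4, 66.0, 70.3]},
    index=pd.Index(["sub-01", "sub-02", "sub-03", "sub-04"], name="participant_id"),
)

# Windows stored as empty dataframes holding only the participant IDs
toy_store = {
    "ww2_sts1_w01": pd.DataFrame(index=toy.index[0:2]),
    "ww2_sts1_w02": pd.DataFrame(index=toy.index[1:3]),
}

# Join the original data onto each window's IDs (left join on the index)
toy_data = {key: ids.join(toy) for key, ids in toy_store.items()}

print(toy_data["ww2_sts1_w02"])
```

Because the join is done on the index, the order of participants within each window is preserved.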
5. Summary statistics for each window
Ok so sihnpy split our participants into windows and slid across the age variable. That’s great. But something you might wonder is: what is the actual age of each window? You could easily compute this for each dataframe individually, but it is kind of a pain. Thankfully, I am quite a lazy programmer and I didn’t want to have to do that every time, so I integrated a function that does this in sihnpy. You just need to feed it the dictionary we just computed as well as the name of the variable you want statistics on.
w_summary = sw.sum_by_window(w_data=w_data, var='age')
w_summary
| window | mean_age | median_age | sd_age | min_age | max_age |
|---|---|---|---|---|---|
| ww100_sts20_w01 | 59.457366 | 59.790299 | 2.556229 | 55.000000 | 63.285955 |
| ww100_sts20_w02 | 61.072791 | 61.320336 | 2.009204 | 57.227661 | 63.805879 |
| ww100_sts20_w03 | 62.307265 | 62.682785 | 1.726616 | 58.661927 | 64.730185 |
| ww100_sts20_w04 | 63.415577 | 63.598469 | 1.403649 | 60.456187 | 65.634214 |
| ww100_sts20_w05 | 64.369545 | 64.364170 | 1.226137 | 62.049926 | 66.424965 |
| ww100_sts20_w06 | 65.219789 | 65.186617 | 1.220798 | 63.318288 | 67.369632 |
| ww100_sts20_w07 | 66.092823 | 66.022348 | 1.289132 | 63.870281 | 68.415997 |
| ww100_sts20_w08 | 66.984937 | 66.885311 | 1.323495 | 64.744182 | 69.223976 |
| ww100_sts20_w09 | 67.928772 | 67.910107 | 1.402626 | 65.653705 | 70.425195 |
| ww100_sts20_w10 | 68.989369 | 68.841317 | 1.595768 | 66.443800 | 72.391602 |
| ww100_sts20_w11 | 70.330097 | 70.028192 | 2.061643 | 67.493341 | 74.592965 |
| ww100_sts20_w12 | 71.461680 | 70.743220 | 2.510278 | 68.420610 | 79.522548 |
sihnpy will output this dataframe, where each row is a window, and each column is a descriptive statistic. Easy-peasy.
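If you are curious what such a summary involves, the sketch below loops over a dictionary of windows and computes the same five statistics with pandas. The data and window keys are made up, and this is an illustration rather than sihnpy's code; in particular, pandas' default sample standard deviation (ddof=1) is an assumption here.

```python
import pandas as pd

# Made-up windows, each holding a small dataframe
toy_data = {
    "ww2_sts1_w01": pd.DataFrame({"age": [58.0, 62.0]}, index=["sub-01", "sub-02"]),
    "ww2_sts1_w02": pd.DataFrame({"age": [62.0, 66.0]}, index=["sub-02", "sub-03"]),
}

var = "age"
rows = {}
for key, window in toy_data.items():
    rows[key] = {
        f"mean_{var}": window[var].mean(),
        f"median_{var}": window[var].median(),
        f"sd_{var}": window[var].std(),  # pandas default: sample SD (ddof=1)
        f"min_{var}": window[var].min(),
        f"max_{var}": window[var].max(),
    }

# One row per window, one column per statistic
toy_summary = pd.DataFrame.from_dict(rows, orient="index")
toy_summary.index.name = "window"
print(toy_summary)
```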
Advanced topic: Summary data on variables not used to create the windows
Depending on your preferences, research question and topic, you might be interested in finding out more information for each window (e.g., report all demographics for each window in your demographics table). You can do this directly in sihnpy as long as the variables are continuous.
w_summary_hand = sw.sum_by_window(w_data=w_data, var='handedness_score')
The original goal of the function was to get the information on the variable used to sort people into their windows, as it is most often what readers of your research will want to know. This is why sihnpy doesn’t offer support for binary variables (e.g., how many males/females in each window). That said, the function sum_by_window is really only a simple loop function wrapping pandas functions. You could still extract this information by doing the following:
import pandas as pd
counts = {} #Store one series of counts per window
for window_label, window_data in w_data.items(): #Iterate over each entry in the python dictionary
    counts[window_label] = window_data['sex'].value_counts() #Get the count of each sex in the current window
w_summary_bin = pd.DataFrame(counts).T.fillna(0).astype(int) #One row per window, one column per value; 0 if a value is absent from a window
w_summary_bin.index.name = 'window' #Name the index after what each row represents
It’s a little bit more work, but you should be able to get the values you need. Note that whether you used sihnpy or created a manual function, if you want to export these files, you should do it manually rather than using export_data, discussed in the next section. That function exports all window information (not just the summary). Instead, you should manually export additional dataframes like so:
w_summary_bin.to_csv("/path/to/the/output/w_summary_bin.csv")
6. Exporting data
You made it all the way to the end. The last step is simply to export the data to file. sihnpy outputs a lot of files (2 per window + the summary statistics file) so be ready. It outputs both the full data dataframe (what we generated in step 4) as well as a text file for each window containing only the IDs (1 ID per line). Here is the code to export the data:
sw.export_data(w_data=w_data, w_summary=w_summary, var='age', path='/path/to/output', name='suffix_to_add')
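To give an idea of what these two kinds of files look like, here is a rough sketch that writes one CSV and one ID-only text file per window using pandas and pathlib. The file names, the temporary output folder and the toy data are all hypothetical; export_data may name and organize its files differently.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up window data
toy_data = {
    "ww2_sts1_w01": pd.DataFrame(
        {"age": [58.0, 62.0]},
        index=pd.Index(["sub-01", "sub-02"], name="participant_id"),
    ),
}

# Temporary folder used as a stand-in for your own output path
out_dir = Path(tempfile.mkdtemp())

for key, window in toy_data.items():
    # Full data for the window
    window.to_csv(out_dir / f"{key}.csv")
    # Participant IDs only, one per line
    (out_dir / f"{key}_ids.txt").write_text("\n".join(window.index) + "\n")

print(sorted(p.name for p in out_dir.iterdir()))
```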
And you are done with the sliding-window analysis!
tl;dr
Too lazy to read everything? Or read everything and need a quick refresher? Here is the code in the order you need to make it work.
from sihnpy.datasets import pad_sw_input #For practice data
from sihnpy import sliding_window as sw #Sliding-window functions
pad_age_data = pad_sw_input() #Import practice data
n_windows = sw.bins(data=pad_age_data, var='age', w_size=100, s_size=20, collapse=False) #Computes the number of windows to create
w_store = sw.build_windows(data=pad_age_data, var='age', w_size=100, s_size=20, n_bin=n_windows) #Build the windows
w_data = sw.data_by_window(w_store=w_store, data=pad_age_data) #Reconstructs the data for each window
w_summary = sw.sum_by_window(w_data=w_data, var='age') #Computes summary statistics for each window
sw.export_data(w_data=w_data, w_summary=w_summary, var='age', path='/path/to/output', name='suffix_to_add') #Export the sliding-window data
References
Here are the references for this section:
- [1] St-Onge F, Javanray M, Pichet Binette A, Strikwerda-Brown C, Remz J, Spreng RN, Shafiei G, Misic B, Vachon-Presseau E, Villeneuve S. (In press). Functional connectome fingerprinting across the lifespan. Network Neuroscience. doi: 10.1162/netn_a_00320