Downloading PREVENT-AD data
When it comes to great quality data, the PREVENT-AD cohort has got your back. Multi-modal imaging? Yearly visits up to four years? What more can one neuroimager ask for?
The main issue comes down to… well… actually downloading and preprocessing it. It’s sheer size gives my personal computer’s processor anxiety and the time it would take to process everything gives ME anxiety.
Thankfully, I was very fortunate to have access to Beluga, a high performance computing resource hosted and managed by the Digital Research Alliance of Canada (DRAC). Any qualified principal investigator at a Canadian institution can apply to have access to computing resources. So if you are a student, you can convince your PI to apply for access!
Don’t have access to a HPC? Only want a subset of participants or a specific brain modality? Or you have a dedicated local computer to do the heavy lifting for the storage and preprocessing of brain images? No sweat! The downloading process allows you a lot of flexibility for all use cases.
In the next sections, I detail how to download the data and I give an example of preprocessing said data using fMRIPrep. I also add extra precisions for when the data is downloaded to a high performance computing environment (or, well, more specific to DRAC’s resources).
Note that while I try to explain things as plainly as possible, the information below does assume a certain familiarity with Python and the command line. Got questions or something isn’t clear? Open an issue on Github.
Downloading the dataset
2. - Setting up Datalad
Access and download of the dataset is enabled through Datalad. Instructions to download the dataset are here. Note that instructions vary depending on what platform you are downloading the dataset.
I am not going to focus on the instructions to download Datalad which are detailed elsewhere, but the main idea is that you need version >=0.12.5 of DataLad and version >=8.20200309 of git-annex. Depending on your choice for installing Datalad, you might also need Python (and I mean… if you are using sihnpy you already have/need Python).
Once the software above is installed you can move on to the next step.
DRAC HPCs
Datalad in an high performance computer environment
Datalad is usually very easy to install, but may require a bit more work to install on a super computer. Datalad has more extensive information on the topic here. While not directly addressed on the Wiki of the Digital Research Alliance of Canada (DRAC), others have faced issues using these resources in installing/using Datalad.
The data part of sihnpy has been downloaded and preprocessed directly on Beluga. Throughout this section, I will give tips and tricks when using Datalad on Beluga, which should also apply on other DRAC high performance computing resources.
At the time of this writing, git-annex and python/3 are modules already installed on the servers. The only step missing to install Datalad is to create a Python virtual environment in which to install it. The four lines below fully install Datalad on Beluga:
$ module load git-annex python/3 #Loads the modules
$ virtualenv ~/venv_datalad #Creates virtual environment
$ source ~/venv_datalad/bin/activate #Activate virtual environment
$ pip install datalad #Install Datalad in the virtual environment
Because Beluga has root privilege locked for users, installing Datalad through Python makes the most sense. Conda is also not available on the Cluster, meaning it needs to be downloaded using pip.
3. - Downloading the data
Now is the fun (but a bit long) part: downloading the data. From this point on, there isn’t really any difference between the three PREVENT-AD datasets. The steps are exactly the same and are all detailed here. If you are using a Python based installation of Datalad, make sure the Python environment is sourced before executing the command (this happened to me many times…)
$ datalad install https://github.com/CONP-PCNO/conp-dataset.git #Step 1 - Installs the directory structure for the CONP
$ cd conp-dataset #Step 2 - Go to the directory
$ datalad install projects/preventad-open-bids #Step 3 - Install specifically the PREVENT-AD dataset you need
$ cd projects/preventad-open-bids/BIDS_dataset #Step 4 - Move to the BIDS directory you just installed (more details on this below)
If you now ls in the directory you are in, you will see the structure of the BIDS directory like so:
$ ls
dataset_description.json
participants.json
participants.tsv
README
sub-1000173
... #Etc.
You might think “Wow! That was so fast! Since when has my internet provider upgraded my internet to download hundreds of GBs in seconds/minutes?”.
Think again. Datalad works by downloading the directory structure of the PREVENT-AD Open BIDS dataset with instructions on how to download it, but the commands above do not actually download the data. It’s basically like if a turkey was missing its stuffing (a.k.a. the best part of turkey) but came with instructions on how to get the stuffing. If you do the following command, you should see this on your terminal:
$ ls -l
... #Etc.
participants.tsv -> ../.git/annex/objects/wV/kj/MD5E-s13031--ae9bb2da82c6ce203cffc76462810530.tsv/MD5E-s13031--ae9bb2da82c6ce203cffc76462810530.tsv
... #Etc.
Depending on your settings, you will likely see that the path after the arrow (../.git/annex…) will be flashing like a turn signal. Why? As I said before, Datalad doesn’t download the dataset itself with datalad install, but the skeleton of the directory where every single file is a soft symbolic link. I’m not going in detail here, but just know that while the data are still soft symbolic links, there are no brain scans available to you. You need to do the last step to get it to work, which is datalad get <filepath>.
This part is crucial, because datalad get will determine what data are downloaded to your computer and how much space it will ultimately take. It is also at this stage that you should have nearby your username and password to access the Prevent-AD dataset as it will be asked from you during the process. Remember when I mentioned to request an account above? Yeah… You won’t be able to touch these cool new data until that is approved. This is usually pretty quick.
From there, you can download any and all files you want from the dataset. If you want/need to download the entire dataset, you can simply use datalad get *. Otherwise, you can also ask Datalad to download specific files and modalities. For instance, datalad get sub-*/*/anat/* will download all the anatomical scans from all participant across all timepoints. datalad -r get sub-5555555 would download all files for participant 5555555. You get the idea. Basically, any type of wildcard pattern that would work in Bash should also work to download the data. If you only specify a higher root directory, then you should include -r to ensure datalad downloads all sub-files.
DRAC HPCs
Downloading data on an HPC
HPCs being these awesome shared resources across individuals, they also come with limitations to make it fair for all users. For example, compute nodes don’t have access to the internet (at least in most HPCs of the DRAC) and memory usage is limited when on a login node. This means that the datalad get command needs to be slightly tweaked before it can be used properly.
First, we are forced to use a login node to download the data as we need internet access. If you were to simply do datalad get *, you would be met with a download that seems to stall indefinitely and potentially a crash of the node (with no helpful message to figure it out). This one bugged me for a while, until I found out that another DRAC user had a similar issue. Basically, the trick is to force Datalad to only do one download job at a time, which doesn’t eat all the memory from the login node.
datalad get -r -J 1 sub-XXXXXX
Adding the -J 1 option forces Datalad to only launch one download at a time which helps the process along. However, note that the process may take several hours to complete. For instance, downloading all anatomical data within the Prevent-AD Open BIDS took close to 8h while downloading all the functional data took close to 15h.
A good way to bypass this is to launch a screen or tmux to launch datalad get, then detach from the screen. That way, even if you leave the server or your cat pours water on your computer, the download will follow along.
Once your files are downloaded, you will notice that if you do ls, you will seen a nearly identical screen as before:
$ ls -l
... #Etc.
participants.tsv -> ../.git/annex/objects/wV/kj/MD5E-s13031--ae9bb2da82c6ce203cffc76462810530.tsv/MD5E-s13031--ae9bb2da82c6ce203cffc76462810530.tsv
... #Etc.
However, you will see that the symbolic link is no longer flashing. This is because Datalad downloaded the file AND converted the soft symbolic link to a hard symbolic link. The difference regarding the type of symbolic link is not really important in this context, but just know that it allows you to copy the file to a different location without causing issues.
Tip
While not necessary, I do prefer to copy or move the folders Datalad collects to a different directory one datalad get finishes. One reason is that the path inside the conp-dataset directory is quite long and unnecessary. Another reason is that it simplifies the use of container apps to pre-process data.
And that should be it! From then you have all the data available to you. Enjoy!

