(download-data)=
# Download Data

*Written by Luke Chang & Kevin Ortego*

Many of the imaging tutorials throughout this course will use open data from the Pinel Localizer task.

The Pinel Localizer task was designed to probe several different types of basic cognitive processes, such as visual perception, finger tapping, language, and math. Several of the tasks are cued by reading text on the screen (i.e., visual modality) and also by hearing auditory instructions (i.e., auditory modality). The trials are randomized across conditions and have been optimized to maximize efficiency for a rapid event-related design. There are 100 trials in total over a 5-minute scanning session. For more specific details about the task, read the original [paper](https://bmcneurosci.biomedcentral.com/articles/10.1186/1471-2202-8-91) and the [dataset paper](https://doi.org/10.1016/j.neuroimage.2015.09.052).

This dataset is well suited for these tutorials as it is (a) publicly available to anyone in the world, (b) relatively small (only about 5min), and (c) provides many options to create different types of contrasts.

There are a total of 94 subjects available, but we will primarily be working with a smaller subset of about 15.

Though the data is being shared on the [OSF website](https://osf.io/vhtf6/files/), we recommend downloading it from our [g-node repository](https://gin.g-node.org/ljchang/Localizer) as we have fixed a few issues with BIDS formatting and have also performed preprocessing using fmriprep.

In this notebook, we will walk through how to access the dataset using DataLad. Note that the entire dataset is fairly large (~42 GB), but the tutorials will mostly only be working with a small portion of the data (5.8 GB), so there is no need to download the entire thing. If you are taking the Psych60 course at Dartmouth, we have already made the data available on the jupyterhub server.


## DataLad

The easiest way to access the data is using [DataLad](https://www.datalad.org/), which is an open source version control system for data built on top of [git-annex](https://git-annex.branchable.com/). Think of it like git for data. It provides a handy command line interface for downloading data, tracking changes, and sharing it with others.

While DataLad offers a number of useful features for working with datasets, there are three in particular that we think make it worth the effort to install for this course.

1) Cloning a DataLad Repository can be completed with a single line of code `datalad clone <repository>` and provides the full directory structure in the form of symbolic links. This allows you to explore all of the files in the dataset, without having to download the entire dataset at once.

2) Specific files can be easily downloaded using `datalad get <filename>`, and files can be removed from your computer at any time using `datalad drop <filename>`. As these datasets are large, this will allow you to only work with the data that you need for a specific tutorial and you can drop the rest when you are done with it.

3) All of the DataLad commands can be run within Python using the datalad [python api](http://docs.datalad.org/en/latest/modref.html).

We will only be covering a few basic DataLad functions to get and drop data. We encourage the interested reader to read the very comprehensive DataLad [User Handbook](http://handbook.datalad.org/en/latest/) for more details and troubleshooting.

### Installing Datalad on Mac and Unix Operating Systems

DataLad can be easily installed using [pip](https://pip.pypa.io/en/stable/).

`pip install datalad`

Unfortunately, it currently requires manually installing the [git-annex](https://git-annex.branchable.com/) dependency, which is not automatically installed using pip.

If you are using macOS, we recommend installing git-annex using the [Homebrew](https://brew.sh/) package manager.

`brew install git-annex`

If you are on Debian/Ubuntu we recommend enabling the [NeuroDebian](http://neuro.debian.net/) repository and installing with apt-get.

`sudo apt-get install datalad`

For more installation options, we recommend reading the [git-annex documentation](https://git-annex.branchable.com/) and the DataLad [User Handbook](http://handbook.datalad.org/en/latest/).


In [1]:
!pip install datalad



### Installing Datalad on Windows Operating Systems

Installing DataLad on Windows can be a little trickier than on Unix-based operating systems, and there are limited tutorials available. Hopefully, Windows users will find this tutorial useful.

DataLad requires several components to work:
1. **Python**
2. **Git** 
3. **GitAnnex** 
4. **Datalad**

There is a good chance you may already have Python or Git installed on your computer. However, this may be problematic as DataLad requires specific configurations for both Python and Git installations in order to work. These are detailed on the DataLad website, but they can be easy to miss or skip over, especially if you already have some of these packages installed. Here is a summary of what you should check, as well as how to potentially resolve problems without having to reinstall things. If you don't have Python or Git installed yet, you can follow these instructions and installation should be relatively straightforward.

#### 1) Python  

**If you need to install Python:**
The [Anaconda Distribution](https://www.anaconda.com/products/distribution) has the most relevant packages for scientific computing already included and is widely recommended. Be sure to get Python 3, and the default installer options are generally safe, except **be sure to select the *ADD PYTHON TO PATH* option**, otherwise Datalad will not work. After you're done, proceed to Step 2 on Git.

**If you already have Python installed on your computer:**  
+ You may run into problems when installing DataLad if you did not add Python to your Windows path when installing Python.
+ This is especially likely because the Anaconda distribution installer **strongly discourages** you from adding Python to the path when navigating through the installation dialogue.
+ You can check if Python is on your path by doing the following:
    + Press WindowsKey + x and click "System" in the menu that pops up
    + Scroll down to "Related settings" and click "Advanced system settings"
    + Under the "Advanced" tab, click "Environment Variables"
    + In the "User Variables" pane you should see a variable called "Path" with some values corresponding to the path of your Python installation.
+ If Python is on your path, you *should* be good to go.
+ If Python is not on your path, you have two options:
    1. Uninstall Python, then reinstall, being sure to select the add Python to Windows path option this time (this option is recommended and guaranteed to work)
    2. Try adding Python to your Windows path manually:
        + If you already installed DataLad, you should uninstall it and reinstall after doing this
        + Instructions for adding Python to your path can be found [here](https://datatofish.com/add-python-to-windows-path/), but what you need to add to the path may differ for different distributions.  For instance, my Anaconda distribution has several other folders listed in its path entry that are not listed for the more basic Python distribution used at the link above.  For completeness if you want to try on your own, my Path has these elements:
            
            ```
            C:\Users\MyUserName\anaconda3
            C:\Users\MyUserName\anaconda3\Library\mingw-w64\bin
            C:\Users\MyUserName\anaconda3\Library\usr\bin
            C:\Users\MyUserName\anaconda3\Library\bin
            C:\Users\MyUserName\anaconda3\Scripts
            ```

#### 2) Git

**If you don't have Git already installed**, it can be found [here](https://git-scm.com/download/win). The default installation options are recommended for most things, but be sure to configure the following options when installing:
- Enable *Use a TrueType font in all console windows*
- Select *Git from the command line and also from 3rd-party software*
- *Enable file system caching*
- *Enable symbolic links*

**If you already have Git installed** you should check your configuration settings. You can do so by opening the command prompt and typing:

    > git config --list
    
Somewhere in the list of variables that pops up you should see:

    core.fscache=true
    core.symlinks=true
    
If not, run the following commands from command prompt to change those settings:

    > git config --global core.symlinks true
    > git config --global core.fscache true
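
The same check can be scripted. The sketch below is only an illustration, not part of the DataLad instructions: the function names are hypothetical, and it assumes `git` is on your path and that `git config --list` prints one `key=value` pair per line.

```python
import subprocess

# The two settings this tutorial asks you to enable.
REQUIRED = {"core.symlinks": "true", "core.fscache": "true"}


def parse_git_config(text):
    """Parse `git config --list` output (one key=value pair per line)."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            # Later entries override earlier ones, as in Git itself.
            settings[key.strip().lower()] = value.strip()
    return settings


def missing_settings(settings, required=REQUIRED):
    """Return the required settings that are absent or set to another value."""
    return {k: v for k, v in required.items() if settings.get(k, "").lower() != v}


def check_git_config():
    """Run git and report which required settings still need to be changed."""
    out = subprocess.run(
        ["git", "config", "--list"], capture_output=True, text=True, check=True
    ).stdout
    return missing_settings(parse_git_config(out))
```

Calling `check_git_config()` returns an empty dictionary when both settings are already in place; otherwise it lists the ones to change with the `git config --global` commands above.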
    
The ***Git from the command line and also from 3rd-party software*** option is the recommended setting during installation. To check, you can do one of two things:
1. Navigate to C:\Program Files\Git\etc\install-options, and check for the line "Path Option: Cmd" within that file, **OR**
2. You can check your Windows path to see if Git is on the path (follow the steps described above for checking if Python is on your Windows path). Git will appear under the "System variables" pane under "Path" instead of under the "User variables" pane.
    + If it isn't there, instructions for adding Git to the path can be found [here](https://www.delftstack.com/howto/git/add-git-to-path-on-windows/#:~:text=Click%20Environment%20Variables%20under%20System,%5Cbin%5Cgit.exe%20.) but this is untested as to whether it will work correctly, especially if you've already installed git-annex and DataLad.

Unfortunately, as far as I'm aware, you cannot check whether the ***Use a TrueType font in all console windows*** option was selected. It is also unclear what the implications of not selecting it are and whether it would cause DataLad to not work. If DataLad doesn't work for you once you get there, it is possible that you will need to reinstall Git.

#### 3) Git Annex
**DO NOT INSTALL GIT ANNEX directly from their website**, because this does not currently seem to work. The Windows installer is still in beta and there are some known issues. Luckily, you can use the git-annex installer provided by DataLad, which does work.

Run these three commands from the command line to install git-annex:

    > pip install datalad-installer
    > datalad-installer git-annex -m datalad/packages
    > git config --global filter.annex.process "git-annex filter-process"
    
#### 4) Datalad
Installing datalad itself is easy too. Run the following in the command line:

    > pip install datalad
    
You are now ready to get started with DataLad! (after you read this [general Warning for windows users from DataLad](https://handbook.datalad.org/en/latest/intro/windows.html#ohnowindows)) And as a final tip, DataLad seems to work best on Windows when used via its [Python API](http://docs.datalad.org/en/latest/modref.html) which can be easily accessed in Python as follows:

    import datalad.api as dl
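
Before moving on, it can help to confirm that all four components are reachable from the command line. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it only checks that each executable is on your path, not that it is configured correctly.

```python
import shutil


def find_tools(names=("python", "git", "git-annex", "datalad")):
    """Map each executable name to its full path, or None if it is not on the path."""
    return {name: shutil.which(name) for name in names}


def report(found):
    """Format the result of find_tools() as one line per tool."""
    lines = []
    for name, location in found.items():
        status = location if location else "NOT FOUND on path"
        lines.append(f"{name:<10} {status}")
    return "\n".join(lines)


print(report(find_tools()))
```

Any tool reported as not found points back to the corresponding step above.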

#### Windows Path Separators
    
When using DataLad via the command line, you will need to first navigate to the folder where the data was installed before you can download the data (this doesn't matter when using the Python API).

The **cd** command is used in the command prompt to **c**hange **d**irectory. You might notice that Windows paths use a different separator than the URLs and Mac/Unix paths shown elsewhere in this tutorial. The Windows file separator is the backslash `\`, whereas other operating systems and URLs use the forward slash `/`. If you're ever using Python to run DataLad commands, or to load and save any kind of data in Python more generally, you may run into problems if you copy folder paths from Windows File Explorer into Python, because Python treats a single backslash inside a string as the start of an escape sequence. You can fix this in your Python scripts by switching all the `\` to `/`, or you can use a double `\\`, which also works.

+ When you open the command prompt, you are in a default directory, which is displayed on the command line. Likely this is `C:\Users\YourUserName` and that will show up in the command line as:
 
        C:\Users\YourUserName> _
        (which is where the ">" before all the code lines comes from)
+ If your installed dataset lives at `C:\Users\YourUserName\ClassData\Localizer`, you need to navigate to that directory using the cd command before the DataLad `get` command will work, which would look like this in the command line:

        C:\Users\YourUserName> cd ClassData\Localizer
        
+ Once you are in the data directory, you don't have to type the entire filepath, and you can run a command like
    
        datalad get sub-S01

+ And in the command line that whole thing would be rendered like this:

        C:\Users\YourUserName\ClassData\Localizer> datalad get sub-S01
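
To make the path separator issue concrete, here is a small sketch of three ways to write a Windows path safely in Python. The folder names are the same placeholders used above. `PureWindowsPath` is used so the example behaves the same on any operating system; on an actual Windows machine you would typically use `pathlib.Path` instead.

```python
from pathlib import PureWindowsPath

# A path copied from Windows File Explorer. The r"" prefix makes this a raw
# string, so backslashes are kept literally instead of starting an escape.
copied = r"C:\Users\YourUserName\ClassData\Localizer"

# The same path written with doubled backslashes is an identical string.
doubled = "C:\\Users\\YourUserName\\ClassData\\Localizer"

# PureWindowsPath parses Windows-style paths on any operating system,
# and as_posix() rewrites the path with forward slashes.
forward = PureWindowsPath(copied).as_posix()

# The / operator joins path components without typing separators by hand.
subject_dir = PureWindowsPath(copied) / "sub-S01"

print(copied == doubled)
print(forward)
print(subject_dir)
```

Building paths with `pathlib` or `os.path.join` rather than by hand avoids most separator problems when a notebook is shared between Windows and Mac/Unix users.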


## Download Data with DataLad

The Pinel localizer dataset can be accessed at the following location: https://gin.g-node.org/ljchang/Localizer/. To download the Localizer dataset, run `datalad install https://gin.g-node.org/ljchang/Localizer` in a terminal in the location where you would like to install the dataset. Don't forget to change the directory to a folder on your local computer. The full dataset is approximately 42 GB.

You can run this from the notebook by prefixing the command with `!`, which passes it to the system shell.

In [3]:
%cd ~/Dropbox/Dartbrains/data

!datalad install https://gin.g-node.org/ljchang/Localizer

/Users/lukechang/Dropbox/Dartbrains/data

## Datalad Basics

You might be surprised to find that, after cloning, the dataset barely takes up any space (you can check with `du -sh`). This is because cloning only downloads the metadata of the dataset, which is enough to see what files are included.

You can check to see how big the entire dataset would be if you downloaded everything using `datalad status --annex`.

In [3]:
%cd ~/Dropbox/Dartbrains/data/Localizer

!datalad status --annex

/Users/lukechang/Dropbox/Dartbrains/data/Localizer
1794 annex'd files (42.1 GB recorded total size)

### Getting Data
One of the really nice features of datalad is that you can see all of the data without actually storing it on your computer. When you want a specific file, you use `datalad get <filename>` to download that specific file. Importantly, you do not need to download all of the data at once, only when you need it.

Now that we have cloned the repository, we can grab individual files. For example, suppose we wanted to grab the `participants.tsv` file at the top level of the dataset.

In [4]:
!datalad get participants.tsv


Now we can check how much of the total dataset we have downloaded using `datalad status --annex all`.

In [7]:
!datalad status --annex all

1794 annex'd files (0.0 B/42.1 GB present/total size)

If you would like to download all of the files, you can use `datalad get .`. Depending on the size of the dataset and the speed of your internet connection, this might take a while. One really nice thing about datalad is that if your connection is interrupted, you can simply run `datalad get .` again and it will resume where it left off.

You can also install the dataset and download all of the files with a single command `datalad install -g https://gin.g-node.org/ljchang/Localizer`. You may want to do this if you have a lot of storage available and a fast internet connection. For most people, we recommend only downloading the files you need for a specific tutorial.

### Dropping Data
Most people do not have unlimited space on their hard drives and are constantly looking for ways to free up space when they are no longer actively working with files. Any file in a dataset can be removed using `datalad drop`. Importantly, this does not delete the file from the dataset; it only removes the local copy of its content from your computer. You will still be able to see the file's metadata after it has been dropped, in case you want to download it again in the future.

As an example, let's drop the Localizer `participants.tsv` file.

In [8]:
!datalad drop participants.tsv


## Datalad has a Python API!
One particularly nice aspect of datalad is that it has a Python API, which means that anything you would like to do with datalad in the command line can also be run in Python. See the details of the datalad [Python API](http://docs.datalad.org/en/latest/modref.html).

For example, suppose you would like to clone a data repository, such as the Localizer dataset. You can run `dl.clone(source=url, path=location)`. Make sure you set `localizer_path` to the location where you would like the Localizer repository installed.

In [5]:
import os
import glob
import datalad.api as dl
import pandas as pd

localizer_path = '/Users/lukechang/Dropbox/Dartbrains/data/Localizer'

dl.clone(source='https://gin.g-node.org/ljchang/Localizer', path=localizer_path)




<Dataset path=/Users/lukechang/Dropbox/Dartbrains/data/Localizer>

We can now create a dataset instance using `dl.Dataset(path_to_data)`.

In [6]:
ds = dl.Dataset(localizer_path)
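
If you rerun the notebook, cloning into a folder that already holds the dataset is likely to fail. A minimal sketch of one way to make the step repeatable is shown below. The helper name `ensure_dataset` and its `api` parameter are hypothetical; the sketch assumes that `dl.Dataset(path).is_installed()` reports whether a dataset already exists at that location.

```python
def ensure_dataset(url, path, api=None):
    """Return a Dataset at `path`, cloning it from `url` only if needed."""
    if api is None:
        # Imported lazily so the function can also be tried with a stand-in.
        import datalad.api as api
    ds = api.Dataset(path)
    if not ds.is_installed():
        ds = api.clone(source=url, path=path)
    return ds


# Usage with the names from this tutorial:
# ds = ensure_dataset('https://gin.g-node.org/ljchang/Localizer', localizer_path)
```

The optional `api` argument exists only so the logic can be checked without a network connection; in normal use you can leave it out.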

How much of the dataset have we downloaded?  We can check the status of the annex using `ds.status(annex='all')`.

In [12]:
results = ds.status(annex='all')

1794 annex'd files (0.0 B/42.1 GB present/total size)
1794 annex'd files (0.0 B/42.1 GB present/total size)


Looks like it's empty, which makes sense since we only cloned the dataset. 

Now we need to get some data. Let's start with something small to play with first.

Let's use `glob` to find all of the tab-delimited confound data generated by fmriprep. 

In [14]:
file_list = glob.glob(os.path.join(localizer_path, '*', 'fmriprep', '*', 'func', '*tsv'))
file_list.sort()
file_list[:10]

['/Users/lukechang/Dropbox/Dartbrains/data/Localizer/derivatives/fmriprep/sub-S01/func/sub-S01_task-localizer_desc-confounds_regressors.tsv',
 '/Users/lukechang/Dropbox/Dartbrains/data/Localizer/derivatives/fmriprep/sub-S02/func/sub-S02_task-localizer_desc-confounds_regressors.tsv',
 '/Users/lukechang/Dropbox/Dartbrains/data/Localizer/derivatives/fmriprep/sub-S03/func/sub-S03_task-localizer_desc-confounds_regressors.tsv',
 '/Users/lukechang/Dropbox/Dartbrains/data/Localizer/derivatives/fmriprep/sub-S04/func/sub-S04_task-localizer_desc-confounds_regressors.tsv',
 '/Users/lukechang/Dropbox/Dartbrains/data/Localizer/derivatives/fmriprep/sub-S05/func/sub-S05_task-localizer_desc-confounds_regressors.tsv',
 '/Users/lukechang/Dropbox/Dartbrains/data/Localizer/derivatives/fmriprep/sub-S06/func/sub-S06_task-localizer_desc-confounds_regressors.tsv',
 '/Users/lukechang/Dropbox/Dartbrains/data/Localizer/derivatives/fmriprep/sub-S07/func/sub-S07_task-localizer_desc-confounds_regressors.tsv',
 '/Use

glob can search the filetree and see all of the relevant data even though none of it has been downloaded yet.
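
Because every file is visible before it is downloaded, it is useful to be able to tell which paths actually have content on disk. The sketch below assumes the dataset stores annexed files as symbolic links, as described earlier; on Windows, or in datasets where files are unlocked, git-annex may use small placeholder files instead, and this check would not apply. The function name is hypothetical.

```python
import os
import tempfile


def split_by_presence(paths):
    """Separate paths whose content is on disk from links without content."""
    present, missing = [], []
    for p in paths:
        # os.path.exists follows symbolic links, so a link whose target
        # has not been downloaded yet is reported as not existing.
        (present if os.path.exists(p) else missing).append(p)
    return present, missing


# Small self-contained demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    downloaded = os.path.join(tmp, "downloaded.tsv")
    with open(downloaded, "w") as f:
        f.write("a\tb\n")
    placeholder = os.path.join(tmp, "placeholder.tsv")
    # A link pointing at content that is not there, like an undownloaded file.
    os.symlink(os.path.join(tmp, "content-not-downloaded"), placeholder)
    present, missing = split_by_presence([downloaded, placeholder])

print(len(present), "present,", len(missing), "missing")
```

With the tutorial's variables, `split_by_presence(file_list)` would show which confound files still need a `ds.get()`.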

Let's now download the first subject's confound regressor file and load it using pandas.

In [15]:
result = ds.get(file_list[0])

confounds = pd.read_csv(file_list[0], sep='\t')
confounds.head()

Unnamed: 0,csf,csf_derivative1,csf_derivative1_power2,csf_power2,white_matter,white_matter_derivative1,white_matter_power2,white_matter_derivative1_power2,global_signal,global_signal_derivative1,...,rot_x_derivative1_power2,rot_x_power2,rot_y,rot_y_derivative1,rot_y_derivative1_power2,rot_y_power2,rot_z,rot_z_derivative1,rot_z_derivative1_power2,rot_z_power2
0,5164.630182,,,26673400.0,4006.007667,,16048100.0,,3753.537871,,...,,4.016403e-07,0.000344,,,1.180596e-07,-0.000701,,,4.914346e-07
1,5178.481411,13.851229,191.856548,26816670.0,4011.819383,5.811716,16094690.0,33.776043,3760.408417,6.870546,...,8.62298e-09,2.925631e-07,0.000569,0.000225,5.063355e-08,3.233253e-07,-0.000776,-7.5e-05,5.666476e-09,6.026417e-07
2,5161.040643,-17.440768,304.180395,26636340.0,4006.766409,-5.052974,16054180.0,25.532548,3756.426086,-3.982332,...,6.975673e-08,6.480347e-07,0.000655,8.6e-05,7.409422e-09,4.286255e-07,-0.000524,0.000253,6.390582e-08,2.740564e-07
3,5150.604178,-10.436465,108.919794,26528720.0,4008.586021,1.819612,16068760.0,3.310987,3751.56609,-4.859996,...,1.673784e-07,1.567265e-07,0.000554,-0.000101,1.011674e-08,3.070412e-07,-0.000605,-8.2e-05,6.72236e-09,3.66623e-07
4,5172.441161,21.836983,476.85381,26754150.0,4007.189291,-1.39673,16057570.0,1.950854,3746.2982,-5.26789,...,2.102616e-08,2.925631e-07,0.000997,0.000443,1.959195e-07,9.934926e-07,-0.00084,-0.000235,5.510428e-08,7.059982e-07


What if we wanted to drop that file? Just like the CLI, we can use `ds.drop(file_name)`.

In [16]:
result = ds.drop(file_list[0])

To confirm that it is actually removed, let's try to load it again with pandas.

In [17]:
confounds = pd.read_csv(file_list[0], sep='\t')


Looks like it was successfully removed.
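
The get, use, drop pattern shown above can be wrapped so that the file is always dropped when you are done with it. This is only a sketch: `fetched` is a hypothetical helper, and it assumes `ds` is any object with `get(path)` and `drop(path)` methods, such as the `dl.Dataset` instance created earlier.

```python
from contextlib import contextmanager


@contextmanager
def fetched(ds, path):
    """Download `path` on entry and drop it again on exit."""
    ds.get(path)
    try:
        yield path
    finally:
        # Runs even if the body raises, so the disk space is reclaimed.
        ds.drop(path)


# Usage with the names from this tutorial:
# with fetched(ds, file_list[0]) as f:
#     confounds = pd.read_csv(f, sep='\t')
```

The data frame stays in memory after the block ends, while the file content on disk is removed.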

We can also download the entire dataset in one command if we want using `ds.get('.', recursive=True)`. We are not going to do it right now as this will take a while and require lots of free hard disk space.

Let's actually download one of the files we will be using in the tutorial. First, let's use glob to get a list of all of the functional data that has been preprocessed by fmriprep, denoised, and smoothed.

In [18]:
file_list = glob.glob(os.path.join(localizer_path, 'derivatives', 'fmriprep', '*', 'func', '*task-localizer_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz'))
file_list.sort()
file_list

['/Users/lukechang/Dropbox/Dartbrains/data/Localizer/derivatives/fmriprep/sub-S01/func/sub-S01_task-localizer_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz',
 '/Users/lukechang/Dropbox/Dartbrains/data/Localizer/derivatives/fmriprep/sub-S02/func/sub-S02_task-localizer_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz',
 '/Users/lukechang/Dropbox/Dartbrains/data/Localizer/derivatives/fmriprep/sub-S03/func/sub-S03_task-localizer_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz',
 '/Users/lukechang/Dropbox/Dartbrains/data/Localizer/derivatives/fmriprep/sub-S04/func/sub-S04_task-localizer_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz',
 '/Users/lukechang/Dropbox/Dartbrains/data/Localizer/derivatives/fmriprep/sub-S05/func/sub-S05_task-localizer_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz',
 '/Users/lukechang/Dropbox/Dartbrains/data/Localizer/derivatives/fmriprep/sub-S06/func/sub-S06_task-localizer_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz',
 '/Users/lukechang/Dro

Now let's download the first subject's file using `ds.get()`. This file is about 107 MB, so this might take a few minutes depending on your internet speed.

In [19]:
result = ds.get(file_list[0])

HBox(children=(FloatProgress(value=0.0, description='derivatives .. bold.nii.gz', max=112128274.0, style=Progr…

How much of the dataset have we downloaded?  We can check the status of the annex using `ds.status(annex='all')`.

In [21]:
result = ds.status(annex='all')

1794 annex'd files (106.9 MB/42.1 GB present/total size)
1794 annex'd files (106.9 MB/42.1 GB present/total size)
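
The `result` returned by `ds.status()` can also be summarized directly rather than read from the printed line. The sketch below assumes each result record is a dictionary that carries a `bytesize` entry for annexed files and a `has_content` flag when `annex='all'` is requested; treat those key names as an assumption to verify against the DataLad documentation. The records used here are stand-ins, not real output.

```python
def summarize_annex(records):
    """Return (bytes present locally, total bytes) over annexed files."""
    present = total = 0
    for rec in records:
        size = rec.get("bytesize")
        if size is None:
            continue  # not an annexed file, or size unknown
        size = int(size)
        total += size
        if rec.get("has_content"):
            present += size
    return present, total


# Stand-in records; with DataLad these would come from ds.status(annex='all').
example = [
    {"path": "a.nii.gz", "bytesize": 112128274, "has_content": True},
    {"path": "b.nii.gz", "bytesize": 100000000, "has_content": False},
    {"path": "README", "type": "file"},
]
present, total = summarize_annex(example)
print(f"{present / total:.1%} of {total} bytes present")
```

With a real dataset you would call `summarize_annex(result)` on the list returned above.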


## Download Data for Course
Now let's download the data we will use for the course. We will download:
- `sub-S01`'s raw data
- experimental metadata
- preprocessed data for the first 20 subjects including the fmriprep QC reports.


In [17]:
result = ds.get(os.path.join(localizer_path, 'sub-S01'))
result = ds.get(glob.glob(os.path.join(localizer_path, '*.json')))
result = ds.get(glob.glob(os.path.join(localizer_path, '*.tsv')))
result = ds.get(glob.glob(os.path.join(localizer_path, 'phenotype')))

In [16]:
file_list = glob.glob(os.path.join(localizer_path, '*', 'fmriprep', 'sub*'))
file_list.sort()
for f in file_list[:20]:
    result = ds.get(f)
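
One thing to watch for when slicing a sorted glob result: if the `fmriprep` folder contains a per-subject HTML report next to each subject folder (an assumption about the layout, based on the QC reports mentioned above), then the first 20 entries mix folders and reports and cover fewer than 20 subjects. A minimal sketch that selects by subject ID instead is shown below; the function name is hypothetical and the demonstration uses a made-up temporary folder tree.

```python
import glob
import os
import tempfile


def targets_for_first_subjects(localizer_path, n=20):
    """List the fmriprep entries belonging to the first `n` subjects."""
    pattern = os.path.join(localizer_path, "*", "fmriprep", "sub*")
    by_subject = {}
    for p in glob.glob(pattern):
        # "sub-S01" and "sub-S01.html" both belong to subject "sub-S01".
        subject = os.path.basename(p).split(".")[0]
        by_subject.setdefault(subject, []).append(p)
    selected = sorted(by_subject)[:n]
    return [p for s in selected for p in sorted(by_subject[s])]


# Demonstration on a made-up folder tree with three subjects.
with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "derivatives", "fmriprep")
    for sub in ("sub-S01", "sub-S02", "sub-S03"):
        os.makedirs(os.path.join(root, sub))
        open(os.path.join(root, sub + ".html"), "w").close()
    names = [os.path.basename(p) for p in targets_for_first_subjects(tmp, n=2)]

print(names)
```

With the tutorial's variables, you could loop over `targets_for_first_subjects(localizer_path, n=20)` and call `ds.get(f)` on each entry.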

To get the python packages for the course be sure to read the installation {ref}`instructions <python-packages>` in the {doc}`../content/Introduction_to_JupyterHub` tutorial.

(run-preprocessing)= 
## Preprocessing
The data has already been preprocessed using [fmriprep](https://fmriprep.readthedocs.io/en/stable/), which is a robust but opinionated automated preprocessing pipeline developed by [Russ Poldrack's group at Stanford University](https://poldracklab.stanford.edu/). The developers have made a number of choices about how to preprocess your fMRI data using best practices and have created an automated pipeline using multiple software packages that are all distributed via a [docker container](https://fmriprep.org/en/1.5.9/docker.html).

Although you are welcome to start working right away with the preprocessed data, here are the steps to run it yourself:

 - 1. Install [Docker](https://www.docker.com/) and download image
     
     `docker pull poldracklab/fmriprep:<latest-version>`


 - 2. Run a single command in the terminal specifying the location of the data, the location of the output, the participant id, and a few specific flags depending on specific details of how you want to run the preprocessing.

    `fmriprep-docker /Users/lukechang/Dropbox/Dartbrains/Data/localizer /Users/lukechang/Dropbox/Dartbrains/Data/preproc participant --participant_label sub-S01 --write-graph --fs-no-reconall --notrack --fs-license-file ~/Dropbox/Dartbrains/License/license.txt --work-dir /Users/lukechang/Dropbox/Dartbrains/Data/work`
    
In practice, it's always a little bit finicky to get everything set up on a particular system. Sometimes you might run into issues with a specific missing file like the [freesurfer license](https://fmriprep.readthedocs.io/en/stable/usage.html#the-freesurfer-license), even if you're not using it. You might also run into issues with the format of the data that conflict with the [bids-validator](https://github.com/bids-standard/bids-validator). In our experience, there are always some frustrations getting this to work, but it's very nice once it's done.
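
Since the `fmriprep-docker` call in step 2 is long, it can be easier to assemble it in Python and reuse it across participants. The sketch below only builds the argument list using the same flags as the command above. The function name and the example paths are hypothetical, and nothing is executed.

```python
import shlex


def fmriprep_docker_command(bids_dir, output_dir, participant, license_file, work_dir):
    """Assemble the fmriprep-docker call from this tutorial as an argument list."""
    return [
        "fmriprep-docker", bids_dir, output_dir, "participant",
        "--participant_label", participant,
        "--write-graph", "--fs-no-reconall", "--notrack",
        "--fs-license-file", license_file,
        "--work-dir", work_dir,
    ]


cmd = fmriprep_docker_command(
    bids_dir="/path/to/Localizer",        # hypothetical locations
    output_dir="/path/to/preproc",
    participant="sub-S01",
    license_file="/path/to/license.txt",
    work_dir="/path/to/work",
)
# Print the command as a single shell-quoted line for inspection.
print(shlex.join(cmd))
```

To actually run it, you could pass `cmd` to `subprocess.run(cmd, check=True)`. Note that a `~` in a path is not expanded when arguments are passed as a list, so use `os.path.expanduser` or full paths.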