Download Data

Written by Luke Chang & Kevin Ortego

Many of the imaging tutorials throughout this course will use open data from the Pinel Localizer task.

The Pinel Localizer task was designed to probe several different types of basic cognitive processes, such as visual perception, finger tapping, language, and math. Several of the tasks are cued by reading text on the screen (i.e., visual modality) and also by hearing auditory instructions (i.e., auditory modality). The trials are randomized across conditions and have been optimized to maximize efficiency for a rapid event-related design. There are 100 trials in total over a 5-minute scanning session. For more specific details, read the original paper describing the task and the dataset paper.

This dataset is well suited for these tutorials as it is (a) publicly available to anyone in the world, (b) relatively small (only about 5min), and (c) provides many options to create different types of contrasts.

There are a total of 94 subjects available, but we will primarily only be working with a smaller subset of about 15.

Though the data is being shared on the OSF website, we recommend downloading it from our HuggingFace repository, as we have fixed a few issues with BIDS formatting and have also preprocessed the data using fMRIPrep. We also have a legacy G-Node repository that uses DataLad, but it is currently very slow.

In this notebook, we will walk through how to access the dataset, first from HuggingFace and then, as a legacy alternative, using DataLad. Note that the entire dataset is fairly large (~42 GB), but the tutorials will mostly only be working with a small portion of the data (5.8 GB), so there is no need to download the entire thing. If you are taking the Psych60 course at Dartmouth, we have already made the data available on the jupyterhub server.

The Pinel Localizer dataset is hosted on HuggingFace. This is the recommended way to access the data for this course. Files are downloaded automatically and cached locally — no extra tools needed.

Using the Course Helper Module

The dartbrains_tools.data module provides convenient functions to download and access any file in the dataset:

from dartbrains_tools.data import get_file, get_subjects, load_events, get_tr, REPO_ID

# List all subjects
print(f"Subjects: {get_subjects()}")
print(f"TR: {get_tr()} seconds")

# Download a preprocessed BOLD file (cached after first download)
bold_path = get_file('S01', 'derivatives', 'bold')
print(f"\nBOLD file path: {bold_path}")
Subjects: ['S01', 'S02', 'S03', 'S04', 'S05', 'S06', 'S07', 'S08', 'S09', 'S10', 'S11', 'S12', 'S13', 'S14', 'S15', 'S16', 'S17', 'S18', 'S19', 'S20']
TR: 2.4 seconds

BOLD file path: /home/runner/.cache/huggingface/hub/datasets--dartbrains--localizer/snapshots/493f7614c8b7cdc0593a89eb0635f10669b30a10/derivatives/fmriprep/sub-S01/func/sub-S01_task-localizer_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz

Loading Event Timing Data

Each subject's task events (stimulus onsets, durations, and conditions) can be loaded directly as a DataFrame:

events = load_events('S01')
events.head(10)
   onset  duration               trial_type
0    0.0         1        video_computation
1    2.4         1        video_computation
2    8.7         1  horizontal_checkerboard
3   11.4         1         audio_right_hand
4   15.0         1           audio_sentence
5   18.0         1         video_right_hand
6   20.7         1           audio_sentence
7   23.7         1          audio_left_hand
8   26.7         1          video_left_hand
9   29.7         1           audio_sentence
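
Once the events are loaded, a quick tabulation shows how the conditions are distributed. The sketch below rebuilds the ten rows displayed above by hand, so it runs without downloading anything, and counts trials per condition. The variable names are illustrative; with the real data you would call the same pandas methods on the DataFrame returned by load_events.

```python
import pandas as pd

# First ten events for S01, copied by hand from the table above.
events_head = pd.DataFrame(
    {
        "onset": [0.0, 2.4, 8.7, 11.4, 15.0, 18.0, 20.7, 23.7, 26.7, 29.7],
        "duration": [1] * 10,
        "trial_type": [
            "video_computation", "video_computation", "horizontal_checkerboard",
            "audio_right_hand", "audio_sentence", "video_right_hand",
            "audio_sentence", "audio_left_hand", "video_left_hand",
            "audio_sentence",
        ],
    }
)

# Number of trials per condition within these ten rows.
condition_counts = events_head["trial_type"].value_counts()
print(condition_counts)

# Gap between successive onsets, in seconds (rounded to avoid float noise).
onset_gaps = events_head["onset"].diff().dropna().round(1)
print(onset_gaps.tolist())
```

The uneven gaps between onsets reflect the jittered timing of a rapid event-related design.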

Direct File Access

You can also download any file directly from HuggingFace using hf_hub_download:

from huggingface_hub import hf_hub_download

# Download a specific beta map
path = hf_hub_download(
    repo_id=REPO_ID,
    filename="derivatives/betas/S01_betas.nii.gz",
    repo_type="dataset",
)
print(f"Downloaded to: {path}")
Downloaded to: /home/runner/.cache/huggingface/hub/datasets--dartbrains--localizer/snapshots/493f7614c8b7cdc0593a89eb0635f10669b30a10/derivatives/betas/S01_betas.nii.gz
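
hf_hub_download needs the file's path inside the repository. As a minimal sketch, the hypothetical helpers below (not part of dartbrains_tools) build that path for a subject's preprocessed BOLD file, following the naming pattern visible in the BOLD path printed earlier, and pass it to hf_hub_download. The repository id is written out as "dartbrains/localizer", which is what the cache path suggests REPO_ID resolves to.

```python
def preproc_bold_relpath(subject: str) -> str:
    """Repo-relative path of a subject's preprocessed BOLD file (hypothetical helper)."""
    return (
        f"derivatives/fmriprep/sub-{subject}/func/"
        f"sub-{subject}_task-localizer_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz"
    )


def download_preproc_bold(subject: str, repo_id: str = "dartbrains/localizer") -> str:
    """Download one subject's preprocessed BOLD file and return its local path."""
    # Imported lazily so the path helper above works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=repo_id,
        filename=preproc_bold_relpath(subject),
        repo_type="dataset",
    )


print(preproc_bold_relpath("S01"))
```

Calling download_preproc_bold("S01") should return the same cached file that get_file('S01', 'derivatives', 'bold') does, assuming the naming pattern holds for every subject.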

Browsing the Dataset as a BIDS Tree

hf_hub_download and get_file() cache files in ~/.cache/huggingface/hub/datasets--dartbrains--localizer/, but the cache uses a content-addressed layout (blobs/ for raw bytes, snapshots/<commit>/ for symlinks back to those blobs with their original filenames). The snapshots/ folder does preserve the original BIDS tree exactly, but the path is awkward to access. If you'd rather browse the dataset like a normal BIDS directory (cd into it, ls subjects, drag it into a file explorer, point external tools at it), the cleanest pattern is to download a full snapshot and symlink it to a friendly location of your choice.

First, pull the snapshot. Files you've already cached with get_file() or hf_hub_download are reused, so this is fast on a second call:

from huggingface_hub import snapshot_download

snapshot_path = snapshot_download(
    repo_id=REPO_ID,
    repo_type="dataset",
)
print(f"Snapshot lives at:\n  {snapshot_path}")
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.

Fetching ... files: 3642it [03:20, 18.20it/s]
Snapshot lives at:
  /home/runner/.cache/huggingface/hub/datasets--dartbrains--localizer/snapshots/493f7614c8b7cdc0593a89eb0635f10669b30a10
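
If you only need a few subjects, snapshot_download also accepts an allow_patterns argument that restricts which files are fetched. The sketch below is one way to use it. The helper names are hypothetical, the raw-data folder name sub-<subject>/ is an assumption about the repository layout, and the patterns are assumed to be matched fnmatch-style, so check the repository before relying on them.

```python
from fnmatch import fnmatch


def subject_patterns(subject: str) -> list[str]:
    """Glob patterns selecting one subject's files (layout partly assumed)."""
    return [
        f"sub-{subject}/*",                       # assumed raw BIDS folder
        f"derivatives/fmriprep/sub-{subject}/*",  # fMRIPrep outputs, as in the paths above
    ]


def download_subject(subject: str, repo_id: str = "dartbrains/localizer") -> str:
    """Fetch only one subject's files into the cache and return the snapshot path."""
    from huggingface_hub import snapshot_download  # lazy import

    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns=subject_patterns(subject),
    )


# Offline check of which repository paths the patterns would select.
example = (
    "derivatives/fmriprep/sub-S01/func/"
    "sub-S01_task-localizer_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz"
)
matches_s01 = any(fnmatch(example, p) for p in subject_patterns("S01"))
matches_s02 = any(fnmatch(example, p) for p in subject_patterns("S02"))
print(matches_s01, matches_s02)
```

Fetching fewer files also means fewer requests, which may help avoid the rate limiting that unauthenticated downloads can run into.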

Now create a symlink from somewhere convenient (e.g. ~/data/localizer) pointing at the snapshot. The symlink takes ~no disk space and lets you treat the cached data as if it lived in ~/data/localizer:

from pathlib import Path

bids_root = Path.home() / "data" / "localizer"
bids_root.parent.mkdir(parents=True, exist_ok=True)

if bids_root.is_symlink():
    bids_root.unlink()  # replace any stale symlink
bids_root.symlink_to(snapshot_path)

print(f"Browse the BIDS tree at: {bids_root}")

# Sanity check: list the top-level entries
for _entry in sorted(bids_root.iterdir())[:10]:
    print(_entry.name)

A few practical notes:

  • Reuses the cache. Both the snapshot folder and your symlink target ultimately point at the same content-addressed blobs. Files aren't duplicated, and huggingface_hub won't re-download anything you already have.
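  • Lazy fetch + symlink in one step. If you'd rather have huggingface_hub materialize the tree directly at your chosen path (without going through snapshot_download's default cache location), pass local_dir= and local_dir_use_symlinks=True. Because ~ may not be expanded for you, it is safer to replace it with the full path to your home directory:
    snapshot_download(
        repo_id="dartbrains/localizer", repo_type="dataset",
        local_dir="~/data/localizer", local_dir_use_symlinks=True,
    )
    
    With local_dir_use_symlinks=True, ~/data/localizer will contain symlinks to the cached blobs: same end state, no extra disk usage. This behavior depends on your huggingface_hub version. To our knowledge, releases from 0.23 onward deprecate local_dir_use_symlinks and write real files into local_dir, which uses extra disk space, so check the documentation for your installed version.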
  • Windows caveat. Symlinks on Windows require either developer mode enabled or admin privileges. If you hit a permission error, pass local_dir_use_symlinks=False to copy the bytes instead (uses ~the size of the snapshot in extra disk space).
  • Updating. If the dataset is updated on HuggingFace, re-run snapshot_download — it pulls only the changed blobs and updates the snapshot folder. Your symlink still points to the right place.
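
To make the blobs/snapshots/symlink relationship concrete, here is a toy reconstruction of the layout in a temporary directory. It is only a sketch: the folder name, the blob name abc123, and the commit name deadbeef are made up, and the real cache uses relative symlinks and real hashes. It needs a platform where creating symlinks is permitted.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "datasets--example--repo"  # stand-in for the real cache folder

    # 1. The file's bytes live once, under blobs/, named by a content hash.
    blob = cache / "blobs" / "abc123"
    blob.parent.mkdir(parents=True)
    blob.write_text("participant_id\nS01\n")

    # 2. snapshots/<commit>/ holds symlinks that carry the original filenames.
    snapshot = cache / "snapshots" / "deadbeef"
    snapshot.mkdir(parents=True)
    (snapshot / "participants.tsv").symlink_to(blob)

    # 3. A friendly symlink points at the snapshot folder.
    friendly = Path(tmp) / "localizer"
    friendly.symlink_to(snapshot, target_is_directory=True)

    # Following both links from the friendly path ends at the single blob.
    resolved = (friendly / "participants.tsv").resolve()
    same_blob = resolved == blob.resolve()
    content = (friendly / "participants.tsv").read_text()

print(same_blob)
print(content)
```

Both the snapshot path and the friendly path lead to the same bytes, which is why neither the symlink nor a repeated download duplicates data.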

Bulk Loading with the datasets Library

For loading all beta maps or events at once, use the datasets library:

from datasets import load_dataset

betas_ds = load_dataset("dartbrains/localizer", "betas")
print(f"Loaded {len(betas_ds['train'])} beta maps")
print(f"First entry: subject={betas_ds['train'][0]['subject']}, condition={betas_ds['train'][0]['condition']}")

Downloading Data with DataLad (Legacy)

The dataset is also available via DataLad from the GIN repository. This was the original download method and still works as an alternative.

DataLad

The easiest way to access the data on the GIN repository is using DataLad, which is an open source version control system for data built on top of git-annex. Think of it like git for data. It provides a handy command line interface for downloading data, tracking changes, and sharing it with others.

While DataLad offers a number of useful features for working with datasets, there are three in particular that we think make it worth the effort to install for this course.

1) Cloning a DataLad Repository can be completed with a single line of code datalad clone <repository> and provides the full directory structure in the form of symbolic links. This allows you to explore all of the files in the dataset, without having to download the entire dataset at once.

2) Specific files can be easily downloaded using datalad get <filename>, and files can be removed from your computer at any time using datalad drop <filename>. As these datasets are large, this will allow you to only work with the data that you need for a specific tutorial and you can drop the rest when you are done with it.

3) All of the DataLad commands can be run within Python using the DataLad Python API.

We will only be covering a few basic DataLad functions to get and drop data. We encourage the interested reader to read the very comprehensive DataLad User Handbook for more details and troubleshooting.

Installing Datalad on Mac and Unix Operating Systems

DataLad can be easily installed using pip.

pip install datalad

Unfortunately, it currently requires manually installing the git-annex dependency, which is not automatically installed using pip.

If you are using macOS, we recommend installing git-annex using the Homebrew package manager.

brew install git-annex

If you are on Debian/Ubuntu we recommend enabling the NeuroDebian repository and installing with apt-get.

sudo apt-get install datalad

For more installation options, we recommend reading the DataLad installation instructions.


Installing Datalad on Windows Operating Systems

Installing DataLad on Windows can be a little trickier than on Unix-based operating systems, and there are limited tutorials available. Hopefully, Windows users will find this tutorial useful.

DataLad requires several components to work:

  1. Python
  2. Git
  3. Git Annex
  4. DataLad

There is a good chance you may already have Python or Git installed on your computer. However, this may be problematic as DataLad requires specific configurations for both Python and Git installations in order to work. These are detailed on the DataLad website, but they can be easy to miss or skip over, especially if you already have some of these packages installed. Here is a summary of what you should check, as well as how to potentially resolve problems without having to reinstall things. If you don't have Python or Git installed yet, you can follow these instructions and installation should be relatively straightforward.

1) Python

If you need to install Python: The Anaconda Distribution has the most relevant packages for scientific computing already included and is widely recommended. Be sure to get Python 3, and the default installer options are generally safe, except be sure to select the ADD PYTHON TO PATH option, otherwise Datalad will not work. After you're done, proceed to Step 2 on Git.

If you already have Python installed on your computer:

  • You may run into problems when installing DataLad if you did not add Python to your Windows path when installing Python.
  • This is especially likely because the Anaconda distribution installer strongly discourages you from adding Python to the path when navigating through the installation dialogue.
  • You can check if Python is on your path by doing the following:
    • Press WindowsKey + x and click "System" in the menu that pops up
    • Scroll down to "Related settings" and click "Advanced system settings"
    • Under the "Advanced" tab, click "Environment Variables"
    • In the "User Variables" pane you should see a variable called "Path" with some values corresponding to the path of your Python installation.
  • If Python is on your path, you should be good to go.
  • If Python is not on your path, you have two options:
    1. Uninstall Python, then reinstall, being sure to select the add Python to Windows path option this time (this option is recommended and guaranteed to work)
    2. Try adding Python to your Windows path manually:
      • If you already installed DataLad, you should uninstall it and reinstall after doing this
      • Instructions for adding Python to your path can be found here, but what you need to add to the path may differ for different distributions. For instance, my Anaconda distribution has several other folders listed in its path entry that are not listed for the more basic Python distribution used at the link above. For completeness if you want to try on your own, my Path has these elements:

        ```
        C:\Users\MyUserName\anaconda3
        C:\Users\MyUserName\anaconda3\Library\mingw-w64\bin
        C:\Users\MyUserName\anaconda3\Library\usr\bin
        C:\Users\MyUserName\anaconda3\Library\bin
        C:\Users\MyUserName\anaconda3\Scripts
        ```
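
As an alternative to clicking through the Windows settings dialogs, you can inspect the path from Python itself. This is a minimal sketch using only the standard library; path_entries is a hypothetical helper name, and the tool names looked up are simply the four components this tutorial installs.

```python
import os
import shutil


def path_entries(path_value: str, sep: str = os.pathsep) -> list[str]:
    """Split a PATH string into its non-empty entries."""
    return [entry for entry in path_value.split(sep) if entry]


# Where (if anywhere) the shell would find these executables; None means not on the path.
for tool in ("python", "git", "git-annex", "datalad"):
    print(f"{tool}: {shutil.which(tool)}")

# Entries of the current PATH that mention Python or Anaconda.
for entry in path_entries(os.environ.get("PATH", "")):
    if "python" in entry.lower() or "anaconda" in entry.lower():
        print(entry)
```

If shutil.which returns None for python in a fresh command prompt session, Python is most likely not on your path.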

2) Git

If you don't have Git already installed, it can be found here. The default installation options are recommended for most things, but be sure to configure the following options when installing:

  • Enable Use a TrueType font in all console windows
  • Select Git from the command line and also from 3rd-party software
  • Enable file system caching
  • Enable symbolic links

If you already have Git installed you should check your configuration settings. You can do so by opening the command prompt and typing:

> git config --list

Somewhere in the list of variables that pops up you should see:

core.fscache=true
core.symlinks=true

If not, run the following commands from command prompt to change those settings:

> git config --global core.symlinks true
> git config --global core.fscache true
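
The same check can be scripted. The sketch below parses the key=value lines that git config --list prints and reports which of the two required options are not enabled. The function names are hypothetical, and check_git assumes git is already on your path.

```python
import subprocess


def parse_git_config(listing: str) -> dict[str, str]:
    """Turn `git config --list` output (one key=value per line) into a dict."""
    settings = {}
    for line in listing.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an "=" sign
            settings[key.strip().lower()] = value.strip()
    return settings


def missing_settings(settings: dict[str, str]) -> list[str]:
    """Names of the required options that are not set to true."""
    required = ("core.fscache", "core.symlinks")
    return [name for name in required if settings.get(name, "").lower() != "true"]


def check_git() -> list[str]:
    """Run git and report which required options still need to be enabled."""
    result = subprocess.run(
        ["git", "config", "--list"], capture_output=True, text=True, check=True
    )
    return missing_settings(parse_git_config(result.stdout))


# Offline demonstration on a hand-written listing.
sample = "user.name=Example\ncore.symlinks=true\ncore.fscache=false"
print(missing_settings(parse_git_config(sample)))
```

An empty list from check_git() means both options are already enabled.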

The Git from the command line and also from 3rd-party software option is the recommended setting during installation. To check whether it was selected, you can do one of two things:

  1. Navigate to C:\Program Files\Git\etc\install-options, and check for the line "Path Option: Cmd" within that file, OR
  2. Check your Windows path to see if Git is on the path (follow the steps described above for checking if Python is on your Windows path). Git will appear under the "System variables" pane under "Path" instead of under the "User variables" pane.
    • If it isn't there, instructions for adding Git to the path can be found here, but it is untested whether this will work correctly, especially if you've already installed git-annex and DataLad.

As far as I'm aware, you cannot check after the fact whether the Use a TrueType font in all console windows option was selected, and it is unclear whether leaving it unselected would prevent DataLad from working. If DataLad doesn't work for you once you get there, it is possible that you will need to reinstall Git.

3) Git Annex

DO NOT INSTALL GIT ANNEX directly from their website, because this does not currently seem to work. The Windows installer is still in beta and there are some known issues. Luckily, you can use the git-annex installer provided by DataLad, which does work.

Run these three commands from the command line to install git-annex:

> pip install datalad-installer
> datalad-installer git-annex -m datalad/packages
> git config --global filter.annex.process "git-annex filter-process"

4) Datalad

Installing datalad itself is easy too. Run the following in the command line:

> pip install datalad

You are now ready to get started with DataLad (after you read this general warning for Windows users from DataLad)! As a final tip, DataLad seems to work best on Windows when used via its Python API, which can be easily accessed in Python as follows:

import datalad.api as dl

Windows Path Separators

When using DataLad via the command line you will need to first navigate to the folder where the data was installed before you can download the data (this doesn't matter when using the Python API).

The cd command is used in the command prompt to change directory. You might notice that the path separators differ between operating systems: Windows uses the backslash \ as its file separator, whereas Mac/Unix systems and URLs use the forward slash /. If you're ever using Python to run DataLad commands, or to load and save any kind of data in Python more generally, you may run into problems if you copy folder paths from Windows File Explorer into Python, because Python treats a single backslash inside an ordinary string as the start of an escape sequence. You can fix this in your Python scripts by switching all the \ to /, or you can use a double \\, which also works.

  • When you open the command prompt, you are in a default directory, which is displayed on the command line. Likely this is C:\Users\YourUserName\ and that will show up in the command line as:

    C:\Users\YourUserName> _
    (which is where the ">" before all the code lines comes from)
    
  • If your installed dataset lives at C:\Users\YourUserName\ClassData\Localizer\ you need to navigate to that directory using the cd command before the DataLad get command will work, which would look like this in the command line:

    C:\Users\YourUserName> cd ClassData\Localizer

  • Once you are in the data directory, you don't have to type the entire filepath, and you can run a command like

    datalad get sub-S01
    
  • And in the command line that whole thing would be rendered like this:

    C:\Users\YourUserName\ClassData\Localizer> datalad get sub-S01
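
In Python, the separator problem described above can be sidestepped with raw strings and pathlib. The sketch below uses PureWindowsPath so that it behaves the same on any operating system; on your own Windows machine you would normally use Path instead. The folder names are the illustrative ones from this section.

```python
from pathlib import PureWindowsPath

# A path as copied from Windows File Explorer; the r prefix keeps backslashes literal.
copied = r"C:\Users\YourUserName\ClassData\Localizer"

data_dir = PureWindowsPath(copied)
subject_dir = data_dir / "sub-S01"  # join without typing any separator

print(data_dir.as_posix())  # forward-slash form of the same path
print(subject_dir)

# Doubling each backslash in an ordinary string gives the same text as a raw string.
same_text = "C:\\Users\\YourUserName" == r"C:\Users\YourUserName"
print(same_text)
```

Letting pathlib join the pieces means the code never has to hard-code either separator.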
    

Download Data with DataLad

The Pinel localizer dataset can be accessed at the following location https://gin.g-node.org/ljchang/Localizer/. To download the Localizer dataset run datalad install https://gin.g-node.org/ljchang/Localizer in a terminal in the location where you would like to install the dataset. Don't forget to change the directory to a folder on your local computer. The full dataset is approximately 42gb.

You can run this from a Jupyter notebook using the ! shell escape (shown commented out below), or from any Python session using subprocess.

import os
import subprocess

# os.chdir does not expand "~" on its own, so expand it explicitly
os.chdir(os.path.expanduser('~/Dropbox/Dartbrains/data'))

#! datalad install https://gin.g-node.org/ljchang/Localizer
subprocess.call(['datalad', 'install', 'https://gin.g-node.org/ljchang/Localizer'])
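
Because os.chdir changes the working directory for the whole session, an alternative is to pass the directory to subprocess through its cwd argument. The sketch below shows the idea with a hypothetical run_in helper and demonstrates it with a harmless command (a child Python process reporting its working directory) rather than DataLad itself.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in(directory, command):
    """Run a command inside `directory` and return (exit code, captured stdout)."""
    result = subprocess.run(
        command,
        cwd=Path(directory).expanduser(),  # expanduser() resolves a leading "~"
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout.strip()


# Demonstration: ask a child Python process which directory it is running in.
with tempfile.TemporaryDirectory() as tmp:
    code, reported = run_in(tmp, [sys.executable, "-c", "import os; print(os.getcwd())"])
    ran_in_tmp = Path(reported).resolve() == Path(tmp).resolve()

print(code, ran_in_tmp)
```

With DataLad installed, the same helper could be called as run_in('~/Dropbox/Dartbrains/data', ['datalad', 'install', 'https://gin.g-node.org/ljchang/Localizer']), leaving the notebook's own working directory untouched.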

Datalad Basics

You might be surprised to find that, after cloning, the dataset barely takes up any space (you can check with du -sh). This is because cloning only downloads the dataset's metadata, which records which files it contains.

You can check how big the entire dataset would be if you downloaded everything using datalad status --annex.

import os
import subprocess

# os.chdir does not expand "~" on its own, so expand it explicitly
os.chdir(os.path.expanduser('~/Dropbox/Dartbrains/data/Localizer'))

#! datalad status --annex
subprocess.call(['datalad', 'status', '--annex'])

Getting Data

One of the really nice features of DataLad is that you can see all of the data without actually storing it on your computer. When you want a specific file, use datalad get <filename> to download it. Importantly, you do not need to download all of the data at once, only what you need, when you need it.

Now that we have cloned the repository we can grab individual files. For example, suppose we wanted to grab the participants.tsv file.

import subprocess

# Run this from inside the cloned Localizer directory
#! datalad get participants.tsv
subprocess.call(['datalad', 'get', 'participants.tsv'])

Now we can check and see how much of the total dataset we have downloaded using datalad status

import subprocess

#! datalad status --annex all
subprocess.call(['datalad', 'status', '--annex', 'all'])
It is highly recommended to configure Git before using DataLad. Set both 'user.name' and 'user.email' configuration variables.
nothing to save, working tree clean
0

If you would like to download all of the files you can use datalad get . (the trailing dot refers to the current directory). Depending on the size of the dataset and the speed of your internet connection, this might take a while. One really nice thing about DataLad is that if your connection is interrupted you can simply run datalad get . again, and it will resume where it left off.

You can also install the dataset and download all of the files with a single command datalad install -g https://gin.g-node.org/ljchang/Localizer. You may want to do this if you have a lot of storage available and a fast internet connection. For most people, we recommend only downloading the files you need for a specific tutorial.

Dropping Data

Most people do not have unlimited space on their hard drives and are constantly looking for ways to free up space when they are no longer actively working with files. Any file in a dataset can be removed using datalad drop. Importantly, this does not delete the file from the dataset; it only removes the local copy of its content from your computer. You will still be able to see the file's metadata after it has been dropped, in case you want to download it again in the future.

As an example, let's drop the Localizer participants.tsv file.

import subprocess

#! datalad drop participants.tsv
subprocess.call(['datalad', 'drop', 'participants.tsv'])
It is highly recommended to configure Git before using DataLad. Set both 'user.name' and 'user.email' configuration variables.
0

Datalad has a Python API!

One particularly nice aspect of datalad is that it has a Python API, which means that anything you can do with datalad on the command line can also be done in Python. See the details of the datalad Python API.

For example, suppose you would like to clone a data repository, such as the Localizer dataset. You can run dl.clone(source=url, path=location). Make sure you set localizer_path to the location where you would like the Localizer repository installed.

import os
import glob
import datalad.api as dl
import pandas as pd

localizer_path = '/Users/lukechang/Dropbox/Dartbrains/data/Localizer'

dl.clone(source='https://gin.g-node.org/ljchang/Localizer', path=localizer_path)
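
Rerunning the clone cell after the dataset has already been installed is likely to fail, because the target directory is no longer empty. A minimal sketch of a guard follows; clone_if_missing is a hypothetical helper, and it assumes that an installed dataset can be recognized by the .datalad directory at its root.

```python
import os


def clone_if_missing(source, path, clone=None):
    """Clone `source` into `path` unless a dataset is already there.

    Hypothetical helper. Assumes an installed dataset has a `.datalad`
    directory at its root. Returns True if a clone was performed.
    """
    if os.path.isdir(os.path.join(path, '.datalad')):
        return False  # already installed, nothing to do
    if clone is None:
        # Imported lazily so the check above works without datalad.
        import datalad.api as dl
        clone = dl.clone
    clone(source=source, path=path)
    return True
```

Calling clone_if_missing('https://gin.g-node.org/ljchang/Localizer', localizer_path) can then be repeated safely across notebook runs.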

We can now create a dataset instance using dl.Dataset(path_to_data).

ds = dl.Dataset(localizer_path)

How much of the dataset have we downloaded? We can check the status of the annex using ds.status(annex='all').

results = ds.status(annex='all')

Looks like it's empty, which makes sense since we only cloned the dataset.
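
The list returned by ds.status(annex='all') can be long, so it helps to tally it. The sketch below is an illustration only: summarize_annex_status is a hypothetical helper, and it assumes each annexed file's result record carries a bytesize and a has_content entry, which is our reading of the datalad status documentation. Check the records from your own datalad version before relying on it.

```python
def summarize_annex_status(results):
    """Tally how many annexed files (and bytes) are present locally.

    `results` is a list of dict records such as those returned by
    ds.status(annex='all'). The 'bytesize' and 'has_content' keys are
    assumptions about the record layout.
    """
    summary = {'files': 0, 'files_present': 0, 'bytes': 0, 'bytes_present': 0}
    for record in results:
        if 'has_content' not in record:
            continue  # not an annexed file (e.g. a file stored in git)
        size = record.get('bytesize') or 0
        summary['files'] += 1
        summary['bytes'] += size
        if record['has_content']:
            summary['files_present'] += 1
            summary['bytes_present'] += size
    return summary
```

Right after cloning, summarize_annex_status(ds.status(annex='all')) should report zero files present.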

Now we need to get some data. Let's start with something small to play with first.

Let's use glob to find all of the tab-delimited confound data generated by fmriprep.

file_list = glob.glob(os.path.join(localizer_path, '*', 'fmriprep', '*', 'func', '*tsv'))
file_list.sort()
file_list[:10]

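glob can search the file tree and see all of the relevant files even though none of their content has been downloaded yet. This works because cloning a DataLad dataset creates the full directory structure with a lightweight placeholder for each file.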
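
To tell a placeholder apart from a file whose content is actually on disk, you can compare os.path.lexists and os.path.exists. This is a minimal sketch with a hypothetical helper name. It assumes the default git-annex layout on macOS and Linux, where an annexed file is a symbolic link that only resolves once the content has been fetched; on Windows or other setups the placeholders may look different.

```python
import os


def describe_annexed_file(path):
    """Report whether `path` has content, is only a placeholder, or is missing.

    Assumes annexed files are symlinks that dangle until their content
    has been retrieved (default git-annex behavior on POSIX systems).
    """
    if os.path.exists(path):   # follows symlinks, so the content is there
        return 'content present'
    if os.path.lexists(path):  # the link itself exists, its target does not
        return 'placeholder only'
    return 'missing'
```

Under this assumption, describe_annexed_file(file_list[0]) should change from 'placeholder only' to 'content present' once the file has been retrieved with ds.get.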

Let's now download the first subject's confound regressor file and load it using pandas.

result = ds.get(file_list[0])

confounds = pd.read_csv(file_list[0], sep='\t')
confounds.head()

What if we wanted to drop that file? Just like the CLI, we can use ds.drop(file_name).

result_1 = ds.drop(file_list[0])

To confirm that it is actually removed, let's try to load it again with pandas.

confounds_1 = pd.read_csv(file_list[0], sep='\t')

Looks like it was successfully removed.

We can also download the entire dataset in one command if we want, using ds.get(path='.', recursive=True). We are not going to do that right now, as it will take a while and requires a lot of free disk space.

Let's actually download one of the files we will be using in the tutorial. First, let's use glob to get a list of all of the functional data that has been preprocessed by fmriprep, denoised, and smoothed.

file_list_1 = glob.glob(os.path.join(localizer_path, 'derivatives', 'fmriprep', '*', 'func', '*task-localizer_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz'))
file_list_1.sort()
file_list_1

Now let's download the first subject's file using ds.get(). This file is 825 MB, so this might take a few minutes depending on your internet speed.

result_2 = ds.get(file_list_1[0])

How much of the dataset have we downloaded? We can check the status of the annex using ds.status(annex='all').

result_3 = ds.status(annex='all')

Download Data for Course

Now let's download the data we will use for the course. We will download:

- sub-S01's raw data
- experimental metadata
- preprocessed data for the first 20 subjects, including the fmriprep QC reports

result_4 = ds.get(os.path.join(localizer_path, 'sub-S01'))
result_4 = ds.get(glob.glob(os.path.join(localizer_path, '*.json')))
result_4 = ds.get(glob.glob(os.path.join(localizer_path, '*.tsv')))
result_4 = ds.get(glob.glob(os.path.join(localizer_path, 'phenotype')))
file_list_2 = glob.glob(os.path.join(localizer_path, '*', 'fmriprep', 'sub*'))
file_list_2.sort()
for f in file_list_2[:20]:
    result_5 = ds.get(f)
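
The same selection can be gathered into one list and passed to a single ds.get call. The sketch below is one possible arrangement, not the course's official download script. collect_course_paths is a hypothetical helper, and it assumes the standard fmriprep layout in which each subject's QC report is stored as sub-XX.html next to the sub-XX folder. It selects subject folders first and then adds the matching report, so that n_subjects counts subjects rather than directory entries.

```python
import glob
import os


def collect_course_paths(root, n_subjects=20):
    """Collect the paths needed for the course under the dataset `root`."""
    # Raw data for the first subject plus top-level metadata.
    paths = [os.path.join(root, 'sub-S01')]
    for pattern in ('*.json', '*.tsv', 'phenotype'):
        paths.extend(sorted(glob.glob(os.path.join(root, pattern))))

    # Subject folders are real directories even before content is fetched.
    candidates = sorted(glob.glob(os.path.join(root, '*', 'fmriprep', 'sub*')))
    subject_dirs = [p for p in candidates if os.path.isdir(p)]
    for subject_dir in subject_dirs[:n_subjects]:
        paths.append(subject_dir)
        report = subject_dir + '.html'  # assumed location of the QC report
        if os.path.lexists(report):
            paths.append(report)
    return paths
```

With the dataset object from above, ds.get(collect_course_paths(localizer_path)) would then request everything in one call, since ds.get accepts a list of paths.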

To get the Python packages for the course, install the dependencies listed in pyproject.toml using uv sync.

(run-preprocessing)=

Preprocessing

The data has already been preprocessed using fmriprep, a robust but opinionated automated preprocessing pipeline developed by Russ Poldrack's group at Stanford University. The developers have made a number of choices about how to preprocess fMRI data using best practices and have created an automated pipeline that combines multiple software packages, all distributed via a Docker container.

Although you are welcome to start working with the preprocessed data right away, here are the steps to run the preprocessing yourself:

    1. Install Docker and download image

    docker pull poldracklab/fmriprep:<latest-version>

    2. Run a single command in the terminal specifying the location of the data, the location of the output, the participant id, and a few flags that depend on how you want to run the preprocessing.

    fmriprep-docker /Users/lukechang/Dropbox/Dartbrains/Data/localizer /Users/lukechang/Dropbox/Dartbrains/Data/preproc participant --participant_label sub-S01 --write-graph --fs-no-reconall --notrack --fs-license-file ~/Dropbox/Dartbrains/License/license.txt --work-dir /Users/lukechang/Dropbox/Dartbrains/Data/work
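
If you need to run this for several participants, it can be easier to assemble the command in Python. The following is a minimal sketch that reuses only the flags from the command above; build_fmriprep_command is a hypothetical helper and the paths in the usage note are placeholders.

```python
import os


def build_fmriprep_command(bids_dir, output_dir, participant_label,
                           work_dir, fs_license_file):
    """Return the fmriprep-docker call as an argument list.

    Hypothetical helper mirroring the flags of the command shown above.
    """
    return [
        'fmriprep-docker', bids_dir, output_dir, 'participant',
        '--participant_label', participant_label,
        '--write-graph', '--fs-no-reconall', '--notrack',
        # Expand "~" here because no shell will do it for us.
        '--fs-license-file', os.path.expanduser(fs_license_file),
        '--work-dir', work_dir,
    ]
```

The resulting list can be passed to subprocess.run(cmd, check=True), for example with cmd = build_fmriprep_command('/data/localizer', '/data/preproc', 'sub-S01', '/data/work', '~/license.txt').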

In practice, it's always a little bit finicky to get everything set up on a particular system. Sometimes you might run into issues with a specific missing file, like the freesurfer license, even if you're not using it. You might also find that the format of your data conflicts with the bids-validator. In our experience, there are always some frustrations getting this to work, but it's very nice once it's done.