LAION-fMRI

A densely-sampled 7T fMRI dataset spanning four image distributions (LAION-natural, MSCOCO, THINGS, and out-of-distribution images), designed to broadly cover the image space and enable robust replication and generalization of visual neuroscience findings.

What's included

Dataset Contents

Data modalities

All data follows BIDS format, with raw volumes in subject directories and processed outputs in derivatives/. The primary starting point for most analyses is the GLMsingle beta estimates.

fMRI beta estimates

GLMsingle single-trial BOLD responses: (n_trials × n_voxels) per session

Noise ceiling maps

Per-session and cross-session estimates of maximum explainable variance

Train/test splits

Bundled random, tau, cluster_k5, and OOD splits for re:vision analyses

Stimulus derivatives

Public metadata, CLIP/DINOv2/PEcore/SigLIP2 embeddings, captions, and segmentations

Anatomical (T1w)

High-resolution structural MRI for cortical surface reconstruction via FreeSurfer

Diffusion MRI

DTI data in sub-XX/dwi/ for white matter characterization

Retinotopic mapping

Phase-encoded retinotopy for delineating early visual areas

Functional localizers

Category-selective region localizers (faces, scenes, bodies, objects)

ROI masks

Volumetric, surface, and FreeSurfer-label region-of-interest masks ready for analysis

Stimulus images

25,052 JPEG stimuli (1000 × 1000 px) packed in a dataset-wide HDF5 file

Python package

How to Access

Installation

bash
python -m pip install "git+https://github.com/ViCCo-Group/LAION-fMRI.git@main"

Installs the current laion_fmri Python package directly from GitHub. The S3 bucket is publicly accessible, so no AWS credentials are required.

Discover

python
from laion_fmri.config import dataset_initialize
from laion_fmri.discovery import describe, get_rois, get_subjects

dataset_initialize("/path/to/data")      # one-time setup
print(get_subjects())                     # ['sub-01', 'sub-03', ...]
print(get_rois("sub-03", category="face"))
describe()                                # human-readable bucket summary

Call dataset_initialize once to register your local data directory. Use get_subjects() to list participants and describe() for a summary of bucket contents.

Download

python
from laion_fmri.download import download

# Download one session (BIDS-aware, idempotent)
download(subject="sub-03", ses="ses-01", n_jobs=4)

# Download only subject-level aggregate maps
download(subject="sub-03", ses="averages")

# Download everything for all subjects
download(subject="all", n_jobs=4)

# Include stimulus images
download(subject="sub-03", ses="ses-01", include_stimuli=True, n_jobs=4)

Fetches data from S3 to your local directory. Downloads are idempotent, so re-running skips files already present. Use ses="averages" to retrieve only cross-session aggregate maps.

Stimulus files & metadata

python
from laion_fmri.download import (
    download_stimuli,
    download_embeddings,
    download_segmentations,
    download_captions,
)

download_stimuli()              # raw images + metadata
download_embeddings()           # CLIP, DINOv2, PEcore, SigLIP2
download_segmentations()        # shared-image object masks
download_captions()             # human + AI captions

Raw stimulus images require the Data Use Agreement (DUA), which the data loader requests during stimulus download.

Load betas & trial info

python
from laion_fmri.subject import load_subject

sub   = load_subject("sub-03")
betas = sub.get_betas(session="ses-01")           # (n_trials, n_voxels), float32

# Filter to a region of interest
betas_ffa = sub.get_betas(session="ses-01", roi="FFA1")

# Only keep well-driven voxels
betas_nc  = sub.get_betas(session="ses-01", nc_threshold=0.2)

# Get trial metadata as a DataFrame
trials = sub.get_trial_info(session="ses-01")     # image IDs, conditions, ...

get_betas() returns single-trial beta estimates as a float32 array of shape (n_trials, n_voxels). Restrict to a brain region with roi=, filter for reliably driven voxels with nc_threshold=, or retrieve trial metadata as a DataFrame with get_trial_info().
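Because images can be presented more than once, a common next step is to average the single-trial betas over repeats of the same image. The sketch below shows one way to do that with NumPy on synthetic data. The `betas` array and the `image_ids` vector are stand-ins for what get_betas() and get_trial_info() return; the actual name of the image-ID column in the trial table is not specified here, so treat `image_ids` as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 6 trials, 4 voxels, 3 images shown twice each.
betas = rng.standard_normal((6, 4)).astype(np.float32)
image_ids = np.array(["img_a", "img_b", "img_a", "img_c", "img_b", "img_c"])


def average_repeats(betas, image_ids):
    """Return (unique_ids, mean_betas) with one row per unique image."""
    unique_ids, inverse = np.unique(image_ids, return_inverse=True)
    sums = np.zeros((len(unique_ids), betas.shape[1]), dtype=np.float64)
    # Accumulate each trial's betas into the row of its image.
    np.add.at(sums, inverse, betas)
    counts = np.bincount(inverse, minlength=len(unique_ids))
    return unique_ids, (sums / counts[:, None]).astype(betas.dtype)


unique_ids, mean_betas = average_repeats(betas, image_ids)
print(unique_ids.tolist())
print(mean_betas.shape)
```

Averaging trades the number of samples for a higher signal-to-noise ratio per image; whether that is desirable depends on the analysis.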

Load stimulus-aligned data

python
import laion_fmri
from laion_fmri.subject import load_subject

stim = laion_fmri.load_stimuli()
stim.metadata.head()                               # 25,052 image rows
stim.embeddings.get("CLIP", "shared_12rep_LAION_cluster_1003_i0.jpg")
stim.captions.human("shared_12rep_LAION_cluster_1003_i0.jpg")
stim.segmentations.nouns("shared_12rep_LAION_cluster_1003_i0.jpg")

sub = load_subject("sub-03")
trials = sub.metadata                              # trial table across sessions
X = sub.embeddings.all("CLIP", session="ses-01")   # rows align to betas
img = sub.images.get(42)                           # PIL image for trial 42

The stimulus hub exposes images, metadata, embeddings, captions, and object masks by image name. Subject-level namespaces expose the same modalities by global trial index, so rows line up with the beta and trial tables.
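To make the relationship between the two views concrete, the following sketch builds a trial-aligned feature matrix from an image-name lookup. It does not use the package; `embedding_by_image`, `trial_images`, and `build_trial_features` are hypothetical names, and the three-dimensional vectors are toy values chosen for readability.

```python
import numpy as np

# Hypothetical image-name -> embedding lookup (3-D vectors for brevity).
embedding_by_image = {
    "img_a.jpg": np.array([1.0, 0.0, 0.0]),
    "img_b.jpg": np.array([0.0, 1.0, 0.0]),
    "img_c.jpg": np.array([0.0, 0.0, 1.0]),
}

# Image shown on each trial, in the same order as the beta rows.
trial_images = ["img_b.jpg", "img_a.jpg", "img_b.jpg", "img_c.jpg"]


def build_trial_features(trial_images, embedding_by_image):
    """Stack one embedding per trial so row i matches beta row i."""
    missing = sorted(set(trial_images) - set(embedding_by_image))
    if missing:
        raise KeyError(f"No embedding for images: {missing}")
    return np.stack([embedding_by_image[name] for name in trial_images])


X = build_trial_features(trial_images, embedding_by_image)
print(X.shape)  # (4, 3)
```

Repeated images simply yield repeated rows, which is what keeps the feature matrix aligned with the single-trial betas.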

Train / test splits

python
from laion_fmri.splits import get_split_masks
from laion_fmri.subject import load_subject
import numpy as np
import pandas as pd

sub = load_subject("sub-03")
sessions = sub.get_sessions()
betas_per_session = sub.get_betas(session=sessions, roi="face")
trials_per_session = sub.get_trial_info(session=sessions)

betas = np.concatenate(list(betas_per_session.values()), axis=0)
trials = pd.concat(list(trials_per_session.values()), ignore_index=True)

# Random split — one of five seeded 80/20 baselines (random_0 … random_4)
train_mask, test_mask = get_split_masks(trials, "random_0", pool="shared")

# Within-distribution split (tau) — balanced by image-space coverage
train_mask, test_mask = get_split_masks(trials, "tau", pool="shared")

# OOD cluster split — 5-fold cross-validation across semantic clusters
for k in range(5):
    train_mask, test_mask = get_split_masks(trials, f"cluster_k5_{k}", pool="shared")

# OOD images — train on shared pool, test on held-out OOD images
train_mask, test_mask = get_split_masks(
    trials, "ood", pool="shared", ood_types=["shape", "unusual", "cropped"]
)

# Apply any mask to betas
X_train, X_test = betas[train_mask], betas[test_mask]

Each split corresponds to a generalization method: tau for Method 1 (within-distribution), cluster_k5_{k} for Method 2 (OOD clusters), and ood for Method 3 (OOD images). Pools are available for the shared images and for each subject; ood_types= can restrict Method 3 to selected OOD categories. The random_* splits are simple baselines for replication analyses.
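As an illustration of how such boolean masks are typically consumed, here is a minimal encoding-model sketch on synthetic data: a closed-form ridge regression fit on the training trials and scored by voxelwise Pearson correlation on the test trials. Everything in it is an assumption made for the example (the random masks stand in for the output of get_split_masks, the features stand in for image embeddings, and `fit_ridge` and `voxelwise_pearson` are hypothetical helpers); it is not the analysis pipeline used by the dataset authors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_features, n_voxels = 200, 8, 5

# Synthetic features and betas generated from a known linear map plus noise.
X = rng.standard_normal((n_trials, n_features))
W_true = rng.standard_normal((n_features, n_voxels))
betas = X @ W_true + 0.1 * rng.standard_normal((n_trials, n_voxels))

# Boolean masks standing in for the output of get_split_masks (80/20).
test_mask = np.zeros(n_trials, dtype=bool)
test_mask[rng.choice(n_trials, size=n_trials // 5, replace=False)] = True
train_mask = ~test_mask


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge without intercept: W = (X^T X + alpha I)^-1 X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)


def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between matching columns of Y_true and Y_pred."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return (yt * yp).sum(axis=0) / denom


W = fit_ridge(X[train_mask], betas[train_mask])
r = voxelwise_pearson(betas[test_mask], X[test_mask] @ W)
print(r.round(3))
```

For the cluster_k5 splits, the same fit-and-score step would be repeated once per fold and the scores averaged across folds.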

Noise ceiling & inspection

python
from laion_fmri.subject import load_subject

sub = load_subject("sub-03")

# Noise ceiling: max explainable variance per voxel (0-100)
nc = sub.get_noise_ceiling(session="ses-01")      # (n_voxels,)
nc_12rep = sub.get_noise_ceiling(desc="Noiseceiling12rep")

# Inspect available sessions and ROIs
print(sub.get_sessions())
print(sub.get_available_rois())
print(sub.get_available_categories())
print(f"Brain-mask voxels: {sub.get_n_voxels()}")

Noise ceiling scores express the theoretical maximum variance explainable by any model, per voxel. The subject object also exposes methods to list sessions, ROI categories, and ROI masks, and to report the brain-mask voxel count.
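A typical use of the noise ceiling is to express model performance as a fraction of the explainable variance. The sketch below assumes the noise ceiling is on the percent scale (0-100) noted in the code comment above and that model performance is an R² fraction; the values are synthetic, and `normalize_by_noise_ceiling` and its `min_nc` cutoff are hypothetical choices for the example, not package defaults.

```python
import numpy as np

# Synthetic per-voxel noise ceiling in percent (0-100).
noise_ceiling = np.array([5.0, 40.0, 80.0, 0.0, 25.0])
# Hypothetical model performance: explained variance (R^2) as a fraction.
r2 = np.array([0.02, 0.30, 0.40, 0.01, 0.05])


def normalize_by_noise_ceiling(r2, noise_ceiling, min_nc=10.0):
    """Return (keep_mask, normalized_scores) for voxels with nc >= min_nc."""
    keep = noise_ceiling >= min_nc
    # Convert percent to a fraction before dividing.
    normalized = r2[keep] / (noise_ceiling[keep] / 100.0)
    return keep, normalized


keep, normalized = normalize_by_noise_ceiling(r2, noise_ceiling)
print(keep.tolist())                  # [False, True, True, False, True]
print(normalized.round(2).tolist())   # [0.75, 0.5, 0.2]
```

Excluding voxels with a very low noise ceiling before dividing avoids inflated or undefined ratios.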

CLI alternative

bash
laion-fmri config --data-dir ./laion_fmri_data
laion-fmri info
laion-fmri download --subject sub-03 --ses ses-01 --n-jobs 4
laion-fmri download-embeddings
laion-fmri download-segmentations
laion-fmri download-captions
laion-fmri download-stimuli

License

fMRI + public derivatives: CC0 1.0
Raw stimulus images: DUA required

Documentation

Full technical docs

Complete API reference, data format specifications, preprocessing pipeline details, example notebooks, and the interactive brain viewer are available at the official documentation site.
