
A densely sampled 7T fMRI dataset spanning four image distributions (LAION-natural, MSCOCO, THINGS, and out-of-distribution images), designed to broadly cover the image space and enable robust replication and generalization of visual neuroscience findings.
Dataset Contents
Data modalities
All data follows BIDS format, with raw volumes in subject directories and processed outputs in derivatives/. The primary starting point for most analyses is the GLMsingle beta estimates.
GLMsingle single-trial BOLD responses: (n_trials × n_voxels) per session
Per-session and cross-session estimates of maximum explainable variance
Bundled random, tau, cluster_k5, and OOD splits for re:vision analyses
Public metadata, CLIP/DINOv2/PEcore/SigLIP2 embeddings, captions, and segmentations
High-resolution structural MRI for cortical surface reconstruction via FreeSurfer
DTI data in sub-XX/dwi/ for white matter characterization
Phase-encoded retinotopy for delineating early visual areas
Category-selective region localizers (faces, scenes, bodies, objects)
Volumetric, surface, and FreeSurfer-label region-of-interest masks ready for analysis
25,052 JPEG stimuli (1000 × 1000 px each) packed in a dataset-wide HDF5 file
How to Access
Installation
python -m pip install "git+https://github.com/ViCCo-Group/LAION-fMRI.git@main"
Installs the current laion_fmri Python package directly from GitHub. The S3 bucket is publicly accessible, so no AWS credentials are required.
Discover
from laion_fmri.config import dataset_initialize
from laion_fmri.discovery import describe, get_rois, get_subjects
dataset_initialize("/path/to/data") # one-time setup
print(get_subjects()) # ['sub-01', 'sub-03', ...]
print(get_rois("sub-03", category="face"))
describe() # human-readable bucket summary
Call dataset_initialize once to register your local data directory. Use get_subjects() to list participants and describe() for a summary of bucket contents.
Download
from laion_fmri.download import download
# Download one session (BIDS-aware, idempotent)
download(subject="sub-03", ses="ses-01", n_jobs=4)
# Download only subject-level aggregate maps
download(subject="sub-03", ses="averages")
# Download everything for all subjects
download(subject="all", n_jobs=4)
# Include stimulus images
download(subject="sub-03", ses="ses-01", include_stimuli=True, n_jobs=4)
Fetches data from S3 to your local directory. Downloads are idempotent, so re-running skips files already present. Use ses="averages" to retrieve only cross-session aggregate maps.
Stimulus files & metadata
from laion_fmri.download import (
    download_stimuli,
    download_embeddings,
    download_segmentations,
    download_captions,
)
download_stimuli() # raw images + metadata
download_embeddings() # CLIP, DINOv2, PEcore, SigLIP2
download_segmentations() # shared-image object masks
download_captions() # human + AI captions
Access to raw stimulus images requires the Data Use Agreement, which the data loader requests during stimulus download.
Load betas & trial info
from laion_fmri.subject import load_subject
sub = load_subject("sub-03")
betas = sub.get_betas(session="ses-01") # (n_trials, n_voxels), float32
# Filter to a region of interest
betas_ffa = sub.get_betas(session="ses-01", roi="FFA1")
# Only keep well-driven voxels
betas_nc = sub.get_betas(session="ses-01", nc_threshold=0.2)
# Get trial metadata as a DataFrame
trials = sub.get_trial_info(session="ses-01") # image IDs, conditions, ...
Returns single-trial beta estimates as a float32 array of shape (n_trials × n_voxels). Restrict to a brain region with roi=, filter for reliably driven voxels with nc_threshold=, or retrieve trial metadata as a DataFrame with get_trial_info().
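A common next step with single-trial betas is to average over repeated presentations of the same image. The following is a minimal sketch on synthetic arrays, not part of the laion_fmri package: average_repeats is a hypothetical helper, and image_ids stands in for whichever image-identifier column the trial table actually provides.

```python
import numpy as np

def average_repeats(betas, image_ids):
    """Average single-trial betas over repeated presentations of each image."""
    image_ids = np.asarray(image_ids)
    # inverse[i] is the row of the output that trial i contributes to
    unique_ids, inverse = np.unique(image_ids, return_inverse=True)
    sums = np.zeros((unique_ids.size, betas.shape[1]), dtype=np.float64)
    np.add.at(sums, inverse, betas)  # accumulate trials per image
    counts = np.bincount(inverse, minlength=unique_ids.size)
    return unique_ids, (sums / counts[:, None]).astype(betas.dtype)

# Synthetic stand-in for (n_trials, n_voxels) betas and their image IDs
betas = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]], dtype=np.float32)
image_ids = ["img_a", "img_b", "img_a", "img_b"]
ids, mean_betas = average_repeats(betas, image_ids)
```

The result has one row per unique image, sorted by image ID, so it can be paired with image-level features rather than trial-level ones.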
Load stimulus-aligned data
import laion_fmri
from laion_fmri.subject import load_subject
stim = laion_fmri.load_stimuli()
stim.metadata.head() # 25,052 image rows
stim.embeddings.get("CLIP", "shared_12rep_LAION_cluster_1003_i0.jpg")
stim.captions.human("shared_12rep_LAION_cluster_1003_i0.jpg")
stim.segmentations.nouns("shared_12rep_LAION_cluster_1003_i0.jpg")
sub = load_subject("sub-03")
trials = sub.metadata # trial table across sessions
X = sub.embeddings.all("CLIP", session="ses-01") # rows align to betas
img = sub.images.get(42) # PIL image for trial 42
The stimulus hub exposes images, metadata, embeddings, captions, and object masks by image name. Subject-level namespaces expose the same modalities by global trial index, so rows line up with the beta and trial tables.
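To illustrate what moving from name-keyed to trial-aligned data involves, here is a minimal sketch using plain NumPy and synthetic values. The helper align_to_trials and the dictionary layout are assumptions for illustration only; the package's subject-level namespaces handle this alignment for you.

```python
import numpy as np

def align_to_trials(embedding_by_image, trial_image_names):
    """Stack per-image embeddings in trial order, repeating rows for repeated images."""
    return np.stack([embedding_by_image[name] for name in trial_image_names])

# Synthetic stand-ins: two images, three trials (one image shown twice)
embedding_by_image = {
    "img_a.jpg": np.array([1.0, 0.0]),
    "img_b.jpg": np.array([0.0, 1.0]),
}
trial_image_names = ["img_b.jpg", "img_a.jpg", "img_b.jpg"]
X = align_to_trials(embedding_by_image, trial_image_names)
```

Row i of X then corresponds to trial i, which is the property that lets a feature matrix be indexed with the same masks as the betas.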
Train / test splits
from laion_fmri.splits import get_split_masks
from laion_fmri.subject import load_subject
import numpy as np
import pandas as pd
sub = load_subject("sub-03")
sessions = sub.get_sessions()
betas_per_session = sub.get_betas(session=sessions, roi="face")
trials_per_session = sub.get_trial_info(session=sessions)
betas = np.concatenate(list(betas_per_session.values()), axis=0)
trials = pd.concat(list(trials_per_session.values()), ignore_index=True)
# Random split — one of five seeded 80/20 baselines (random_0 … random_4)
train_mask, test_mask = get_split_masks(trials, "random_0", pool="shared")
# Within-distribution split (tau) — balanced by image-space coverage
train_mask, test_mask = get_split_masks(trials, "tau", pool="shared")
# OOD cluster split — 5-fold cross-validation across semantic clusters
for k in range(5):
    train_mask, test_mask = get_split_masks(trials, f"cluster_k5_{k}", pool="shared")
    # fit and evaluate fold k here, before the masks are overwritten
# OOD images: train on shared pool, test on held-out OOD images
train_mask, test_mask = get_split_masks(
    trials, "ood", pool="shared", ood_types=["shape", "unusual", "cropped"]
)
# Apply any mask to betas
X_train, X_test = betas[train_mask], betas[test_mask]
Each split corresponds to a generalization method: tau for Method 1 (within-distribution), cluster_k5_{k} for Method 2 (OOD clusters), and ood for Method 3 (OOD images). Pools are available for shared and for each subject; ood_types= can restrict Method 3 to selected OOD categories. The random_* splits are simple baselines for replication analyses.
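Once train and test masks are in hand, a typical use is fitting an encoding model from image features to betas. The sketch below uses closed-form ridge regression on synthetic data. It assumes the masks are boolean arrays of length n_trials (consistent with the indexing above), and the helper names fit_ridge and voxelwise_r2 are hypothetical, not part of laion_fmri.

```python
import numpy as np

def fit_ridge(features, responses, alpha=1.0):
    """Closed-form ridge: solve (X^T X + alpha * I) W = X^T Y for W."""
    n_features = features.shape[1]
    gram = features.T @ features + alpha * np.eye(n_features)
    return np.linalg.solve(gram, features.T @ responses)

def voxelwise_r2(y_true, y_pred):
    """Coefficient of determination, computed separately for each voxel."""
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins for trial-aligned features and betas
rng = np.random.default_rng(0)
n_trials, n_features, n_voxels = 200, 8, 5
features = rng.standard_normal((n_trials, n_features))
true_weights = rng.standard_normal((n_features, n_voxels))
betas = features @ true_weights + 0.1 * rng.standard_normal((n_trials, n_voxels))

# Stand-in for masks returned by a split: first 160 trials train, rest test
train_mask = np.zeros(n_trials, dtype=bool)
train_mask[:160] = True
test_mask = ~train_mask

weights = fit_ridge(features[train_mask], betas[train_mask], alpha=1.0)
r2 = voxelwise_r2(betas[test_mask], features[test_mask] @ weights)
```

In practice alpha would be chosen by cross-validation within the training set, and for the cluster_k5 splits the fit would be repeated once per fold.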
Noise ceiling & inspection
from laion_fmri.subject import load_subject
sub = load_subject("sub-03")
# Noise ceiling: max explainable variance per voxel (0-100)
nc = sub.get_noise_ceiling(session="ses-01") # (n_voxels,)
nc_12rep = sub.get_noise_ceiling(desc="Noiseceiling12rep")
# Inspect available sessions and ROIs
print(sub.get_sessions())
print(sub.get_available_rois())
print(sub.get_available_categories())
print(f"Brain-mask voxels: {sub.get_n_voxels()}")
Noise ceiling scores express the theoretical maximum variance explainable by any model, per voxel. The subject object also exposes methods to list sessions, ROI categories, and ROI masks, and to report the brain-mask voxel count.
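Noise ceilings are often used to express model performance as a fraction of explainable variance. The following is a minimal sketch on synthetic values, assuming the noise ceiling is reported on the 0-100 scale noted in the comment above. The helper normalize_by_noise_ceiling and its min_nc_percent cutoff are hypothetical choices for illustration, not package defaults.

```python
import numpy as np

def normalize_by_noise_ceiling(r2, nc_percent, min_nc_percent=5.0):
    """Divide per-voxel R^2 by the noise ceiling (given in percent).

    Voxels whose noise ceiling falls below min_nc_percent are set to NaN,
    since dividing by a near-zero ceiling gives unstable values.
    """
    r2 = np.asarray(r2, dtype=np.float64)
    nc_percent = np.asarray(nc_percent, dtype=np.float64)
    normalized = np.full(r2.shape, np.nan)
    keep = nc_percent >= min_nc_percent
    normalized[keep] = r2[keep] / (nc_percent[keep] / 100.0)
    return normalized

# Synthetic stand-ins: per-voxel test R^2 and noise ceiling in percent
r2_scores = np.array([0.10, 0.30, 0.02])
nc_values = np.array([20.0, 60.0, 1.0])
normalized = normalize_by_noise_ceiling(r2_scores, nc_values)
```

Here the first two voxels each reach half of their explainable variance, while the third is excluded because its noise ceiling is below the cutoff.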
CLI alternative
laion-fmri config --data-dir ./laion_fmri_data
laion-fmri info
laion-fmri download --subject sub-03 --ses ses-01 --n-jobs 4
laion-fmri download-embeddings
laion-fmri download-segmentations
laion-fmri download-captions
laion-fmri download-stimuli
License
Documentation
Full technical docs
Complete API reference, data format specifications, preprocessing pipeline details, example notebooks, and the interactive brain viewer are available at the official documentation site.