Meta Open Sources Foundation Model That Predicts Brain Responses to Speech, Video, and Text


Meta has released TRIBE v2, an open-source foundation model designed to predict human brain responses to video, audio, and text stimuli, effectively constructing a digital twin of neural activity. The model is trained on more than 500 hours of fMRI recordings from over 700 individuals and builds on an architecture previously benchmarked in the Algonauts 2025 challenge. It supports zero-shot predictions across new subjects, languages, and task settings.

The release includes the model, codebase, paper, and an interactive demo, encouraging applications in computational neuroscience, AI system design, and simulation-based research in neurological disease, as well as broader benchmarking across neuroscience and AI research communities.

Scaling brain modeling beyond single-subject systems

A central constraint in neuroscience has been the need to collect new brain recordings for each experiment, limiting reproducibility and scale. TRIBE v2 approaches this by learning shared representations of brain activity across individuals, allowing it to generate predictions for unseen subjects and tasks.

The model is trained on a dataset that combines long-duration recordings with a relatively large cohort, in contrast to earlier approaches that typically relied on small numbers of participants. This shift enables what is described as zero-shot generalization: the system can infer brain responses for new subjects and tasks without additional calibration.

TRIBE v2 can simulate established neuroscience experiments in silico by generating predicted brain activity for specific stimuli. Given a sentence, for example, it predicts fMRI activity across known language-processing regions, reproducing patterns typically observed in human studies.

The system can map neural responses linked to specific categories such as places, bodies, and faces, as well as higher-level functions including speech processing, semantic understanding, and emotional signals. This suggests it can approximate how different cognitive domains are distributed across the brain.
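As a toy illustration of this kind of region-level readout, the snippet below averages a predicted whole-brain voxel map within category-selective region-of-interest masks. Both the predicted response and the masks are random stand-ins, not TRIBE v2 outputs or a real atlas.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a model's predicted whole-brain response to one stimulus.
n_voxels = 70_000
predicted = rng.standard_normal(n_voxels)

# Boolean masks for hypothetical category-selective regions
# (in practice these would come from a standard anatomical atlas).
masks = {
    "faces (FFA-like)": rng.random(n_voxels) < 0.002,
    "places (PPA-like)": rng.random(n_voxels) < 0.002,
    "language network": rng.random(n_voxels) < 0.01,
}

for name, mask in masks.items():
    print(f"{name}: mean predicted activity {predicted[mask].mean():+.3f}")
```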

Model architecture

TRIBE v2 uses a three-stage pipeline:

  • Tri-modal encoding: pretrained embeddings from audio, video, and text models capture stimulus features aligned with both AI representations and human perception
  • Universal integration: a transformer layer integrates these embeddings into shared representations across tasks and individuals
  • Brain mapping: a subject-specific layer maps these representations onto fMRI voxels, which reflect neural activity via blood oxygenation changes

This design separates generalizable representations from individual variability, allowing the model to scale across datasets and subjects.
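To make the three stages concrete, here is a minimal PyTorch sketch of a tri-modal encoder in this style. All layer choices, dimensions, and names are illustrative assumptions, not the actual TRIBE v2 implementation.

```python
import torch
import torch.nn as nn

class TriModalBrainEncoder(nn.Module):
    """Toy sketch of a tri-modal brain encoder (illustrative only)."""

    def __init__(self, d_audio=768, d_video=1024, d_text=768,
                 d_model=512, n_subjects=700, n_voxels=70_000):
        super().__init__()
        # Stage 1: project pretrained audio/video/text embeddings
        # (assumed precomputed by frozen foundation models) into a
        # shared dimensionality.
        self.proj_audio = nn.Linear(d_audio, d_model)
        self.proj_video = nn.Linear(d_video, d_model)
        self.proj_text = nn.Linear(d_text, d_model)

        # Stage 2: a shared transformer integrates the modalities over
        # time into subject-agnostic representations.
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True)
        self.integrator = nn.TransformerEncoder(layer, num_layers=4)

        # Stage 3: one linear readout per subject maps the shared
        # representation onto that subject's fMRI voxels.
        self.readouts = nn.ModuleList(
            [nn.Linear(d_model, n_voxels) for _ in range(n_subjects)])

    def forward(self, audio, video, text, subject_id):
        # Each input: (batch, time, d_modality); summed after projection.
        x = (self.proj_audio(audio)
             + self.proj_video(video)
             + self.proj_text(text))
        x = self.integrator(x)               # (batch, time, d_model)
        return self.readouts[subject_id](x)  # (batch, time, n_voxels)


# Usage: predict voxel time courses for an unseen clip (toy sizes).
model = TriModalBrainEncoder(n_subjects=4, n_voxels=1000)
audio = torch.randn(1, 20, 768)   # 20 fMRI time points of audio features
video = torch.randn(1, 20, 1024)
text = torch.randn(1, 20, 768)
pred = model(audio, video, text, subject_id=3)
print(pred.shape)  # torch.Size([1, 20, 1000])
```

Keeping stage 3 as the only subject-specific component is what lets the shared first two stages be reused, and retrained, across datasets and cohorts.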

The model expands prediction coverage from approximately 1,000 cortical regions in the earlier version to around 70,000 voxels across the whole brain. This increase in spatial resolution allows for finer-grained mapping of neural activity patterns.

Noise reduction and “canonical” brain responses

fMRI data is inherently noisy due to physiological and technical artifacts. TRIBE v2 addresses this by learning shared signal structure across individuals, producing what can be described as a canonical response pattern.

In some evaluations, these predicted responses align more closely with group-averaged neural activity than individual fMRI scans, suggesting the model functions as a denoising layer over raw measurements.
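The denoising intuition can be reproduced in a few lines: average enough noisy measurements of a shared signal, and the group mean tracks the underlying "canonical" signal far better than any single scan does. The numbers below are simulated, not fMRI data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared "canonical" response to a stimulus, plus per-subject noise.
n_subjects, n_timepoints = 50, 300
canonical = rng.standard_normal(n_timepoints)
scans = canonical + 2.0 * rng.standard_normal((n_subjects, n_timepoints))

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

# A single noisy scan tracks the canonical signal only weakly...
print("single subject vs canonical:", corr(scans[0], canonical))

# ...while the group average recovers it far better, which is the
# sense in which a model trained on shared structure acts as a denoiser.
group_mean = scans.mean(axis=0)
print("group average vs canonical: ", corr(group_mean, canonical))
```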


Model performance reportedly increases log-linearly with additional training data, indicating that further improvements may depend on access to larger and more diverse fMRI datasets. 
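Log-linear scaling means performance grows roughly as a + b·log(data), so each doubling of training data buys a constant improvement. A quick sketch of fitting such a curve, with made-up scores:

```python
import numpy as np

# Hypothetical (made-up) encoding scores at increasing dataset sizes.
hours = np.array([25, 50, 100, 200, 400])
score = np.array([0.18, 0.22, 0.25, 0.29, 0.32])

# Fit score = a + b * log(hours); b is the gain per e-fold of data.
b, a = np.polyfit(np.log(hours), score, deg=1)
print(f"gain per doubling of data:   {b * np.log(2):.3f}")
print(f"extrapolated score at 800 h: {a + b * np.log(800):.3f}")
```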

Noninvasive BCI efforts have already shown how aggregated EEG, MEG, and fNIRS data can support language-level prediction with measurable zero-shot performance, while open benchmark efforts for speech and image reconstruction are starting to standardize evaluation across the field.

Cover image: TRIBE v2 interactive demo, Meta
