
A full run on the example dataset

This walkthrough takes a real example dataset — stimulus audio and fMRIPrep outputs — to the two products an encoding model needs: phonemic features and denoised BOLD, using one command per step. By the end you will have run the whole hypline pipeline and seen exactly what each step reads and writes.

It assumes you have hypline installed (see Installation, including FFmpeg for transcription). No prior hypline experience is needed, but skim The hypline dataset layout first if a path or filename below is ever unclear — this tutorial shows the layout in action rather than re-explaining it.

What to expect: about 15 minutes start to finish, most of it the one-time ~2.8 GB dataset download. Of the compute steps, transcription is the slowest at roughly a minute on a laptop CPU; the others run in seconds.

1. Get the example dataset

Download the example dataset and unpack it. Throughout this tutorial we call the unpacked dataset root data/.

# TODO: replace with the published Zenodo record on release
curl -L -o hypline-tutorial-data.zip "<ZENODO-DOI>"
unzip hypline-tutorial-data.zip -d data/

The dataset is a BIDS-style tree for one dyad — two partners (sub-031 and sub-032) who held a conversation while both were scanned. It already contains the inputs hypline needs: stimulus audio under stimuli/, raw events and BOLD under each sub-*/, and fMRIPrep outputs under derivatives/fmriprep/.

What this example dataset is — and isn't

It is a faithful subset of a real hyperscanning study (about 2.8 GB), trimmed so it is small enough to download and run quickly:

  • Two of the study's five runs are included (run-1, run-2). The per-run file set and the dyad structure are otherwise complete.
  • Audio is released for the reading-condition (R) trials only, for privacy. The events.tsv files still list every trial, so you will see audio for a subset of the events in each run — this is expected, not a packaging error (see step 2).

Otherwise the dataset mirrors the structure and design of the original study.

Every command below takes data/ as its only positional argument and discovers its inputs from the directory layout — you never pass individual file paths.
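If you want to confirm the unpacked root looks as described before running anything, a quick check along these lines may help. This is a minimal Python sketch, assuming only the three input locations named in this step (stimuli/, at least one sub-*/ directory, and derivatives/fmriprep/). The helper name missing_inputs is hypothetical and not part of hypline.

```python
from pathlib import Path


def missing_inputs(root):
    """Return the expected input locations that are absent under `root`.

    Hypothetical helper, not part of hypline. It checks only the locations
    this tutorial names for the example dataset.
    """
    root = Path(root)
    missing = []
    if not (root / "stimuli").is_dir():
        missing.append("stimuli/")
    if not (root / "derivatives" / "fmriprep").is_dir():
        missing.append("derivatives/fmriprep/")
    # Raw events and BOLD live under per-subject directories (sub-031, sub-032).
    if not any(p.is_dir() for p in root.glob("sub-*")):
        missing.append("sub-*/")
    return missing
```

For a correctly unpacked dataset, missing_inputs("data/") should return an empty list.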

2. Transcribe the audio

transcribe turns each stimulus .wav into a word-level transcript using a Whisper speech-recognition model.

hypline transcribe data/ --audio-ext .wav --model tiny
Transcribing dyad-030_ses-1_task-conv_run-1_trial-1_audio.wav
Transcribing dyad-030_ses-1_task-conv_run-1_trial-3_audio.wav
Transcribing dyad-030_ses-1_task-conv_run-2_trial-1_audio.wav
Transcribing dyad-030_ses-1_task-conv_run-2_trial-3_audio.wav

(Log lines are abridged here; a first run also prints a one-time model download and voice-activity-detection messages.)

Why --model tiny

tiny keeps this tutorial fast — about a minute on a laptop CPU, with a small one-time model download. It mis-hears some words, which is fine here: you are learning the workflow, not analyzing the transcripts. For a real analysis, omit --model to use the default large-v2 — far more accurate, but a multi-GB download and much slower on CPU (pass --device cuda if you have a GPU).

Notice that only four files are transcribed, not one per trial. That is the reading-condition subset from step 1: each run's audio covers only its R trials (trial-1, trial-3), so transcripts exist for those trials and not the others. The run's events.tsv still describes every trial; hypline simply transcribes the audio that is present.
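The filenames themselves say which run and trial each audio file belongs to. As an illustration, the Python sketch below splits the four logged names into their key-value parts and groups the trials by run. The parse_entities helper is hypothetical, written for this tutorial, and is not hypline's own filename parser.

```python
def parse_entities(filename):
    """Split a BIDS-style name into its key-value entities and its suffix."""
    stem = filename.split(".", 1)[0]      # drop the extension(s)
    *pairs, suffix = stem.split("_")      # the last chunk is the suffix
    entities = dict(pair.split("-", 1) for pair in pairs)
    return entities, suffix


# The four audio files named in the transcribe log above.
names = [
    "dyad-030_ses-1_task-conv_run-1_trial-1_audio.wav",
    "dyad-030_ses-1_task-conv_run-1_trial-3_audio.wav",
    "dyad-030_ses-1_task-conv_run-2_trial-1_audio.wav",
    "dyad-030_ses-1_task-conv_run-2_trial-3_audio.wav",
]

trials_by_run = {}
for name in names:
    entities, _ = parse_entities(name)
    trials_by_run.setdefault(entities["run"], []).append(entities["trial"])

print(trials_by_run)  # {'1': ['1', '3'], '2': ['1', '3']}
```

Both runs contribute trials 1 and 3 only, which matches the reading-condition subset.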

The transcripts land beside the audio, under a new transcript/ subdirectory:

data/stimuli/dyad-030/ses-1/transcript/
├── dyad-030_ses-1_task-conv_run-1_trial-1_transcript.csv
├── dyad-030_ses-1_task-conv_run-1_trial-3_transcript.csv
├── dyad-030_ses-1_task-conv_run-2_trial-1_transcript.csv
└── dyad-030_ses-1_task-conv_run-2_trial-3_transcript.csv

Each CSV is one row per word, with its timing and the partner who spoke it:

word,start_time,end_time,confidence_score,turn_sub
Thank,5.714,6.095,0.368,031
you.,6.195,6.416,0.326,031

These transcripts are dyad-keyed (dyad-030), because the conversation belongs to the pair, not to either partner. See Subject vs. dyad for why.
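To get a feel for the transcript format, you could tally how many words each partner spoke. The following is a small Python sketch using only the standard library and the column names from the header above. The function name words_per_speaker is an illustrative choice, and the sample text stands in for a real transcript file.

```python
import csv
import io
from collections import Counter


def words_per_speaker(csv_file):
    """Count transcript rows (one per word) for each `turn_sub` value."""
    reader = csv.DictReader(csv_file)
    return Counter(row["turn_sub"] for row in reader)


# The two sample rows shown above, used in place of a real transcript file.
sample = io.StringIO(
    "word,start_time,end_time,confidence_score,turn_sub\n"
    "Thank,5.714,6.095,0.368,031\n"
    "you.,6.195,6.416,0.326,031\n"
)
print(words_per_speaker(sample))  # Counter({'031': 2})
```

The csv module keeps turn_sub as the string "031". A reader that infers numeric types would likely turn it into 31 and drop the leading zero, so read that column as text if you match it against sub-031 later.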

Check

ls data/stimuli/dyad-030/ses-1/transcript/ lists four _transcript.csv files, and each opens with the word,start_time,end_time,… header above. No transcripts means the audio was not found — confirm you passed --audio-ext .wav and that data/ is the unpacked dataset root.

3. Generate phonemic features

featuregen phonemic reads those transcripts and computes a phonemic feature for each — a per-word representation that becomes a predictor in the encoding model.

hypline featuregen phonemic data/
Generating phonemic features for dyad-030_ses-1_task-conv_run-1_trial-1_transcript.csv
...
Generating phonemic confounds for dyad-030_ses-1_task-conv_run-1_trial-1_feat-phonemic.parquet
...

By default this step also generates the matching phonemic confounds — speech-onset and speech-rate regressors derived from the same features — so you get both in one command. (Pass --skip-confoundgen to suppress that, or run confoundgen phonemic on its own later.)

Two new areas appear, both dyad-keyed:

data/
├── features/dyad-030/ses-1/phonemic/
│   └── dyad-030_ses-1_task-conv_run-1_trial-1_feat-phonemic.parquet   # … one per transcript (4)
└── confounds/dyad-030/ses-1/
    ├── phonemic-onset/
    │   └── dyad-030_ses-1_task-conv_run-1_trial-1_conf-phonemic_desc-onset.parquet   # … (4)
    └── phonemic-rate/
        └── dyad-030_ses-1_task-conv_run-1_trial-1_conf-phonemic_desc-rate.parquet    # … (4)

The two confound flavors live in their own subdirectories because they are desc variants of the same conf-phonemic kind — see Variants with desc.

That completes the stimulus branch: from audio to the features (and confounds) the encoding model uses as predictors.

Check

You should have four feat-phonemic.parquet files under features/, plus four files in each of the phonemic-onset/ and phonemic-rate/ confound subdirectories — one per transcript from step 2.
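The naming is regular enough that you can derive the three expected outputs from a transcript's filename. The Python sketch below mirrors the listings in this step using plain path manipulation. It is not hypline's own logic, and expected_outputs is a hypothetical helper.

```python
from pathlib import PurePosixPath


def expected_outputs(transcript_name):
    """Map a transcript filename to the feature and confound paths shown above.

    Paths are relative to the dataset root. This only mirrors the tutorial's
    directory listings; it is not how hypline computes its output paths.
    """
    stem = transcript_name.removesuffix("_transcript.csv")  # Python 3.9+
    dyad, ses = stem.split("_")[:2]  # e.g. "dyad-030", "ses-1"
    features = PurePosixPath("features", dyad, ses, "phonemic")
    confounds = PurePosixPath("confounds", dyad, ses)
    return [
        features / f"{stem}_feat-phonemic.parquet",
        confounds / "phonemic-onset" / f"{stem}_conf-phonemic_desc-onset.parquet",
        confounds / "phonemic-rate" / f"{stem}_conf-phonemic_desc-rate.parquet",
    ]
```

Joining each result onto data/ and testing whether the file exists gives a per-transcript version of the check above.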

4. Denoise the BOLD

The other branch cleans the BOLD signal — the encoding model's target. denoise reads fMRIPrep's preprocessed BOLD and regresses out nuisance signals you select from fMRIPrep's own confounds table.

hypline denoise data/ \
  --columns trans_x,trans_y,trans_z,rot_x,rot_y,rot_z,cosine
Denoising starting: sub-031_ses-1_task-conv_run-1_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz
Denoising complete: sub-031_ses-1_task-conv_run-1_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz
...

Here --columns names confound columns from fMRIPrep's table: the six head-motion parameters (trans_*, rot_*) plus cosine, a prefix that expands to every cosine-drift regressor. We did not pass --space, so denoise cleaned the default volumetric space (MNI152NLin2009cAsym) — the main target for most analyses.
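To make the prefix behavior concrete, here is a Python sketch of one plausible resolution rule: an exact column name wins, and anything else is treated as a prefix. That rule is an assumption made for illustration, and hypline's actual resolution may differ. The function name resolve_columns is hypothetical.

```python
def resolve_columns(selectors, available):
    """Expand each selector to the table columns it names.

    Assumed rule, for illustration only: an exact column name wins;
    otherwise the selector is treated as a prefix.
    """
    resolved = []
    for selector in selectors:
        if selector in available:
            matches = [selector]
        else:
            matches = [col for col in available if col.startswith(selector)]
        if not matches:
            raise KeyError(f"no confound column matches {selector!r}")
        resolved.extend(matches)
    return resolved


# Column names as fMRIPrep typically writes them. The number of cosine
# regressors varies with run length, so three here is just an example.
table_columns = [
    "trans_x", "trans_x_derivative1", "trans_y", "trans_z",
    "rot_x", "rot_y", "rot_z",
    "cosine00", "cosine01", "cosine02",
]
print(resolve_columns(["trans_x", "cosine"], table_columns))
# ['trans_x', 'cosine00', 'cosine01', 'cosine02']
```

Under this rule trans_x selects only the exact column, not trans_x_derivative1, while cosine matches no exact column and so expands to all three cosine regressors.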

This step is sub-keyed: it processes each partner's brain (sub-031, sub-032) independently, so all four run × subject combinations are denoised. The output goes to hypline's own derivatives tree, leaving fMRIPrep's untouched:

data/derivatives/hypline/sub-031/ses-1/func/
├── sub-031_ses-1_task-conv_run-1_space-MNI152NLin2009cAsym_desc-denoised_bold.nii.gz
└── sub-031_ses-1_task-conv_run-1_space-MNI152NLin2009cAsym_desc-denoised_bold.json

The same pair is written for each run and subject — sub-031 and sub-032, run-1 and run-2.

Each denoised BOLD carries a .json sidecar recording exactly how it was made — the desc-preproc source it came from, the resolved regressor columns, and the hypline version — so the result is reproducible. See the denoise reference for CompCor selectors, custom nuisance/ regressors, and surface spaces.

Check

derivatives/hypline/ now holds a desc-denoised .nii.gz + .json pair for each subject and run: eight files total (2 subjects × 2 runs × 2 files). If the command logged "No subjects found", check that derivatives/fmriprep/ unpacked correctly under data/.
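If you would rather check this programmatically, the Python sketch below enumerates the eight expected files and reports any that are absent. The subject, session, task, run, and space labels are the ones used in this tutorial's example dataset. The helper missing_denoised is hypothetical and not part of hypline.

```python
from itertools import product
from pathlib import Path


def missing_denoised(root, subjects=("031", "032"), runs=("1", "2")):
    """List expected desc-denoised outputs that are not present under `root`."""
    missing = []
    for sub, run, ext in product(subjects, runs, (".nii.gz", ".json")):
        name = (
            f"sub-{sub}_ses-1_task-conv_run-{run}"
            f"_space-MNI152NLin2009cAsym_desc-denoised_bold{ext}"
        )
        path = Path(root, "derivatives", "hypline", f"sub-{sub}", "ses-1", "func", name)
        if not path.is_file():
            missing.append(path)
    return missing
```

After a successful step 4, missing_denoised("data/") should return an empty list.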

5. (Optional) Add a custom nuisance regressor

So far denoise pulled every regressor from fMRIPrep's confounds table via --columns. The other channel is custom nuisance files under nuisance/ — run-level regressors you supply yourself that fMRIPrep never produced (e.g. physiological recordings). The example dataset ships a small set so you can try this path:

data/nuisance/sub-031/ses-1/demo/
└── sub-031_ses-1_task-conv_run-1_nuis-demo_timeseries.tsv   # … one per subject × run (4)

These are synthetic

The shipped nuis-demo files hold synthetic placeholder regressors (demo_regressor1, demo_regressor2) — not real signals — so the tutorial can exercise --custom-sources without needing physiological data. In a real analysis you author these yourself; the denoise reference documents the nuisance/ file format under --custom-sources.

Re-run denoise adding the custom source. --custom-sources names the kind subdirectory under nuisance/ (here demo, as in nuisance/sub-031/ses-1/demo/) and --custom-columns selects columns from its files; the two go together:

hypline denoise data/ \
  --columns trans_x,trans_y,trans_z,rot_x,rot_y,rot_z,cosine \
  --custom-sources demo --custom-columns demo_regressor1,demo_regressor2 \
  --force

The synthetic regressors are now stacked with the fMRIPrep columns into one regressor matrix. We pass --force because step 4 already wrote desc-denoised outputs — without it, denoise skips files that exist. The .json sidecar now also records custom_sources and custom_columns, so the result stays reproducible.
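One way to confirm what a sidecar recorded is to read it back. This Python sketch assumes only the custom_columns key named above and makes no assumption about the sidecar's other fields or about how the value is stored. The helper name sidecar_custom_columns is hypothetical.

```python
import json
from pathlib import Path


def sidecar_custom_columns(sidecar_path):
    """Return the `custom_columns` entry of a denoise sidecar, or None if absent."""
    with Path(sidecar_path).open(encoding="utf-8") as fh:
        sidecar = json.load(fh)
    # Only this key is taken from the tutorial; other fields are left alone.
    return sidecar.get("custom_columns")
```

Called on a sidecar written in step 4, before any custom source was added, this would be expected to return None or an empty value.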

Check

The same eight desc-denoised files are rewritten, and each sidecar's custom_columns lists demo_regressor1 and demo_regressor2. If denoise stops with the error "Nuisance column name collision across channels", a custom column name collides with a selected fMRIPrep column; rename or drop one.
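A collision simply means the same name was selected in both channels, so you can spot one before running denoise. Below is a minimal Python sketch with a hypothetical helper, colliding_names. Whether hypline compares the selectors as typed or the fully resolved column names is not stated here, so treat this as an approximation.

```python
def colliding_names(fmriprep_columns, custom_columns):
    """Return the names that appear in both channels, sorted."""
    return sorted(set(fmriprep_columns) & set(custom_columns))


selected = ["trans_x", "trans_y", "trans_z", "rot_x", "rot_y", "rot_z"]
print(colliding_names(selected, ["demo_regressor1", "demo_regressor2"]))  # []
print(colliding_names(selected, ["trans_x", "demo_regressor2"]))          # ['trans_x']
```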

What you have now

data/ now holds both sides an encoding model joins:

Side        Where                                From
Predictors  features/dyad-030/…/phonemic/        steps 2–3
Target      derivatives/hypline/sub-*/…/func/    step 4

Each command took only the dataset root and discovered its inputs from the layout, including files written by earlier steps; you never passed a file path. To regenerate a step after changing an option, re-run it with --force; without it, hypline skips outputs that already exist.
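The skip rule is easy to state in code. The Python sketch below expresses the behavior as described here and is not hypline's implementation. The function name should_write is hypothetical.

```python
from pathlib import Path


def should_write(output_path, force=False):
    """Decide whether a step should (re)generate `output_path`.

    Existing outputs are skipped unless `force` is set.
    """
    return force or not Path(output_path).exists()
```

This is why step 5 needed --force: the desc-denoised files from step 4 already existed, so without the flag they would have been skipped rather than rewritten.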

Where to go next