A full run on the example dataset
This walkthrough takes a real example dataset — stimulus audio and fMRIPrep outputs — to the two products an encoding model needs: phonemic features and denoised BOLD, using one command per step. By the end you will have run the whole hypline pipeline and seen exactly what each step reads and writes.
It assumes you have hypline installed (see Installation, including FFmpeg for transcription). No prior hypline experience is needed, but skim The hypline dataset layout first if a path or filename below is ever unclear — this tutorial shows the layout in action rather than re-explaining it.
What to expect: about 15 minutes start to finish — most of it the one-time ~2.8 GB dataset download. Of the compute, transcription is the slow step — roughly a minute on a laptop CPU; the rest run in seconds.
1. Get the example dataset
Download the example dataset and unpack it. Throughout this tutorial we call the
unpacked dataset root data/.
# TODO: replace with the published Zenodo record on release
curl -L -o hypline-tutorial-data.zip "<ZENODO-DOI>"
unzip hypline-tutorial-data.zip -d data/
The dataset is a BIDS-style tree for one dyad —
two partners (sub-031 and sub-032) who held a conversation while both were
scanned. It already contains the inputs hypline needs: stimulus audio under
stimuli/, raw events and BOLD under each sub-*/, and fMRIPrep outputs under
derivatives/fmriprep/.
What this example dataset is — and isn't
It is a faithful subset of a real hyperscanning study (about 2.8 GB), trimmed so it is small enough to download and run quickly:
- Two of the study's five runs are included (run-1, run-2). The per-run file set and the dyad structure are otherwise complete.
- Audio is released for the reading-condition (R) trials only, for privacy. The events.tsv files still list every trial, so you will see audio for a subset of the events in each run. This is expected, not a packaging error (see step 2).
Otherwise the dataset mirrors the structure and design of the original study.
Every command below takes data/ as its only positional argument and discovers
its inputs from the directory layout — you never pass individual file paths.
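To make "discovers its inputs from the directory layout" concrete, here is a minimal Python sketch of that kind of discovery. It is an illustration only, not hypline's implementation: the helper name `find_stimulus_audio` and the recursive search under stimuli/ are assumptions.

```python
from pathlib import Path


def find_stimulus_audio(root, audio_ext=".wav"):
    """Return sorted audio files found anywhere under <root>/stimuli/.

    Hypothetical helper: hypline's real discovery logic may differ.
    """
    stimuli = Path(root) / "stimuli"
    if not stimuli.is_dir():
        return []  # nothing to discover if the layout is missing
    return sorted(p for p in stimuli.rglob(f"*{audio_ext}") if p.is_file())


if __name__ == "__main__":
    # "data" is the unpacked dataset root used throughout this tutorial.
    for path in find_stimulus_audio("data"):
        print(path.name)
```

An empty result from a sketch like this usually points to a wrong root or extension, which is the same failure the Check in step 2 describes.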
2. Transcribe the audio
transcribe turns each stimulus .wav into a word-level transcript using a
Whisper speech-recognition model.
hypline transcribe data/ --audio-ext .wav --model tiny
Transcribing dyad-030_ses-1_task-conv_run-1_trial-1_audio.wav
Transcribing dyad-030_ses-1_task-conv_run-1_trial-3_audio.wav
Transcribing dyad-030_ses-1_task-conv_run-2_trial-1_audio.wav
Transcribing dyad-030_ses-1_task-conv_run-2_trial-3_audio.wav
(Log lines are abridged here; a first run also prints a one-time model download and voice-activity-detection messages.)
Why --model tiny
tiny keeps this tutorial fast — about a minute on a laptop CPU, with a small
one-time model download. It mis-hears some words, which is fine here: you are
learning the workflow, not analyzing the transcripts. For a real analysis,
omit --model to use the default large-v2 — far more accurate, but a
multi-GB download and much slower on CPU (pass --device cuda if you have a
GPU).
Notice that only four files are transcribed, two per run, rather than one per trial. That is the
reading-condition subset from step 1: each run's
audio covers only its R trials (trial-1, trial-3), so transcripts exist for
those trials and not the others. The run's events.tsv still describes every
trial — hypline simply transcribes the audio that is present.
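The filenames themselves carry the run and trial labels as key-value pairs, so you can confirm the subset without opening any file. The sketch below uses a hypothetical helper, `parse_entities`, written for this tutorial; it is not part of hypline and assumes values never contain an underscore.

```python
def parse_entities(filename):
    """Split a name like 'dyad-030_ses-1_..._trial-1_audio.wav' into parts.

    Returns (entities, suffix): entities maps keys such as 'run' to string
    values; suffix is the trailing label that has no key (here 'audio').
    Hypothetical helper for illustration only.
    """
    stem = filename.split(".", 1)[0]  # drop the extension(s)
    *pairs, suffix = stem.split("_")
    entities = dict(pair.split("-", 1) for pair in pairs)
    return entities, suffix


# The four files named in the transcribe log above.
names = [
    "dyad-030_ses-1_task-conv_run-1_trial-1_audio.wav",
    "dyad-030_ses-1_task-conv_run-1_trial-3_audio.wav",
    "dyad-030_ses-1_task-conv_run-2_trial-1_audio.wav",
    "dyad-030_ses-1_task-conv_run-2_trial-3_audio.wav",
]

trials_by_run = {}
for name in names:
    entities, _ = parse_entities(name)
    trials_by_run.setdefault(entities["run"], []).append(entities["trial"])

print(trials_by_run)  # {'1': ['1', '3'], '2': ['1', '3']}
```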
The transcripts land beside the audio, under a new transcript/ subdirectory:
data/stimuli/dyad-030/ses-1/transcript/
├── dyad-030_ses-1_task-conv_run-1_trial-1_transcript.csv
├── dyad-030_ses-1_task-conv_run-1_trial-3_transcript.csv
├── dyad-030_ses-1_task-conv_run-2_trial-1_transcript.csv
└── dyad-030_ses-1_task-conv_run-2_trial-3_transcript.csv
Each CSV is one row per word, with its timing and the partner who spoke it:
word,start_time,end_time,confidence_score,turn_sub
Thank,5.714,6.095,0.368,031
you.,6.195,6.416,0.326,031
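If you want to inspect a transcript programmatically, the standard-library csv module is enough. The sketch below parses the two sample rows shown above (inlined so it runs on its own); `read_transcript` is a hypothetical helper, not a hypline function. Note that turn_sub is kept as text, because converting it to a number would turn 031 into 31.

```python
import csv
import io

# The header and two sample rows shown above, inlined for a standalone run.
SAMPLE = """word,start_time,end_time,confidence_score,turn_sub
Thank,5.714,6.095,0.368,031
you.,6.195,6.416,0.326,031
"""


def read_transcript(handle):
    """Parse transcript rows, converting only the numeric columns."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "word": row["word"],
            "start_time": float(row["start_time"]),
            "end_time": float(row["end_time"]),
            "confidence_score": float(row["confidence_score"]),
            "turn_sub": row["turn_sub"],  # keep as text so '031' keeps its zero
        })
    return rows


rows = read_transcript(io.StringIO(SAMPLE))
for row in rows:
    duration = row["end_time"] - row["start_time"]
    print(f'{row["word"]!r} spoken by sub-{row["turn_sub"]}: {duration:.3f} s')
```

To read a real transcript instead, pass `open(path, newline="")` in place of the `io.StringIO` object.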
These transcripts are dyad-keyed (dyad-030), because the conversation
belongs to the pair, not to either partner. See
Subject vs. dyad for why.
Check
ls data/stimuli/dyad-030/ses-1/transcript/ lists four _transcript.csv
files, and each opens with the word,start_time,end_time,… header above. No
transcripts means the audio was not found — confirm you passed --audio-ext
.wav and that data/ is the unpacked dataset root.
3. Generate phonemic features
featuregen phonemic reads those transcripts and computes a phonemic feature
for each — a per-word representation that becomes a predictor in the encoding
model.
hypline featuregen phonemic data/
Generating phonemic features for dyad-030_ses-1_task-conv_run-1_trial-1_transcript.csv
...
Generating phonemic confounds for dyad-030_ses-1_task-conv_run-1_trial-1_feat-phonemic.parquet
...
By default this step also generates the matching phonemic confounds —
speech-onset and speech-rate regressors derived from the same features — so you
get both in one command. (Pass --skip-confoundgen to suppress that, or run
confoundgen phonemic on its own later.)
Two new areas appear, both dyad-keyed:
data/
├── features/dyad-030/ses-1/phonemic/
│ └── dyad-030_ses-1_task-conv_run-1_trial-1_feat-phonemic.parquet # … one per transcript (4)
└── confounds/dyad-030/ses-1/
├── phonemic-onset/
│ └── dyad-030_ses-1_task-conv_run-1_trial-1_conf-phonemic_desc-onset.parquet # … (4)
└── phonemic-rate/
└── dyad-030_ses-1_task-conv_run-1_trial-1_conf-phonemic_desc-rate.parquet # … (4)
The two confound flavors live in their own subdirectories because they are
desc variants of the same conf-phonemic kind — see
Variants with desc.
That completes the stimulus branch: from audio to the features (and confounds) the encoding model uses as predictors.
Check
You should have four feat-phonemic.parquet files under features/, plus
four files in each of the phonemic-onset/ and phonemic-rate/ confound
subdirectories — one per transcript from step 2.
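The same check can be scripted. The sketch below counts files using glob patterns taken from the directory tree shown in this step; `count_outputs` is a hypothetical helper written for this tutorial, not a hypline command.

```python
from pathlib import Path

# Glob patterns (relative to the dataset root) for the step 3 outputs,
# copied from the directory tree shown in this step.
EXPECTED_PATTERNS = {
    "features": "features/dyad-*/ses-*/phonemic/*_feat-phonemic.parquet",
    "onset confounds": "confounds/dyad-*/ses-*/phonemic-onset/*_desc-onset.parquet",
    "rate confounds": "confounds/dyad-*/ses-*/phonemic-rate/*_desc-rate.parquet",
}


def count_outputs(root):
    """Count files matching each pattern under the dataset root."""
    root = Path(root)
    return {label: len(list(root.glob(pattern)))
            for label, pattern in EXPECTED_PATTERNS.items()}


if __name__ == "__main__":
    for label, count in count_outputs("data").items():
        print(f"{label}: {count} (expected 4)")
```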
4. Denoise the BOLD
The other branch cleans the BOLD signal — the encoding model's target.
denoise reads fMRIPrep's preprocessed BOLD and regresses out nuisance signals
you select from fMRIPrep's own confounds table.
hypline denoise data/ \
--columns trans_x,trans_y,trans_z,rot_x,rot_y,rot_z,cosine
Denoising starting: sub-031_ses-1_task-conv_run-1_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz
Denoising complete: sub-031_ses-1_task-conv_run-1_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz
...
Here --columns names confound columns from fMRIPrep's table: the six head-motion
parameters (trans_*, rot_*) plus cosine, a prefix that expands to every
cosine-drift regressor. We did not pass --space, so denoise cleaned the
default volumetric space (MNI152NLin2009cAsym) — the main target for most
analyses.
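To illustrate how a prefix such as cosine can expand to several columns, here is a small sketch. The matching rule (an exact name wins, otherwise the selector is treated as a prefix) is an assumption made for this example, not hypline's confirmed behavior, and `resolve_columns` is a hypothetical helper.

```python
def resolve_columns(selectors, available):
    """Expand each selector to matching column names, in table order.

    Assumed rule (not confirmed from hypline's source): an exact name match
    wins; otherwise the selector is treated as a prefix.
    """
    resolved = []
    for selector in selectors:
        if selector in available:
            matches = [selector]
        else:
            matches = [name for name in available if name.startswith(selector)]
        if not matches:
            raise ValueError(f"no confound column matches {selector!r}")
        resolved.extend(name for name in matches if name not in resolved)
    return resolved


# Example column names of the kind found in an fMRIPrep confounds table.
available = ["trans_x", "trans_x_derivative1", "trans_y", "trans_z",
             "rot_x", "rot_y", "rot_z",
             "cosine00", "cosine01", "cosine02", "csf", "white_matter"]
selected = resolve_columns(
    ["trans_x", "trans_y", "trans_z", "rot_x", "rot_y", "rot_z", "cosine"],
    available,
)
print(selected)
```

Checking for an exact match first keeps trans_x from also pulling in trans_x_derivative1, while cosine, which is not itself a column name, picks up every cosine-drift column.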
This step is sub-keyed: it processes each partner's brain (sub-031,
sub-032) independently, so all four run × subject combinations are denoised.
The output goes to hypline's own derivatives tree, leaving fMRIPrep's untouched:
data/derivatives/hypline/sub-031/ses-1/func/
├── sub-031_ses-1_task-conv_run-1_space-MNI152NLin2009cAsym_desc-denoised_bold.nii.gz
└── sub-031_ses-1_task-conv_run-1_space-MNI152NLin2009cAsym_desc-denoised_bold.json
The same pair is written for each run and subject — sub-031 and sub-032,
run-1 and run-2.
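Conceptually, "regressing out" nuisance signals means fitting the selected confound columns to each voxel's time series by least squares and keeping the residual. The pure-Python sketch below shows this for one synthetic voxel and one synthetic regressor. It is not hypline's code: the intercept term, the helper names, and the toy numbers are all assumptions, and a real implementation would work on whole images with a numerical library.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def regress_out(signal, regressors):
    """Return residuals after a least-squares fit on regressors + intercept.

    Assumption: an intercept column is included; hypline may differ.
    """
    cols = [[1.0] * len(signal)] + [list(r) for r in regressors]
    k = len(cols)
    gram = [[sum(p * q for p, q in zip(cols[i], cols[j])) for j in range(k)]
            for i in range(k)]
    rhs = [sum(p * y for p, y in zip(cols[i], signal)) for i in range(k)]
    beta = solve_linear(gram, rhs)
    fitted = [sum(beta[i] * cols[i][t] for i in range(k))
              for t in range(len(signal))]
    return [y - f for y, f in zip(signal, fitted)]


# Toy data: a voxel whose signal is mostly explained by one motion trace.
motion = [0.0, 1.0, 2.0, 3.0, 4.0]
signal = [5.5, 6.5, 9.5, 10.5, 13.0]
cleaned = regress_out(signal, [motion])
print([round(value, 3) for value in cleaned])
```

By construction the residual is uncorrelated with every regressor that was fitted, which is the property denoising relies on.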
Each denoised BOLD carries a .json sidecar recording exactly how it was made —
the desc-preproc source it came from, the resolved regressor columns, and the
hypline version — so the result is reproducible. See the
denoise reference for CompCor selectors, custom
nuisance/ regressors, and surface spaces.
Check
derivatives/hypline/ now holds a desc-denoised .nii.gz + .json pair
for each subject and run — eight files total (2 subjects × 2 runs × 2). If
the command logged No subjects found, check that derivatives/fmriprep/
unpacked correctly under data/.
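A short script can confirm that every denoised image has its sidecar and show what the sidecar records. The sketch below only lists the top-level keys it finds, because the exact key names are not assumed here; `check_denoised` is a hypothetical helper, and the glob pattern follows the output tree shown above.

```python
import json
from pathlib import Path


def check_denoised(root):
    """Return (image_name, sidecar_keys) pairs for each denoised image.

    sidecar_keys is None when the matching .json file is missing.
    Hypothetical helper for illustration only.
    """
    results = []
    hypline_dir = Path(root) / "derivatives" / "hypline"
    pattern = "sub-*/ses-*/func/*_desc-denoised_bold.nii.gz"
    for image in sorted(hypline_dir.glob(pattern)):
        sidecar = image.with_name(image.name.replace(".nii.gz", ".json"))
        if sidecar.is_file():
            keys = sorted(json.loads(sidecar.read_text()))
        else:
            keys = None
        results.append((image.name, keys))
    return results


if __name__ == "__main__":
    for name, keys in check_denoised("data"):
        status = "missing sidecar" if keys is None else ", ".join(keys)
        print(f"{name}: {status}")
```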
5. (Optional) Add a custom nuisance regressor
So far denoise pulled every regressor from fMRIPrep's confounds table via
--columns. The other channel is custom nuisance files under nuisance/ —
run-level regressors you supply yourself that fMRIPrep never produced (e.g.
physiological recordings). The example dataset ships a small set so you can try
this path:
data/nuisance/sub-031/ses-1/demo/
└── sub-031_ses-1_task-conv_run-1_nuis-demo_timeseries.tsv # … one per subject × run (4)
These are synthetic
The shipped nuis-demo files hold synthetic placeholder regressors
(demo_regressor1, demo_regressor2) — not real signals — so the tutorial
can exercise --custom-sources without needing physiological data. In a real
analysis you author these yourself; the
denoise reference documents the
nuisance/ file format under --custom-sources.
Re-run denoise adding the custom source. --custom-sources names the
nuisance/<kind>/ directory and --custom-columns selects columns from it; the
two go together:
hypline denoise data/ \
--columns trans_x,trans_y,trans_z,rot_x,rot_y,rot_z,cosine \
--custom-sources demo --custom-columns demo_regressor1,demo_regressor2 \
--force
The synthetic regressors are now stacked with the fMRIPrep columns into one
regressor matrix. We pass --force because step 4 already
wrote desc-denoised outputs — without it, denoise skips files that exist. The
.json sidecar now also records custom_sources and custom_columns, so the
result stays reproducible.
Check
The same eight desc-denoised files are rewritten, and each sidecar's
custom_columns lists demo_regressor1, demo_regressor2. A
Nuisance column name collision across channels error means a custom column
name collides with a selected fMRIPrep column — rename or drop one.
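The collision condition amounts to a name appearing in both channels. The sketch below shows that comparison; `find_collisions` is a hypothetical helper, and whether hypline compares against the expanded fMRIPrep column names in exactly this way is an assumption.

```python
def find_collisions(fmriprep_columns, custom_columns):
    """Return custom column names that also appear among fMRIPrep columns."""
    taken = set(fmriprep_columns)
    return [name for name in custom_columns if name in taken]


# fMRIPrep-side names for this example (cosine shown already expanded).
fmriprep_columns = ["trans_x", "trans_y", "trans_z",
                    "rot_x", "rot_y", "rot_z", "cosine00", "cosine01"]

# The tutorial's custom columns do not clash with anything selected.
print(find_collisions(fmriprep_columns, ["demo_regressor1", "demo_regressor2"]))  # []

# A custom column that reuses an fMRIPrep name would be reported.
print(find_collisions(fmriprep_columns, ["demo_regressor1", "trans_x"]))  # ['trans_x']
```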
What you have now
data/ now holds both sides an encoding model joins:
| Side | Where | From |
|---|---|---|
| Predictors | features/dyad-030/…/phonemic/ | steps 2–3 |
| Target | derivatives/hypline/sub-*/…/func/ | step 4 |
Each command took only the dataset root and found its own inputs there: either
files shipped with the dataset or outputs written by an earlier step. You never
passed a file path. To regenerate a step after changing an option, re-run it with
--force; without it, hypline skips outputs that already exist.
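The skip-unless-forced behavior can be pictured as a one-line decision per output file. This is a sketch of the general pattern under that description, not hypline's actual code; `should_write` is a hypothetical name.

```python
from pathlib import Path


def should_write(output_path, force=False):
    """Decide whether a step should (re)create an output file.

    Illustrative only: write when forced, or when nothing exists yet.
    """
    return force or not Path(output_path).exists()


if __name__ == "__main__":
    target = "data/derivatives/hypline/example-output.json"  # made-up path
    print(should_write(target))              # True unless the file exists
    print(should_write(target, force=True))  # always True
```

The practical consequence is that changing an option and re-running without --force leaves the earlier outputs in place.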
Where to go next
- Process only some runs or conditions — Filter to specific runs or conditions.
- Regenerate outputs after a fix — Regenerate outputs.
- Per-command options — the Reference pages.