hypline transcribe

Transcribe stimulus audio into word-level transcripts using a Whisper speech-recognition model. Each transcript records, for every word, the time it was spoken — the timing that later feeds featuregen.

hypline transcribe <dataset-root> --audio-ext <ext> [OPTIONS]

FFmpeg required

Audio is decoded through FFmpeg, which must be on your PATH. Any FFmpeg-supported format works (WAV, MP3, M4A, FLAC, video containers, …) — pass its extension to --audio-ext.
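
A quick pre-flight check can confirm that FFmpeg is reachable before a long transcription job starts. The sketch below is not part of hypline; the helper name ffmpeg_available is made up for illustration, and it relies only on Python's standard shutil.which.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable can be found on PATH."""
    # Hypothetical helper, not a hypline function.
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found at", shutil.which("ffmpeg"))
    else:
        print("ffmpeg not found on PATH; install it before running hypline transcribe")
```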

Inputs

Stimulus audio files under the stimuli/ area, with the _audio suffix:

<dataset-root>/stimuli/dyad-030/ses-1/audio/
└── dyad-030_ses-1_task-conv_run-1_audio.wav

See The hypline dataset layout for how files are named and discovered.
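
To preview which files a given --audio-ext would likely match, a small script can walk stimuli/ and list names ending in _audio plus the extension. This is a rough sketch based only on the example path above; find_audio is a hypothetical helper, and hypline's actual discovery rules are the ones described on the layout page.

```python
from pathlib import Path


def find_audio(dataset_root: Path, audio_ext: str) -> list[Path]:
    """List files under stimuli/ whose names end in `_audio<ext>`.

    Hypothetical helper: it mirrors the naming shown in the example path,
    not hypline's own discovery code.
    """
    stimuli = Path(dataset_root) / "stimuli"
    # audio_ext is expected with its leading dot, e.g. ".wav"
    return sorted(stimuli.rglob(f"*_audio{audio_ext}"))


if __name__ == "__main__":
    for path in find_audio(Path("data"), ".wav"):
        print(path)
```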

Options

Option           Description                                                      Default
--audio-ext      Extension of the audio files, e.g. .wav                          (required)
--model          Whisper model: tiny, base, small, medium, large-v2, large-v3     large-v2
--model-dir      Where to find/download model weights                             ~/.cache/hypline/whisperx
--device         Hardware target: cpu or cuda                                     cpu
--dyad-ids       Comma-separated dyad IDs to process; omit for all                all
--data-filters   Narrow to specific runs/conditions (see Segments and metadata)   none
--force          Overwrite existing transcripts (default skips them)              off

Model size vs. speed

Larger models are more accurate but slower. On a GPU, pass --device cuda. The first run downloads the chosen model to --model-dir (default ~/.cache/hypline/whisperx); later runs reuse it.

Relocating the cache

The ~/.cache/hypline root is shared by every command that downloads model weights (transcribe, featuregen semantic, featuregen spectral). Set HYPLINE_CACHE to move that root, e.g. HYPLINE_CACHE=/scratch/$USER/hypline on a cluster with a small home quota. An explicit --model-dir still takes precedence over both the default location and HYPLINE_CACHE.
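
The precedence described here (explicit --model-dir, then HYPLINE_CACHE, then the default) can be sketched as a small resolver. This is an illustration and not hypline's code: resolve_model_dir is a hypothetical name, and it assumes the whisperx subdirectory keeps its name under a relocated cache root.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_model_dir(
    model_dir: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick the Whisper weights directory by precedence (hypothetical helper)."""
    env = os.environ if env is None else env

    # 1. An explicit --model-dir wins outright.
    if model_dir:
        return Path(model_dir).expanduser()

    # 2. HYPLINE_CACHE moves the shared cache root.
    #    Assumption: the "whisperx" subdirectory is kept under the new root.
    cache_root = env.get("HYPLINE_CACHE")
    if cache_root:
        return Path(cache_root).expanduser() / "whisperx"

    # 3. Otherwise fall back to the documented default.
    return Path("~/.cache/hypline/whisperx").expanduser()
```

Passing the environment as a mapping keeps the function easy to exercise without touching the real process environment.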

Example

Transcribe every dyad's WAV audio with the default model:

hypline transcribe data/ --audio-ext .wav

Transcribe only dyad 030 on a GPU (pass more as a comma-separated list):

hypline transcribe data/ --audio-ext .wav --dyad-ids 030 --device cuda

Outputs

One word-level transcript per audio file, with the _transcript suffix, written to a transcript/ directory next to audio/ under stimuli/:

<dataset-root>/stimuli/dyad-030/ses-1/
├── audio/
│   └── dyad-030_ses-1_task-conv_run-1_audio.wav
└── transcript/
    └── dyad-030_ses-1_task-conv_run-1_transcript.csv
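
The mapping from an audio file to its transcript, as shown in the tree above, can be written down in a few lines. This is a sketch inferred from that one example; transcript_path_for is a hypothetical helper, not a hypline function.

```python
from pathlib import Path


def transcript_path_for(audio_path: Path) -> Path:
    """Derive the transcript path implied by the example layout.

    Hypothetical helper: swaps the audio/ directory for transcript/ and the
    `_audio<ext>` ending for `_transcript.csv`.
    """
    audio_path = Path(audio_path)
    stem = audio_path.stem  # file name without its extension
    if not stem.endswith("_audio"):
        raise ValueError(f"expected an _audio suffix: {audio_path.name}")
    new_name = stem[: -len("_audio")] + "_transcript.csv"
    return audio_path.parent.parent / "transcript" / new_name
```

A helper like this can be handy for checking which transcripts already exist before deciding whether --force is needed.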

Each transcript row is one word with its onset time. These onsets are what featuregen phonemic reads to place features on the timeline.

Un-timed words

Whisper occasionally emits a token it cannot place in time (some numerals and symbols). Such tokens appear in the transcript with a blank time and are dropped by downstream feature generation, since an event with no time cannot be aligned to the BOLD signal.
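
If you read transcripts yourself, the same rule (skip rows with a blank time) is easy to reproduce. The sketch below uses made-up sample data and assumes column names word and onset, which are illustrative guesses and not a documented schema; adjust them to the columns in your transcript files.

```python
import csv
import io

# Made-up sample; the column names "word" and "onset" are assumptions.
SAMPLE = """word,onset
hello,0.42
3,
world,1.10
"""


def timed_rows(csv_text: str, time_col: str = "onset") -> list[dict]:
    """Return only the rows whose time column is non-blank."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if (row.get(time_col) or "").strip()]


if __name__ == "__main__":
    for row in timed_rows(SAMPLE):
        print(row["word"], float(row["onset"]))
```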

Speaker turns

If your events.tsv files annotate speaking turns, each transcript gains a turn_sub column naming which subject held the floor when each word began.

Mark turns in each subject's raw events.tsv with the flat trial_type label turn_speaker — one row per window where that subject is the assigned speaker:

onset   duration   trial_type
0.0     12.5       turn_speaker
20.0    8.0        turn_speaker
  • The label records whose turn it is by study design, not who was observed speaking — a turn window may still contain a word uttered by the other partner.
  • Mark only your own turns (turn_speaker); transcribe reads both partners' events and combines them, so there is no separate "listening" label to keep in sync.
  • Windows are [onset, onset + duration). Gaps (silence) are allowed; windows must not overlap — within a subject or across partners. A cross-partner overlap is treated as cross-talk and raises an error.
  • turn_speaker onsets are run-relative — the whole-run events.tsv clock, the same frame as your segment (e.g. trial-1) rows. Write them that way even when audio is split per trial; transcribe shifts each word by its segment's onset before matching, so you never annotate turns in per-trial time.
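
The window rules above (half-open intervals, no overlaps, run-relative time) can be checked with a short script before running transcribe. This is a minimal sketch, not hypline's validation code: Turn, check_no_overlap, and to_run_time are hypothetical names, and the sketch rejects within-subject and cross-partner overlaps the same way.

```python
from typing import Iterable, List, NamedTuple


class Turn(NamedTuple):
    onset: float     # run-relative seconds
    duration: float  # seconds
    subject: str     # bare subject label, e.g. "001"

    @property
    def end(self) -> float:
        # Windows are half-open: [onset, onset + duration)
        return self.onset + self.duration


def check_no_overlap(turns: Iterable[Turn]) -> List[Turn]:
    """Return turns sorted by onset, raising ValueError if any two overlap."""
    ordered = sorted(turns, key=lambda t: t.onset)
    # After sorting, any overlapping pair implies an overlapping adjacent pair.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.onset < prev.end:
            raise ValueError(
                f"turn windows overlap: {prev.subject} "
                f"[{prev.onset}, {prev.end}) and {cur.subject} "
                f"[{cur.onset}, {cur.end})"
            )
    return ordered


def to_run_time(word_onset: float, segment_onset: float) -> float:
    """Shift a per-segment word onset onto the whole-run clock."""
    return segment_onset + word_onset
```

Windows that merely touch, such as [0.0, 12.5) followed by [12.5, 20.0), pass the check because the end of each window is exclusive.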

Each word's turn_sub is the bare subject label (001, 101) whose window contains the word's run-relative start. Words that are un-timed, or fall in a gap between turns, get a blank turn_sub; gap hits are logged as a possible timing/annotation mismatch. Transcripts whose runs carry no turn_speaker rows still get the column, with every value null, so the schema is uniform.
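
The lookup described in this paragraph amounts to finding the half-open window that contains a word's run-relative start. The sketch below illustrates that rule under stated assumptions: assign_turn_sub is a hypothetical function, windows are plain (onset, duration, subject) tuples already merged across partners, and None stands in for a blank value.

```python
from typing import Optional, Sequence, Tuple

Window = Tuple[float, float, str]  # (onset, duration, subject label)


def assign_turn_sub(
    word_start: Optional[float],
    windows: Sequence[Window],
) -> Optional[str]:
    """Return the subject whose window contains word_start, else None."""
    if word_start is None:
        return None  # un-timed word: nothing to match against
    for onset, duration, subject in windows:
        # Half-open interval: onset is included, onset + duration is not.
        if onset <= word_start < onset + duration:
            return subject
    return None  # the word fell in a gap between turns


if __name__ == "__main__":
    windows = [(0.0, 12.5, "001"), (14.0, 5.0, "101"), (20.0, 8.0, "001")]
    for start in (3.0, 13.0, 14.0, None):
        print(start, assign_turn_sub(start, windows))
```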