hypline featuregen
Generate stimulus-derived features — the predictors (X) an encoding model
maps onto the BOLD signal. featuregen is a group of subcommands, one per
feature kind.
hypline featuregen <kind> <dataset-root> [OPTIONS]
| Subcommand | Generates |
|---|---|
| phonemic | phoneme-level articulatory features from transcripts |
| semantic | contextual word embeddings from a Hugging Face causal LM |
| spectral | Whisper log-Mel spectrogram from stimulus audio, aligned to the BOLD TR grid |
| syntactic | per-token POS, dependency, and stopword features from transcripts |
featuregen phonemic
Derive phoneme-level features from word-level transcripts. Each word is looked up in CMUdict to get its phonemes, and each phoneme is represented by its articulatory features (place, manner, voicing, …).
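The lookup chain described above can be sketched in a few lines. This is a minimal illustration, not hypline's implementation: the two-entry pronunciation table stands in for CMUdict, and the articulatory inventory (place, manner, voicing) is a hypothetical toy one chosen only to show the shape of the vectors.

```python
# Toy stand-in for CMUdict: word -> ARPAbet phonemes (stress digits stripped).
TOY_DICT = {
    "cat": ["K", "AE", "T"],
    "dog": ["D", "AO", "G"],
}

# Hypothetical articulatory table: phoneme -> (place, manner, voiced).
ARTICULATORY = {
    "K": ("velar", "stop", False),
    "T": ("alveolar", "stop", False),
    "D": ("alveolar", "stop", True),
    "G": ("velar", "stop", True),
    "AE": ("front", "vowel", True),
    "AO": ("back", "vowel", True),
}

PLACES = sorted({place for place, _, _ in ARTICULATORY.values()})
MANNERS = sorted({manner for _, manner, _ in ARTICULATORY.values()})
PHONEMES = sorted(ARTICULATORY)


def articulatory_vector(phoneme):
    """One-hot place, then one-hot manner, then a 0/1 voicing flag."""
    place, manner, voiced = ARTICULATORY[phoneme]
    vec = [1.0 if p == place else 0.0 for p in PLACES]
    vec += [1.0 if m == manner else 0.0 for m in MANNERS]
    vec.append(1.0 if voiced else 0.0)
    return vec


def identity_vector(phoneme):
    """Plain one-hot over the phoneme inventory (the --no-articulatory idea)."""
    return [1.0 if p == phoneme else 0.0 for p in PHONEMES]


def word_to_features(word, articulatory=True):
    """Look a word up and return one (phoneme, vector) pair per phoneme."""
    encode = articulatory_vector if articulatory else identity_vector
    return [(ph, encode(ph)) for ph in TOY_DICT[word.lower()]]


features = word_to_features("cat")
```

Passing articulatory=False swaps in the identity encoding, which mirrors what the --no-articulatory option is described as doing.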
Inputs
Transcripts produced by transcribe, under stimuli/:
<dataset-root>/stimuli/dyad-030/ses-1/transcript/
└── dyad-030_ses-1_task-conv_run-1_transcript.csv
Options
| Option | Description | Default |
|---|---|---|
| --no-articulatory | Use a plain phoneme identity vector instead of articulatory features | off |
| --desc | Tag outputs as a named variant (alphanumeric), e.g. --desc v2 → desc-v2 | none |
| --skip-confoundgen | Write features only; do not also generate phonemic confounds | off |
| --dyad-ids | Comma-separated dyad IDs to process; omit for all | all |
| --data-filters | Narrow to specific runs/conditions; see Segments and metadata | none |
| --force | Overwrite existing outputs (default skips them) | off |
Features and confounds together
By default, featuregen phonemic also runs confoundgen phonemic
on the features it just wrote — generating the timing-based phonemic
confounds in the same pass. This is the common path; pass
--skip-confoundgen if you want features without confounds.
Example
Generate phonemic features (and their confounds) for all dyads:
hypline featuregen phonemic data/
Generate a second variant, features only — --no-articulatory swaps the
articulatory vector for a plain phoneme identity vector, and --desc identity
names the variant after it so it sits beside the default features:
hypline featuregen phonemic data/ --desc identity --no-articulatory --skip-confoundgen
Outputs
A phonemic feature file per transcript, tagged feat-phonemic, under
features/. With --skip-confoundgen omitted, the matching conf-phonemic
confounds appear too (see confoundgen):
<dataset-root>/
├── features/dyad-030/ses-1/phonemic/
│   └── dyad-030_ses-1_task-conv_run-1_feat-phonemic.parquet
└── confounds/dyad-030/ses-1/  # from the chained confoundgen
    ├── phonemic-onset/
    │   └── dyad-030_ses-1_task-conv_run-1_conf-phonemic_desc-onset.parquet
    └── phonemic-rate/
        └── dyad-030_ses-1_task-conv_run-1_conf-phonemic_desc-rate.parquet
A --desc label lands as desc-<label> and lives in its own subdirectory
(phonemic-<label>/), keeping variants separate. See
The hypline dataset layout.
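The naming rule can be illustrated with a small path builder. This is a sketch under stated assumptions: feature_path is a hypothetical helper, not part of hypline, and the position of desc-<label> in the file name (directly after feat-<kind>) is inferred from the conf-<kind>_desc-<label> pattern of the confound files rather than confirmed by the layout documentation.

```python
from pathlib import PurePosixPath


def feature_path(root, dyad, ses, stem, kind, desc=None):
    """Build the output path for one feature file.

    ASSUMPTION: desc-<label> is appended after feat-<kind>, mirroring the
    conf-<kind>_desc-<label> pattern seen in the confound file names.
    """
    if desc is not None and not desc.isalnum():
        raise ValueError("--desc must be alphanumeric")
    # A variant lives in its own <kind>-<label>/ subdirectory.
    subdir = kind if desc is None else f"{kind}-{desc}"
    name = f"{stem}_feat-{kind}"
    if desc is not None:
        name += f"_desc-{desc}"
    return PurePosixPath(root) / "features" / dyad / ses / subdir / f"{name}.parquet"


stem = "dyad-030_ses-1_task-conv_run-1"
default = feature_path("data", "dyad-030", "ses-1", stem, "phonemic")
variant = feature_path("data", "dyad-030", "ses-1", stem, "phonemic", desc="identity")
```

Here default resolves into phonemic/ and variant into phonemic-identity/, so the two sets of features never overwrite each other.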
Feature file format
Feature files are Parquet tables. Each row is one phoneme with a
start_time (seconds from the start of the stimulus), a turn_sub label
(forward-filled from the transcript; carried through unchanged), and a
feature vector.
Timing is the phoneme's word onset — hypline does not yet have sub-word
audio alignment, so phonemes within a word share that word's onset.
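The shared-onset behaviour can be pictured with the following sketch, which expands word-level rows into phoneme-level rows. It is illustrative only: the phoneme column name, the sub-A speaker label, and the choice to skip words missing from the pronunciation table are assumptions, not documented hypline behaviour.

```python
def expand_to_phoneme_rows(words, pronunciations):
    """Turn word-level rows into phoneme-level rows.

    words: list of dicts with "word", "start_time", and "turn_sub"
    pronunciations: dict mapping a lowercase word to its phoneme list
    """
    rows = []
    for w in words:
        # ASSUMPTION: a word with no pronunciation simply yields no rows.
        for phoneme in pronunciations.get(w["word"].lower(), []):
            rows.append({
                "phoneme": phoneme,
                "start_time": w["start_time"],  # word onset, shared by all its phonemes
                "turn_sub": w["turn_sub"],      # carried through unchanged
            })
    return rows


words = [
    {"word": "hi", "start_time": 0.50, "turn_sub": "sub-A"},
    {"word": "there", "start_time": 0.82, "turn_sub": "sub-A"},
]
pronunciations = {"hi": ["HH", "AY"], "there": ["DH", "EH", "R"]}
rows = expand_to_phoneme_rows(words, pronunciations)
```

Both phonemes of "hi" carry 0.50 and all three phonemes of "there" carry 0.82, which is the within-word tie the paragraph above describes.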
featuregen semantic
Derive contextual word embeddings from any Hugging Face causal (decoder) LM.
The transcript is tokenized and run through the model; each sub-word token's
hidden state at a chosen --layer becomes its feature vector. The model id is
passed verbatim to from_pretrained, so the hub is open and unbounded — any
causal LM with a fast (Rust) tokenizer and a BOS token works.
Inputs
The same transcripts as featuregen phonemic, under
stimuli/. Untimed words (null start_time) are kept as real LM context but
carry their null timing into the output; rows whose word is null are skipped.
The whole transcript is encoded in one forward pass, so it must fit the model's
context window (tokens + a BOS prefix ≤ max_position_embeddings). A long
transcript on a short-context model (GPT-2 caps at 1024 positions) raises an
error rather than being truncated; reach for a longer-context LM instead.
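The context-window constraint amounts to a simple inequality, sketched below. check_fits_context is a hypothetical helper written for illustration; the exception type and message are assumptions, not hypline's actual error.

```python
def check_fits_context(n_tokens, max_position_embeddings, n_bos=1):
    """Raise if the tokenized transcript plus BOS prefix exceeds the context window."""
    needed = n_tokens + n_bos
    if needed > max_position_embeddings:
        # Fail loudly instead of truncating the transcript.
        raise ValueError(
            f"transcript needs {needed} positions but the model allows "
            f"{max_position_embeddings}; use a longer-context LM"
        )
    return needed


# With a 1024-position window, 1023 tokens plus one BOS token just fits.
used = check_fits_context(1023, 1024)
```

One more token would tip the total to 1025 positions and trigger the error.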
Options
| Option | Description | Default |
|---|---|---|
| --model | Hugging Face causal-LM id (e.g. gpt2-xl, meta-llama/Llama-3.2-1B) | (required) |
| --model-dir | Cache dir for downloaded weights | ~/.cache/hypline/huggingface |
| --device | Hardware target (cpu or cuda) | cpu |
| --layer | Hidden-layer index in 0..num_hidden_layers; omit for the middle layer | middle |
| --desc | Tag outputs as a named variant (alphanumeric), e.g. --desc v2 → desc-v2 | none |
| --skip-confoundgen | Write features only; do not also generate semantic confounds | off |
| --dyad-ids | Comma-separated dyad IDs to process; omit for all | all |
| --data-filters | Narrow to specific runs/conditions; see Segments and metadata | none |
| --force | Overwrite existing outputs (default skips them) | off |
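How a --layer value might be resolved is sketched below. Two assumptions are baked in: the range 0..num_hidden_layers is treated as inclusive at both ends (consistent with the num_hidden_layers + 1 hidden states that transformers returns when hidden-state output is enabled, index 0 being the embedding output), and "middle" is taken as num_hidden_layers // 2. hypline may round differently; resolve_layer is a hypothetical helper.

```python
def resolve_layer(num_hidden_layers, layer=None):
    """Validate a --layer value, or pick a middle layer when none is given.

    ASSUMPTION: "middle" means num_hidden_layers // 2.
    """
    if layer is None:
        return num_hidden_layers // 2
    if not 0 <= layer <= num_hidden_layers:
        raise ValueError(f"layer must be in 0..{num_hidden_layers}, got {layer}")
    return layer


# For a 48-layer model such as gpt2-xl, the assumed default is layer 24.
default_layer = resolve_layer(48)
last_layer = resolve_layer(48, layer=48)
```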
Features and confounds together
As with phonemic, featuregen semantic also runs
confoundgen semantic by default. Pass --skip-confoundgen
for features without confounds.
Example
Generate GPT-2 XL semantic features (and their confounds) for all dyads:
hypline featuregen semantic data/ --model gpt2-xl
Outputs
A semantic feature file per transcript, tagged feat-semantic, under
features/. Each row carries start_time, turn_sub (forward-filled from the
transcript; carried through unchanged), word, token, the feature
vector, and — for any non-zero layer — per-token LM metrics (rank,
true_prob, entropy). The Parquet footer records hf_model, hf_tokenizer
(equal to hf_model unless overridden via the Python API), and
layer. With --skip-confoundgen omitted, the matching conf-semantic
confounds appear too (see confoundgen):
<dataset-root>/
├── features/dyad-030/ses-1/semantic/
│   └── dyad-030_ses-1_task-conv_run-1_feat-semantic.parquet
└── confounds/dyad-030/ses-1/  # from the chained confoundgen
    ├── semantic-onset/
    │   └── dyad-030_ses-1_task-conv_run-1_conf-semantic_desc-onset.parquet
    └── semantic-rate/
        └── dyad-030_ses-1_task-conv_run-1_conf-semantic_desc-rate.parquet
Causal LMs only
The model must be a causal/decoder LM with a fast tokenizer and a BOS token.
Encoder checkpoints (BERT and the like) are rejected up front, because
from_pretrained would otherwise load them silently and emit garbage features.
Gated models
Some models (e.g. Llama) are license-gated. Request access on the model's Hugging Face page, then set an access token so the download can authenticate:
export HF_TOKEN=hf_...
hypline featuregen semantic data/ --model meta-llama/Llama-3.2-1B
transformers reads HF_TOKEN automatically — no extra flag. Access must be
granted to the same account that issued the token, or the download fails.
featuregen spectral
Derive a log-Mel spectrogram from the stimulus audio using a
Whisper feature extractor — the same
front-end that turns audio into the input Whisper's encoder sees. Unlike
phonemic and semantic, this reads audio directly (no transcript needed) and
its output is pre-aligned to the run's BOLD TR grid: one log-Mel vector per
TR, ready to feed an encoding model without a downstream binning step.
Inputs
Stimulus audio under stimuli/ — the same files transcribe
reads, selected by --audio-ext:
<dataset-root>/stimuli/dyad-030/ses-1/audio/
└── dyad-030_ses-1_task-conv_run-1_audio.wav
Aligning to the TR grid needs the run's BOLD timing (TR and number of frames). The dyad's partners share one simultaneous scan, so any partner's BOLD supplies it. A dyad with no resolvable BOLD raises.
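The alignment step can be pictured as pooling spectrogram frames into TR-sized windows. The sketch below uses mean pooling and zero-padding for TRs that fall after the audio ends; both are assumptions made for illustration (the actual downsample_method is whatever the file's footer records), and downsample_to_tr is a hypothetical helper.

```python
def downsample_to_tr(frames, frame_rate, tr, n_trs):
    """Average frame-level vectors within each TR window.

    frames: list of equal-length feature vectors, one per spectrogram frame
    frame_rate: frames per second (a 160-sample hop at 16 kHz gives 100)
    tr: repetition time in seconds
    n_trs: number of BOLD volumes in the run
    ASSUMPTION: mean pooling; the real downsample_method may differ.
    """
    n_feat = len(frames[0])
    out = []
    for i in range(n_trs):
        start = round(i * tr * frame_rate)
        stop = min(round((i + 1) * tr * frame_rate), len(frames))
        window = frames[start:stop]
        if window:
            row = [sum(f[k] for f in window) / len(window) for k in range(n_feat)]
        else:
            # Audio ended before the scan did: pad with zeros (an assumption).
            row = [0.0] * n_feat
        out.append({"start_time": i * tr, "feature": row})
    return out


# Four frames per second, TR = 1 s, 3 TRs, 2 Mel bands; the audio covers only 2 s.
frames = [[float(t), 1.0] for t in range(8)]
rows = downsample_to_tr(frames, frame_rate=4, tr=1.0, n_trs=3)
```

The output always has exactly n_trs rows, which is why the BOLD timing (TR and number of frames) has to be resolvable before any audio is processed.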
Options
| Option | Description | Default |
|---|---|---|
| --audio-ext | Extension of the audio files, e.g. .wav | (required) |
| --model | Whisper model whose extractor produces the spectrogram: tiny, base, small, medium, large-v2, large-v3 | tiny |
| --model-dir | Cache dir for downloaded weights | ~/.cache/hypline/huggingface |
| --desc | Tag outputs as a named variant (alphanumeric), e.g. --desc v2 → desc-v2 | none |
| --dyad-ids | Comma-separated dyad IDs to process; omit for all | all |
| --data-filters | Narrow to specific runs/conditions; see Segments and metadata | none |
| --force | Overwrite existing outputs (default skips them) | off |
No --device, no confounds
The Whisper feature extractor is a CPU Mel transform with no forward pass, so
there is no --device option. And unlike phonemic and semantic, spectral
has no chained confoundgen step — it writes features only.
Example
Generate spectral features for all dyads with the default tiny extractor:
hypline featuregen spectral data/ --audio-ext .wav
Outputs
A spectral feature file per stimulus, tagged feat-spectral, under features/:
<dataset-root>/features/dyad-030/ses-1/spectral/
└── dyad-030_ses-1_task-conv_run-1_feat-spectral.parquet
A --desc label lands as desc-<label> in its own subdirectory
(spectral-<label>/). See
The hypline dataset layout.
Feature file format
Unlike the per-word/per-phoneme feature files, spectral rows are
per-TR: each row is one TR with its start_time (seconds from the start
of the stimulus) and a log-Mel feature vector (length = the model's number
of Mel bands). The Parquet footer records the model, sampling_rate,
hop_length, n_mels, chunk_length, repetition_time, and
downsample_method.
featuregen syntactic
Derive per-token syntactic-function features from word-level transcripts with
spaCy. Each token's feature vector is fixed-width: a one-hot of its POS tag,
concatenated with a one-hot of its dependency relation, plus a final 0/1
stopword dimension. The model is fixed (en_core_web_lg) and auto-downloaded on
first use; there is no --model option.
Words are tokenized and tagged one conversational turn at a time: the
dependency parser needs coherent utterances, so words are grouped by turn_sub
(which subject held the floor), and each maximal run of consecutive words with
the same turn_sub is parsed as one document.
A word that spaCy splits into several tokens ("don't" → do + n't) yields
one row per piece, each inheriting the source word's start_time by char-span
overlap.
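The turn grouping and the char-span mapping can be sketched without spaCy itself. In the example below the token offsets are written out by hand to mimic how a tokenizer splits "don't"; the helper names, the sub-A/sub-B labels, and the choice to join words with single spaces are assumptions made for illustration.

```python
from itertools import groupby


def group_turns(words):
    """Split word rows into maximal runs of consecutive rows sharing a turn_sub."""
    return [list(run) for _, run in groupby(words, key=lambda w: w["turn_sub"])]


def word_spans(run):
    """Join a run's words with single spaces and record each word's char span."""
    spans, pos = [], 0
    for w in run:
        spans.append((pos, pos + len(w["word"])))
        pos += len(w["word"]) + 1  # +1 for the joining space
    return " ".join(w["word"] for w in run), spans


def source_word_index(token_start, token_end, spans):
    """Index of the word whose char span overlaps the token's span the most."""
    overlaps = [max(0, min(token_end, e) - max(token_start, s)) for s, e in spans]
    return overlaps.index(max(overlaps))


words = [
    {"word": "I", "start_time": 0.0, "turn_sub": "sub-A"},
    {"word": "don't", "start_time": 0.2, "turn_sub": "sub-A"},
    {"word": "okay", "start_time": 0.9, "turn_sub": "sub-B"},
    {"word": "know", "start_time": 1.4, "turn_sub": "sub-A"},
]
turns = group_turns(words)          # three runs: A, B, A
text, spans = word_spans(turns[0])  # text "I don't" with spans (0, 1) and (2, 7)

# "don't" splits into "do" (chars 2-4) and "n't" (chars 4-7); both pieces
# overlap the second word, so both inherit its start_time.
piece_times = [
    turns[0][source_word_index(s, e, spans)]["start_time"]
    for s, e in [(2, 4), (4, 7)]
]
```

Note that the final "know" forms its own run even though sub-A spoke earlier: runs are maximal stretches of consecutive words, not all words by one speaker.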
Inputs
The same transcripts as featuregen phonemic, under
stimuli/. Untimed words (null start_time) are retained as parse context (a
dropped word would fragment its turn's sentence and mis-tag neighbors) but carry
their null timing into the output. Rows whose word is null are dropped with a warning.
Options
| Option | Description | Default |
|---|---|---|
| --desc | Tag outputs as a named variant (alphanumeric), e.g. --desc v2 → desc-v2 | none |
| --dyad-ids | Comma-separated dyad IDs to process; omit for all | all |
| --data-filters | Narrow to specific runs/conditions; see Segments and metadata | none |
| --force | Overwrite existing outputs (default skips them) | off |
Fixed model, no --device, no confounds
The spaCy model is fixed (en_core_web_lg), so there is no --model option;
the parse runs on CPU, so there is no --device either. And unlike phonemic
and semantic, syntactic has no chained confoundgen step — it writes
features only.
Example
Generate syntactic features for all dyads:
hypline featuregen syntactic data/
Outputs
A syntactic feature file per transcript, tagged feat-syntactic, under
features/:
<dataset-root>/features/dyad-030/ses-1/syntactic/
└── dyad-030_ses-1_task-conv_run-1_feat-syntactic.parquet
A --desc label lands as desc-<label> in its own subdirectory
(syntactic-<label>/). See
The hypline dataset layout.
Feature file format
Each row is one spaCy token with its start_time (seconds from the start
of the stimulus), its turn_sub label (the speaker label the parse grouped on),
the token text, its source word, and a one-hot
feature vector (POS ⊕ dependency ⊕ stopword). Width and column order are fit
to the model's full label vocabulary, so they are fixed across transcripts; a
label outside that vocabulary leaves its block all-zero. The Parquet footer
records spacy_model and the feature_dim_labels naming each dimension.
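The vector layout can be sketched as three concatenated blocks. The label lists below are hypothetical and heavily truncated (the real vocabularies come from the spaCy model), and the pos_/dep_/is_stop dimension names are an assumed naming scheme, not the documented contents of feature_dim_labels.

```python
# Hypothetical, truncated label vocabularies for illustration only.
POS_LABELS = ["ADJ", "NOUN", "PRON", "VERB"]
DEP_LABELS = ["ROOT", "dobj", "nsubj"]


def one_hot(label, vocab):
    """One-hot over vocab; an unknown label leaves the block all-zero."""
    return [1.0 if label == v else 0.0 for v in vocab]


def syntactic_vector(pos, dep, is_stop):
    """POS block, then dependency block, then a single 0/1 stopword flag."""
    return one_hot(pos, POS_LABELS) + one_hot(dep, DEP_LABELS) + [1.0 if is_stop else 0.0]


# Names for each dimension, in the same order as the vector.
feature_dim_labels = (
    [f"pos_{p}" for p in POS_LABELS] + [f"dep_{d}" for d in DEP_LABELS] + ["is_stop"]
)

vec = syntactic_vector("PRON", "nsubj", True)
unknown = syntactic_vector("XYZ", "nsubj", False)
```

Because the width depends only on the vocabularies, vec and unknown have the same length; the unrecognized POS tag simply leaves the first four dimensions at zero.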