Transcription

Transcribe audio to text using HuggingFace Whisper models.

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| --model | openai/whisper-large-v3 | HuggingFace model repo ID |
| --revision | main | Model revision (branch, tag, or commit hash) |
| --cache-dir | | HuggingFace cache directory for model files |
| --device | auto | Device to use (cuda, cpu, or auto) |
| --language | | Source language code (e.g. en, de). Empty for auto-detect |
| --output-format | text | Output format: text, srt, or json |
| --batch-size | 16 | Batch size for processing audio chunks |
| --chunk-length-s | 30.0 | Length of audio chunks in seconds |
| --stride-length-s | 5.0 | Overlap between chunks in seconds |
| --return-timestamps | false | Return word/segment timestamps (required for SRT) |
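
To make the interaction between --chunk-length-s and --stride-length-s concrete, the sketch below lists the windows a recording would be split into. It assumes the stride is the total overlap between neighboring chunks, as the table describes. The function name chunk_windows is hypothetical and not part of tigerflow_ml. If the value is forwarded unchanged to the transformers pipeline, that library applies the stride on each side of a chunk, so the effective overlap may be larger.

```python
def chunk_windows(duration_s, chunk_length_s=30.0, overlap_s=5.0):
    """Return (start, end) windows in seconds that cover the whole recording.

    Assumption: neighboring windows share exactly `overlap_s` seconds.
    """
    if overlap_s >= chunk_length_s:
        raise ValueError("overlap must be shorter than the chunk length")
    step = chunk_length_s - overlap_s  # distance between window starts
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:  # last window reached the end of the audio
            break
        start += step
    return windows


# A 70-second recording with the documented defaults (30 s chunks, 5 s overlap).
print(chunk_windows(70.0))
# → [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

A larger overlap gives the model more context at chunk boundaries but increases the total amount of audio processed.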

Supported Input Formats

Audio files supported by the Whisper pipeline (WAV, MP3, FLAC, OGG, etc.).

Output Format

The output depends on --output-format: plain text (default), SRT subtitles, or JSON with timestamps. See the examples below.

Models

Any HuggingFace automatic-speech-recognition model is supported. The OpenAI Whisper family is recommended.

| Model | Params | Relative speed | License |
| --- | --- | --- | --- |
| openai/whisper-large-v3 (default) | 1.5B | 1x | Apache 2.0 |
| openai/whisper-large-v3-turbo | 809M | 3x | MIT |
| openai/whisper-medium | 769M | 4x | MIT |
| openai/whisper-small | 244M | 9x | MIT |
| openai/whisper-base | 74M | 24x | MIT |
| openai/whisper-tiny | 39M | 50x | MIT |

Tip

whisper-large-v3-turbo offers a good balance between accuracy and speed: it is nearly as accurate as large-v3 at roughly 3x the speed.
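
For readers who want to try one of these models outside of a pipeline config, the sketch below shows how the documented parameters could map onto the transformers pipeline API. That tigerflow_ml builds its pipeline this way is an assumption. The helper names resolve_device, build_pipeline_kwargs, and load_pipeline are hypothetical. Only transformers.pipeline and torch.cuda.is_available are real library calls.

```python
def resolve_device(device, cuda_available):
    """Turn the documented 'auto' choice into a concrete device string."""
    if device == "auto":
        return "cuda" if cuda_available else "cpu"
    return device


def build_pipeline_kwargs(params, cuda_available=False):
    """Collect keyword arguments for transformers.pipeline (assumed mapping)."""
    kwargs = {
        "task": "automatic-speech-recognition",
        "model": params.get("model", "openai/whisper-large-v3"),
        "revision": params.get("revision", "main"),
        "device": resolve_device(params.get("device", "auto"), cuda_available),
        "batch_size": params.get("batch_size", 16),
        "chunk_length_s": params.get("chunk_length_s", 30.0),
        "stride_length_s": params.get("stride_length_s", 5.0),
    }
    if params.get("cache_dir"):
        # from_pretrained accepts cache_dir; pipeline forwards model_kwargs to it.
        kwargs["model_kwargs"] = {"cache_dir": params["cache_dir"]}
    return kwargs


def load_pipeline(params):
    """Build the ASR pipeline; requires torch and transformers to be installed."""
    # Imported lazily so the helpers above work without the heavy dependencies.
    import torch
    from transformers import pipeline

    return pipeline(**build_pipeline_kwargs(params, torch.cuda.is_available()))
```

With the dependencies installed, load_pipeline({"model": "openai/whisper-large-v3-turbo"}) would return a callable that accepts a path such as lecture.mp3.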

Examples

Transcribe audio to text

config.yaml
tasks:
  - name: transcribe
    kind: local
    module: tigerflow_ml.audio.transcribe.local
    input_ext: .mp3
    output_ext: .txt  # or .srt, .json
    params:
      # output_format: text   # (default) plain text
      # output_format: srt    # SRT subtitles (requires return_timestamps)
      # output_format: json   # raw Whisper output with timestamps
      # return_timestamps: true

Input: an audio recording, e.g. lecture.mp3. Depending on output_format, the task writes one of the following outputs.

lecture.txt
Welcome to today's lecture on distributed computing. We will begin by
reviewing the fundamentals of parallel processing and then move on to
discuss fault tolerance in large-scale systems.

lecture.srt
1
00:00:00,000 --> 00:00:04,500
Welcome to today's lecture on distributed computing.

2
00:00:04,500 --> 00:00:09,200
We will begin by reviewing the fundamentals of parallel processing.

3
00:00:09,200 --> 00:00:14,800
And then move on to discuss fault tolerance in large-scale systems.

lecture.json
{
  "text": "Welcome to today's lecture on distributed computing...",
  "chunks": [
    {
      "text": "Welcome to today's lecture on distributed computing.",
      "timestamp": [0.0, 4.5]
    },
    {
      "text": "We will begin by reviewing the fundamentals of parallel processing.",
      "timestamp": [4.5, 9.2]
    }
  ]
}

Note

SRT and JSON output require return_timestamps: true.
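
The JSON and SRT outputs above carry the same information, which is why both need timestamps. As an illustration, the sketch below converts a chunks list shaped like the lecture.json example into SRT text. The function names srt_timestamp and chunks_to_srt are hypothetical, and this is not necessarily how tigerflow_ml renders SRT internally.

```python
def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks):
    """Render chunks ({"text": ..., "timestamp": [start, end]}) as SRT text."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{chunk['text'].strip()}"
        )
    # SRT cues are separated by a blank line.
    return "\n\n".join(blocks) + "\n"


chunks = [
    {"text": "Welcome to today's lecture on distributed computing.",
     "timestamp": [0.0, 4.5]},
    {"text": "We will begin by reviewing the fundamentals of parallel processing.",
     "timestamp": [4.5, 9.2]},
]
print(chunks_to_srt(chunks))
```

Rounding to whole milliseconds before splitting into hours, minutes, and seconds avoids floating-point artifacts such as 9.2 s rendering as 00:00:09,199.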

Transcribe with language hint

Setting --language skips auto-detection and can improve accuracy when the source language is known:

config.yaml
tasks:
  - name: transcribe
    kind: local
    module: tigerflow_ml.audio.transcribe.local
    input_ext: .mp3
    output_ext: .txt
    params:
      language: en
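
For orientation, the sketch below shows how a language hint is typically passed when calling a transformers Whisper pipeline directly: the hint travels in generate_kwargs, and leaving it out keeps auto-detection on. The helper name build_call_kwargs is hypothetical, and whether tigerflow_ml forwards the language parameter exactly this way is an assumption.

```python
def build_call_kwargs(language="", return_timestamps=False):
    """Build per-call keyword arguments for a transformers ASR pipeline."""
    kwargs = {"return_timestamps": return_timestamps}
    if language:
        # An empty language leaves Whisper's own language detection enabled.
        kwargs["generate_kwargs"] = {"language": language, "task": "transcribe"}
    return kwargs


print(build_call_kwargs("en"))
```

Given a pipeline object asr created with transformers.pipeline, the call would look like asr("lecture.mp3", **build_call_kwargs("en")).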

Run on HPC with Slurm

For bulk transcription of large audio collections, use the Slurm variant to distribute work across compute nodes:

config.yaml
tasks:
  - name: transcribe
    kind: slurm
    module: tigerflow_ml.audio.transcribe.slurm
    input_ext: .mp3
    output_ext: .txt
    max_workers: 4
    worker_resources:
      cpus: 2
      gpus: 1
      memory: 16G
      time: 04:00:00
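
To relate worker_resources to familiar Slurm options, the sketch below translates the block above into sbatch directives. The mapping is an assumption for illustration and the function name sbatch_directives is hypothetical. The sbatch flags themselves (--cpus-per-task, --gres, --mem, --time) are standard Slurm options, but how the Slurm variant actually submits its workers is not described here.

```python
def sbatch_directives(resources):
    """Translate a worker_resources mapping into #SBATCH lines (assumed mapping)."""
    lines = []
    if "cpus" in resources:
        lines.append(f"#SBATCH --cpus-per-task={resources['cpus']}")
    if "gpus" in resources:
        lines.append(f"#SBATCH --gres=gpu:{resources['gpus']}")
    if "memory" in resources:
        lines.append(f"#SBATCH --mem={resources['memory']}")
    if "time" in resources:
        lines.append(f"#SBATCH --time={resources['time']}")
    return lines


# Values copied from the config above.
worker_resources = {"cpus": 2, "gpus": 1, "memory": "16G", "time": "04:00:00"}
print("\n".join(sbatch_directives(worker_resources)))
```

Under this reading, each worker requests one GPU, two CPU cores, and 16 GB of memory for up to four hours.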