Translation¶

Translate text documents using HuggingFace Seq2Seq translation models or text-generation (causal LM) models with a translation prompt.

Parameters¶

Parameter	Default	Description
`--model`	`google/madlad400-3b-mt`	HuggingFace model repo ID
`--revision`	`main`	Model revision (branch, tag, or commit hash)
`--cache-dir`		HuggingFace cache directory for model files
`--device`	`auto`	Device to use (`cuda`, `cpu`, or `auto`)
`--source-lang`	`en`	Source language code (e.g. `en`, `de`, `zh`)
`--target-lang`	`de`	Target language code (e.g. `de`, `en`, `fr`)
`--max-length`	`512`	Maximum number of tokens to generate per chunk
`--prompt`	(see below)	Prompt template for text-generation models (uses `{source_lang}`, `{target_lang}`, `{text}`)
`--encoding`	`utf-8-sig`	Input file encoding

Note

For MADLAD-400 (default), --source-lang and --target-lang use ISO 639-1 language codes (e.g. en, de, fr, zh, ja). For OPUS-MT models, the language pair is encoded in the model name and these params are ignored.

Chunking Strategy¶

Input text is split into sentences and packed into chunks that fit within the model's token limit. This preserves sentence boundaries and provides surrounding context for better translation quality.

If a single sentence exceeds the token limit, it is split at token boundaries as a last resort.

Output Format¶

Plain text, encoded as UTF-8.

Models¶

Any HuggingFace translation (Seq2Seq) model is supported. Causal LMs from the text-generation pipeline can also be used with a --prompt template.

Many-to-many (Seq2Seq)¶

Model	Params	Languages	License
`google/madlad400-3b-mt` (default)	3B	400+	Apache 2.0
`google/madlad400-7b-mt`	7B	400+	Apache 2.0

Single language pair (OPUS-MT)¶

For lightweight, single-pair translation. The naming convention is Helsinki-NLP/opus-mt-{src}-{tgt}.

Model	Direction	License
`Helsinki-NLP/opus-mt-en-de`	English → German	CC-BY-4.0
`Helsinki-NLP/opus-mt-de-en`	German → English	CC-BY-4.0
`Helsinki-NLP/opus-mt-en-fr`	English → French	CC-BY-4.0
`Helsinki-NLP/opus-mt-en-es`	English → Spanish	CC-BY-4.0
`Helsinki-NLP/opus-mt-en-zh`	English → Chinese	CC-BY-4.0

Browse all available language pairs on the Helsinki-NLP hub page.

Text-generation (causal LM)¶

Any causal language model can be used for translation via the --prompt parameter. The model type is auto-detected: if the model is not an encoder-decoder, it is loaded as a text-generation pipeline and the prompt template is used to format each chunk.

The default prompt template is:

Translate the following text from {source_lang} to {target_lang}. Output only the translation, nothing else.

{text}

Examples¶

Translate English to German¶

Uses the default MADLAD-400 model with 400+ language support.

ConfigInputOutput

config.yaml

tasks:
  - name: translate
    kind: local
    module: tigerflow_ml.text.translate.local
    input_ext: .txt
    output_ext: .txt

article.txt

The quick brown fox jumps over the lazy dog. This sentence
contains every letter of the English alphabet. It has been
used as a typing exercise for over a century.

article.txt

Der schnelle braune Fuchs springt über den faulen Hund.
Dieser Satz enthält jeden Buchstaben des englischen Alphabets.
Er wird seit über einem Jahrhundert als Tippübung verwendet.

Translate English to Chinese¶

ConfigInputOutput

config.yaml

tasks:
  - name: translate
    kind: local
    module: tigerflow_ml.text.translate.local
    input_ext: .txt
    output_ext: .txt
    params:
      target_lang: zh

abstract.txt

Artificial intelligence is transforming academic research
across all disciplines.

abstract.txt

人工智能正在改变所有学科的学术研究。

Use OPUS-MT for a specific language pair¶

For lightweight, single-pair translation without downloading a large multilingual model:

ConfigInputOutput

config.yaml

tasks:
  - name: translate
    kind: local
    module: tigerflow_ml.text.translate.local
    input_ext: .txt
    output_ext: .txt
    params:
      model: Helsinki-NLP/opus-mt-es-en

documento.txt

La inteligencia artificial está transformando la investigación
académica en todas las disciplinas.

documento.txt

Artificial intelligence is transforming academic research
across all disciplines.

Note

OPUS-MT models encode the language pair in the model name. The --source-lang and --target-lang params are ignored for these models.

Run on HPC with Slurm¶

For bulk translation of large document collections, use the Slurm variant to distribute work across compute nodes:

config.yaml

tasks:
  - name: translate
    kind: slurm
    module: tigerflow_ml.text.translate.slurm
    input_ext: .txt
    output_ext: .txt
    max_workers: 4
    worker_resources:
      cpus: 2
      gpus: 1
      memory: 16G
      time: 02:00:00
    params:
      target_lang: zh