OCR

Extract text from images and PDFs using HuggingFace image-text-to-text models.

Parameters

| Parameter | Default | Description |
|---|---|---|
| --model | | HuggingFace model repo ID |
| --revision | main | Model revision (branch, tag, or commit hash) |
| --cache-dir | | HuggingFace cache directory for model files |
| --allow-fetch | --no-allow-fetch | Allow downloads from HuggingFace Hub (network access required) |
| --system-message | | System message for chat models |
| --max-tokens | 4096 | Maximum number of tokens to generate per image |
| --max-model-len | | Maximum sequence length (input + output tokens) passed to vLLM. Set this for large-context models to avoid OOM. |
| --temperature | 0 | The model temperature. Lower numbers make models more deterministic |
| --seed | 42 | The seed to set for more reproducible behavior |
| --llm-kwargs | {} | Additional kwargs for vLLM's LLM() constructor. Supplied values override task defaults. |
| --sampling-kwargs | {} | Additional kwargs for vLLM's SamplingParams() constructor. Supplied values override task defaults. |
| --chat-kwargs | {} | Additional kwargs for vLLM's LLM.chat(). Supplied values override task defaults. |
| --prompt | Extract all text from this image. | Prompt for image-text-to-text models |
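
The three kwargs parameters are described as overriding task defaults. The sketch below illustrates that precedence with a plain dictionary merge. It is only an illustration: the `merge_kwargs` helper, the contents of `task_defaults`, and the assumption that the value arrives as a JSON object are all hypothetical and not taken from the task's implementation.

```python
import json


def merge_kwargs(task_defaults: dict, supplied: dict) -> dict:
    """Return a new dict in which supplied values win over task defaults."""
    return {**task_defaults, **supplied}


# Hypothetical defaults, mirroring the values listed in the parameter table.
task_defaults = {"temperature": 0, "seed": 42, "max_tokens": 4096}

# Assumed shape of a user-supplied sampling-kwargs value (a JSON object).
supplied = json.loads('{"temperature": 0.2, "top_p": 0.9}')

merged = merge_kwargs(task_defaults, supplied)
print(merged)
```

Under this reading, a key given in the kwargs replaces the task's value for the same key, and keys the task does not set are simply added.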

Supported Input Formats

  • Image files (PNG, JPEG, TIFF, etc.)
  • PDF files (each page is rendered and processed separately)

Output Format

Plain text. For multi-page inputs, pages are separated by form-feed characters (\f).
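
Because pages are delimited by form-feed characters, downstream code can recover per-page text with a single split. A minimal sketch, assuming the output has already been read into a string; the `split_pages` helper and the sample text are hypothetical.

```python
def split_pages(ocr_text: str) -> list[str]:
    """Split OCR output into per-page strings on the form-feed separator."""
    return [page.strip("\n") for page in ocr_text.split("\f")]


# Hypothetical three-page output; real text would come from the task's output file.
sample = "Page one text\n\fPage two text\n\fPage three text\n"

pages = split_pages(sample)
for number, page in enumerate(pages, start=1):
    print(f"page {number}: {len(page)} characters")
```

If a file happens to end with a form-feed, the split yields a trailing empty string, which callers may want to drop.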

Models

Any HuggingFace image-text-to-text model is supported. The GOT-OCR model is recommended for general-purpose English document OCR. For multilingual documents, see the alternatives below.

| Model | Params | Description | License |
|---|---|---|---|
| stepfun-ai/GOT-OCR-2.0-hf | 600M | Full-page OCR with format preservation | Apache 2.0 |
| rednote-hilab/dots.ocr | 3B | Multilingual document parsing (100+ languages) with layout detection | MIT |
| zai-org/GLM-4.1V-9B-Thinking | 10B | Bilingual (English/Chinese) VLM with reasoning, up to 4K image resolution | MIT |
| Qwen/Qwen2.5-VL-7B-Instruct | 7B | General-purpose VLM with strong OCR and multilingual support | Apache 2.0 |

Examples

Extract text from a handwritten document

config.yaml
tasks:
  - name: ocr
    kind: local
    module: tigerflow_ml.text.ocr.local
    input_ext: .jpg
    params:
      model: stepfun-ai/GOT-OCR-2.0-hf
      allow_fetch: True  # needed if the model is not already downloaded

1820 handwritten census form

census-form.txt
The number of Persons within the Division taken by Charles C. Paine
consisting of part of Geauga County, Ohio, and also the number of
persons within the Division Allotted to Eleazer Paine consisting of
the residue of said County, appears in a schedule here unto annexed,
and by us subscribed this 3rd day of December in the year one
thousand eight hundred & twenty.
    Charles C. Paine    Assistants to the
    Eleazer Paine       Marshall of Ohio

Schedule of the whole number of Persons in the County of Geauga
...

census-form.md
The number of Persons within the Division taken by **Charles C. Paine**
consisting of part of **Geauga County, Ohio**, and also the number of
persons within the Division Allotted to **Eleazer Paine** consisting of
the residue of said County...

*Schedule of the whole number of Persons in the County of Geauga*

...

census-form.json
{
  "pages": [
    {
      "page": 1,
      "text": "The number of Persons within the Division taken by Charles C. Paine consisting of part of Geauga County, Ohio..."
    }
  ]
}

Extract text from a document with tables

config.yaml
tasks:
  - name: ocr
    kind: local
    module: tigerflow_ml.text.ocr.local
    input_ext: .png
    output_ext: .txt
    params:
      model: stepfun-ai/GOT-OCR-2.0-hf
      allow_fetch: True

Statistical Abstract of the United States

abstract.txt
STATISTICAL ABSTRACT OF THE UNITED STATES

1. AREA AND POPULATION

No. 1.—TERRITORIAL EXPANSION OF CONTINENTAL UNITED STATES AND
ACQUISITIONS OF OUTLYING TERRITORIES AND POSSESSIONS

ACCESSION    Date    Gross area, square miles
Aggregate (1930)    3,738,395
Continental United States    3,026,789
Territory in 1790    892,135
Louisiana Purchase    1803    827,987
Florida    1819    58,666
...

abstract.md
# STATISTICAL ABSTRACT OF THE UNITED STATES

## 1. AREA AND POPULATION

**No. 1.—Territorial Expansion of Continental United States and
Acquisitions of Outlying Territories and Possessions**

| ACCESSION | Date | Gross area, square miles |
|---|---|---|
| Aggregate (1930) | | 3,738,395 |
| Continental United States | | 3,026,789 |
| Territory in 1790 | | 892,135 |
| Louisiana Purchase | 1803 | 827,987 |
| Florida | 1819 | 58,666 |
| By treaty with Spain | 1819 | 13,435 |
| Texas | 1845 | 393,196 |
| Oregon | 1846 | 286,541 |
| Mexican Cession | 1848 | 529,189 |
| Gadsden Purchase | 1853 | 29,670 |
...

abstract.json
{
  "pages": [
    {
      "page": 1,
      "text": "STATISTICAL ABSTRACT OF THE UNITED STATES\n\n1. AREA AND POPULATION\n\nNo. 1.—TERRITORIAL EXPANSION OF CONTINENTAL UNITED STATES..."
    }
  ]
}
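
When the Markdown output contains a pipe table like the one in abstract.md, the rows can be turned into records for further processing. The following is a rough sketch rather than a full Markdown parser: `parse_pipe_table` is a hypothetical helper, it assumes a header row followed by a separator row, and it does not handle escaped pipes inside cells.

```python
def parse_pipe_table(markdown: str) -> list[dict[str, str]]:
    """Parse the pipe-table lines in a Markdown string into a list of row dicts."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip headings, prose, and blank lines
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    return [dict(zip(header, row)) for row in body]


# A few rows copied from the abstract.md example.
sample = """\
| ACCESSION | Date | Gross area, square miles |
|---|---|---|
| Aggregate (1930) | | 3,738,395 |
| Louisiana Purchase | 1803 | 827,987 |
"""

records = parse_pipe_table(sample)
# Convert the comma-grouped area strings to integers.
areas = [int(r["Gross area, square miles"].replace(",", "")) for r in records]
print(records)
print(areas)
```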

Extract text from a multi-page PDF

config.yaml
tasks:
  - name: ocr
    kind: local
    module: tigerflow_ml.text.ocr.local
    input_ext: .pdf
    output_ext: .txt
    params:
      model: stepfun-ai/GOT-OCR-2.0-hf
      allow_fetch: True

A multi-page PDF document, e.g. 2602.15607v1.pdf.

2602.15607v1.txt
[Page 1 text...]
␌
[Page 2 text...]
␌
...

Each page is separated by a form-feed character (\f, shown as ␌).

2602.15607v1.md
[Page 1 formatted text...]
␌
[Page 2 formatted text...]
␌
...

2602.15607v1.json
{
  "pages": [
    {
      "page": 1,
      "text": "..."
    },
    {
      "page": 2,
      "text": "..."
    }
  ]
}
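
The JSON example carries the same per-page text as the plain-text output, keyed by page number. A small sketch of converting between the two, assuming only the `pages`, `page`, and `text` fields shown in the examples; the `pages_to_text` helper and the sample document are hypothetical.

```python
import json


def pages_to_text(json_doc: str) -> str:
    """Join the per-page text of a JSON result into form-feed-separated plain text."""
    data = json.loads(json_doc)
    ordered = sorted(data["pages"], key=lambda entry: entry["page"])
    return "\f".join(entry["text"] for entry in ordered)


# Hypothetical two-page result, deliberately out of order to show the sort.
example = '{"pages": [{"page": 2, "text": "second"}, {"page": 1, "text": "first"}]}'

text = pages_to_text(example)
print(repr(text))
```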

Run on HPC with Slurm

For bulk OCR across large document collections, use the Slurm variant to distribute work across compute nodes:

config.yaml
tasks:
  - name: ocr
    kind: slurm
    module: tigerflow_ml.text.ocr.slurm
    input_ext: .pdf
    output_ext: .txt
    max_workers: 4
    worker_resources:
      cpus: 2
      gpus: 1
      memory: 16G
      time: 04:00:00
    params:
      model: stepfun-ai/GOT-OCR-2.0-hf
      cache_dir: ~/path/to/model/hub/
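
The Slurm example sets cache_dir but not allow_fetch, which presumably means the model files are expected to be in that cache before the job starts. One way to populate it in advance is `snapshot_download` from the `huggingface_hub` package, run from a machine with network access. This is a sketch under assumptions: the helper names are hypothetical, and it assumes the task reads the cache in the standard HuggingFace Hub layout.

```python
from pathlib import Path


def download_kwargs(repo_id: str, cache_dir: str, revision: str = "main") -> dict:
    """Build the keyword arguments for huggingface_hub.snapshot_download."""
    return {
        "repo_id": repo_id,
        "revision": revision,
        "cache_dir": str(Path(cache_dir).expanduser()),  # expand a leading ~
    }


def prefetch(repo_id: str, cache_dir: str, revision: str = "main") -> str:
    """Download a model snapshot into cache_dir and return its local path."""
    # Imported lazily so the module loads without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(repo_id, cache_dir, revision))


# Placeholder path copied from the config above; replace it with a real directory.
kwargs = download_kwargs("stepfun-ai/GOT-OCR-2.0-hf", "~/path/to/model/hub/")
print(kwargs)

# Uncomment on a machine with network access:
# prefetch("stepfun-ai/GOT-OCR-2.0-hf", "~/path/to/model/hub/")
```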