OCR

Extract text from images and PDFs using HuggingFace image-text-to-text models.

Parameters

Parameter	Default	Description
`--model`		HuggingFace model repo ID
`--revision`	`main`	Model revision (branch, tag, or commit hash)
`--cache-dir`		HuggingFace cache directory for model files
`--allow-fetch`	`--no-allow-fetch`	Allow downloads from HuggingFace Hub (network access required)
`--system-message`		System message for chat models
`--max-tokens`	`4096`	Maximum number of tokens to generate per image
`--max-model-len`		Maximum sequence length (input + output tokens) passed to vLLM. Set this for large-context models to avoid out-of-memory (OOM) errors.
`--seed`	`42`	Random seed, set for more reproducible generation
`--llm-kwargs`	`{}`	Additional kwargs for vLLM's LLM() constructor. Supplied values override task defaults.
`--sampling-kwargs`	`{}`	Additional kwargs for vLLM's SamplingParams() constructor. Supplied values override task defaults.
`--chat-kwargs`	`{}`	Additional kwargs for vLLM's LLM.chat(). Supplied values override task defaults.
`--prompt`		Prompt for image-text-to-text models
`--json-schema`		Constrain the model's output to a JSON schema using vLLM structured outputs. Provide the schema as a JSON string, e.g. `{"type":"object","properties":{"text":{"type":"string"}},"required":["text"]}`. Requires a model that supports structured outputs.
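
The override rule for `--llm-kwargs`, `--sampling-kwargs`, and `--chat-kwargs` amounts to a dictionary merge in which supplied keys win. The sketch below illustrates only that rule. The `merge_kwargs` helper, the assumption that kwargs arrive as a JSON string, and the `task_defaults` values are all hypothetical and are not taken from the task's source code.

```python
import json


def merge_kwargs(task_defaults: dict, user_json: str) -> dict:
    """Merge user-supplied kwargs (a JSON object string) over task defaults.

    Hypothetical helper: it only illustrates "supplied values override
    task defaults" and is not the task's actual implementation.
    """
    user_kwargs = json.loads(user_json) if user_json else {}
    if not isinstance(user_kwargs, dict):
        raise ValueError("kwargs must be a JSON object")
    # Later entries win, so user-supplied values replace the defaults.
    return {**task_defaults, **user_kwargs}


# Assumed defaults, for illustration only.
task_defaults = {"temperature": 0.0, "max_tokens": 4096, "seed": 42}
merged = merge_kwargs(task_defaults, '{"temperature": 0.2, "top_p": 0.9}')
print(merged)
```

Keys that are not supplied keep their default values, and keys that have no default are simply added.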

Supported Input Formats

Image files (PNG, JPEG, TIFF, etc.)
PDF files (each page is rendered and processed separately)

Output Format

text: Plain text (.txt); for multi-page inputs, pages are separated by form-feed characters (\f).
markdown: Formatted output preserving tables, equations, and document structure as markdown/LaTeX (.md); for multi-page inputs, pages are separated by form-feed characters (\f).
json: Structured JSON with per-page text (.json); each page's content is validated, and the pages are returned as a list.
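
Downstream code usually needs the pages back as separate items. The following is a minimal sketch of reading any of the three output formats into a per-page list, assuming UTF-8 files. The `split_pages` and `read_pages` helpers are illustrative names and are not part of the package.

```python
import json
from pathlib import Path


def split_pages(text: str) -> list[str]:
    """Split .txt or .md output into pages on the form-feed separator."""
    return text.split("\f")


def read_pages(path: str) -> list:
    """Return one entry per page from a .txt, .md, or .json output file."""
    p = Path(path)
    if p.suffix == ".json":
        # JSON output is already a list with one element per page.
        return json.loads(p.read_text(encoding="utf-8"))
    # Text and markdown outputs separate pages with form feeds.
    return split_pages(p.read_text(encoding="utf-8"))


demo = "Page 1 text\fPage 2 text\fPage 3 text"
pages = split_pages(demo)
print(len(pages))
```

If a file happens to end with a form feed, the last element of the list is an empty string, so callers may want to drop empty trailing pages.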

[!WARNING] The output format is selected with --output-ext. It determines the file format that is saved and, for .json, triggers validation of the output (markdown formatting is stripped first if present). It does not guarantee that the model produces output in that format, so make sure your --prompt gives explicit instructions about the expected output format.
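
To make the .json behavior concrete, here is a rough sketch of what stripping markdown formatting and then validating can look like for one page of model output. It is an illustration under stated assumptions, not the task's actual validation code: the `parse_page_json` name is hypothetical, and it assumes the only markdown wrapping is a single code fence around the whole response.

```python
import json
import re

FENCE = "`" * 3  # a markdown code-fence marker

# Matches a response that is entirely wrapped in one fenced block,
# optionally tagged with a language such as "json".
_FENCED = re.compile(
    r"^\s*" + FENCE + r"[\w-]*[ \t]*\n(.*?)\n?[ \t]*" + FENCE + r"\s*$",
    re.DOTALL,
)


def parse_page_json(raw: str):
    """Strip an optional surrounding code fence, then parse the rest as JSON."""
    match = _FENCED.match(raw)
    body = match.group(1) if match else raw
    # json.loads raises json.JSONDecodeError if the body is not valid JSON.
    return json.loads(body)


fenced = FENCE + 'json\n{"text": "hello"}\n' + FENCE
print(parse_page_json(fenced))
```

A page whose output is prose rather than JSON fails at the `json.loads` step, which is why the prompt needs to ask for JSON explicitly.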

Models

Any HuggingFace image-text-to-text model is supported. The GOT-OCR model is recommended for general-purpose English document OCR. For multilingual documents, see the alternatives below.

Model	Params	Description	License
`stepfun-ai/GOT-OCR-2.0-hf`	600M	Full-page OCR with format preservation	Apache 2.0
`rednote-hilab/dots.ocr`	3B	Multilingual document parsing (100+ languages) with layout detection	MIT
`zai-org/GLM-4.1V-9B-Thinking`	10B	Bilingual (English/Chinese) VLM with reasoning, up to 4K image resolution	MIT
`Qwen/Qwen2.5-VL-7B-Instruct`	7B	General-purpose VLM with strong OCR and multilingual support	Apache 2.0

Examples

Extract text from a handwritten document

config.yaml

tasks:
  - name: ocr
    kind: local
    module: tigerflow_ml.text.ocr.local
    input_ext: .jpg
    output_ext: .txt
    params:
      model: stepfun-ai/GOT-OCR-2.0-hf
      prompt: "Extract all text from this image"
      allow_fetch: True  # needed only if the model is not already downloaded

Input: 1820 handwritten census form

census-form.txt

The number of Persons within the Division taken by Charles C. Paine
consisting of part of Geauga County, Ohio, and also the number of
persons within the Division Allotted to Eleazer Paine consisting of
the residue of said County, appears in a schedule here unto annexed,
and by us subscribed this 3rd day of December in the year one
thousand eight hundred & twenty.
    Charles C. Paine    Assistants to the
    Eleazer Paine       Marshall of Ohio

Schedule of the whole number of Persons in the County of Geauga
...

Extract text from a document with tables

config.yaml

tasks:
  - name: ocr
    kind: local
    module: tigerflow_ml.text.ocr.local
    input_ext: .png
    output_ext: .md
    params:
      model: stepfun-ai/GOT-OCR-2.0-hf
      prompt: "Extract all text from this image with markdown formatting"
      allow_fetch: True

Input: Statistical Abstract of the United States

abstract.md

# STATISTICAL ABSTRACT OF THE UNITED STATES

## 1. AREA AND POPULATION

**No. 1.—Territorial Expansion of Continental United States and
Acquisitions of Outlying Territories and Possessions**

| ACCESSION | Date | Gross area, square miles |
|---|---|---|
| Aggregate (1930) | | 3,738,395 |
| Continental United States | | 3,026,789 |
| Territory in 1790 | | 892,135 |
| Louisiana Purchase | 1803 | 827,987 |
| Florida | 1819 | 58,666 |
| By treaty with Spain | 1819 | 13,435 |
| Texas | 1845 | 393,196 |
| Oregon | 1846 | 286,541 |
| Mexican Cession | 1848 | 529,189 |
| Gadsden Purchase | 1853 | 29,670 |
...

Extract text from a multi-page PDF

config.yaml

tasks:
  - name: ocr
    kind: local
    module: tigerflow_ml.text.ocr.local
    input_ext: .pdf
    params:
      model: stepfun-ai/GOT-OCR-2.0-hf
      prompt: "Extract all text from this image"
      allow_fetch: True

Input: a multi-page PDF document, e.g. 2602.15607v1.pdf.

2602.15607v1.txt

[Page 1 text...]
␌
[Page 2 text...]
␌
...

Each page is separated by a form-feed character (\f, shown as ␌).

2602.15607v1.md

[Page 1 formatted text...]
␌
[Page 2 formatted text...]
␌
...

2602.15607v1.json

[
  Page 1 formatted text...,
  Page 2 formatted text...,
  ...
]

Run on HPC with Slurm

For bulk OCR across large document collections, use the Slurm variant to distribute work across compute nodes:

config.yaml

tasks:
  - name: ocr
    kind: slurm
    module: tigerflow_ml.text.ocr.slurm
    input_ext: .pdf
    output_ext: .txt
    max_workers: 4
    worker_resources:
      cpus: 1
      gpus: 1
      memory: 16G
      time: 04:00:00
    params:
      model: stepfun-ai/GOT-OCR-2.0-hf
      prompt: "Extract all text from this image"
      cache_dir: ~/path/to/model/hub/

Structured JSON output

If your model supports structured outputs, you can constrain its output by passing a JSON schema with --json-schema:

config.yaml

tasks:
  - name: ocr
    kind: slurm
    module: tigerflow_ml.text.ocr.slurm
    input_ext: .pdf
    output_ext: .json
    max_workers: 4
    worker_resources:
      cpus: 1
      gpus: 2
      memory: 16G
      time: 04:00:00
    params:
      model: Qwen/Qwen2.5-VL-32B-Instruct
      prompt: "Extract all text from this image in valid json format"
      cache_dir: ~/path/to/model/hub/
      max_model_len: 4096
      json_schema: '{"type":"object","properties":{"text":{"type":"string"}},"required":["text"]}'
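
Hand-writing the schema as a one-line string is error-prone for anything larger than this example. One option, sketched below, is to build the schema as a Python dictionary and serialize it with the standard `json` module, then paste the printed string into the config. The variable names are illustrative.

```python
import json

# The same schema as in the config above, written as a dictionary.
schema = {
    "type": "object",
    "properties": {"text": {"type": "string"}},
    "required": ["text"],
}

# Compact separators give a single-line string with no extra whitespace.
schema_str = json.dumps(schema, separators=(",", ":"))
print(schema_str)
```

Because the string contains double quotes, wrapping it in single quotes in YAML, as the config does, avoids any escaping.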