OCR
Extract text from images and PDFs using HuggingFace image-text-to-text models.
Parameters
| Parameter | Default | Description |
|---|---|---|
| `--model` | | HuggingFace model repo ID |
| `--revision` | `main` | Model revision (branch, tag, or commit hash) |
| `--cache-dir` | | HuggingFace cache directory for model files |
| `--allow-fetch` | `--no-allow-fetch` | Allow downloads from HuggingFace Hub (network access required) |
| `--system-message` | | System message for chat models |
| `--max-tokens` | `4096` | Maximum number of tokens to generate per image |
| `--max-model-len` | | Maximum sequence length (input + output tokens) passed to vLLM. Set this for large-context models to avoid OOM. |
| `--temperature` | `0` | The model temperature. Lower numbers make models more deterministic. |
| `--seed` | `42` | The seed to set for more reproducible behavior |
| `--llm-kwargs` | `{}` | Additional kwargs for vLLM's LLM() constructor. Supplied values override task defaults. |
| `--sampling-kwargs` | `{}` | Additional kwargs for vLLM's SamplingParams() constructor. Supplied values override task defaults. |
| `--chat-kwargs` | `{}` | Additional kwargs for vLLM's LLM.chat(). Supplied values override task defaults. |
| `--prompt` | `Extract all text from this image.` | Prompt for image-text-to-text models |
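
The three kwargs options each state that supplied values override task defaults. One way to picture that rule is a dictionary merge in which user-supplied keys win. The sketch below is illustrative only: the `merge_kwargs` helper and the default values are hypothetical and not taken from the tigerflow_ml source, and it assumes the kwargs arrive as a JSON object string.

```python
import json


def merge_kwargs(defaults: dict, supplied_json: str) -> dict:
    """Return defaults updated with user-supplied kwargs (supplied keys win)."""
    supplied = json.loads(supplied_json)
    if not isinstance(supplied, dict):
        raise ValueError("kwargs must be a JSON object")
    # Later entries in the merge take precedence, so supplied values override defaults.
    return {**defaults, **supplied}


# Hypothetical task defaults, chosen to mirror the parameter table.
sampling_defaults = {"temperature": 0, "max_tokens": 4096, "seed": 42}

merged = merge_kwargs(sampling_defaults, '{"temperature": 0.2, "top_p": 0.9}')
print(merged)
```

Keys that are not supplied keep their default, so passing `{}` leaves the defaults unchanged.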
Supported Input Formats
- Image files (PNG, JPEG, TIFF, etc.)
- PDF files (each page is rendered and processed separately)
Output Format
Plain text. For multi-page inputs, pages are separated by form-feed characters (\f).
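
Because pages are joined with form-feed characters, downstream code can recover per-page text with a plain string split. A minimal sketch in Python; the helper name `split_pages` is hypothetical and not part of tigerflow_ml, and dropping an empty segment after a trailing form feed is an assumption about how the output ends.

```python
def split_pages(text: str) -> list[str]:
    """Split OCR output into per-page strings on form-feed characters."""
    pages = text.split("\f")
    # Assumption: a trailing form feed does not mark an extra, empty page.
    if len(pages) > 1 and pages[-1].strip() == "":
        pages.pop()
    # Trim the newlines that surround each form feed.
    return [page.strip("\n") for page in pages]


sample = "Page one text\n\fPage two text\n\f"
for number, page in enumerate(split_pages(sample), start=1):
    print(f"page {number}: {page}")
```

Single-page output contains no form feed, so the function returns a one-element list in that case.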
Models
Any HuggingFace image-text-to-text model is supported. The GOT-OCR model is recommended for general-purpose English document OCR. For multilingual documents, see the alternatives below.
| Model | Params | Description | License |
|---|---|---|---|
| stepfun-ai/GOT-OCR-2.0-hf | 600M | Full-page OCR with format preservation | Apache 2.0 |
| rednote-hilab/dots.ocr | 3B | Multilingual document parsing (100+ languages) with layout detection | MIT |
| zai-org/GLM-4.1V-9B-Thinking | 10B | Bilingual (English/Chinese) VLM with reasoning, up to 4K image resolution | MIT |
| Qwen/Qwen2.5-VL-7B-Instruct | 7B | General-purpose VLM with strong OCR and multilingual support | Apache 2.0 |
Examples
Extract text from a handwritten document
config.yaml
tasks:
  - name: ocr
    kind: local
    module: tigerflow_ml.text.ocr.local
    input_ext: .jpg
    params:
      model: stepfun-ai/GOT-OCR-2.0-hf
      allow_fetch: True  # required if the model is not already downloaded

census-form.txt
The number of Persons within the Division taken by Charles C. Paine
consisting of part of Geauga County, Ohio, and also the number of
persons within the Division Allotted to Eleazer Paine consisting of
the residue of said County, appears in a schedule here unto annexed,
and by us subscribed this 3rd day of December in the year one
thousand eight hundred & twenty.
Charles C. Paine Assistants to the
Eleazer Paine Marshall of Ohio
Schedule of the whole number of Persons in the County of Geauga
...
census-form.md
The number of Persons within the Division taken by **Charles C. Paine**
consisting of part of **Geauga County, Ohio**, and also the number of
persons within the Division Allotted to **Eleazer Paine** consisting of
the residue of said County...
*Schedule of the whole number of Persons in the County of Geauga*
...
census-form.json
{
  "pages": [
    {
      "page": 1,
      "text": "The number of Persons within the Division taken by Charles C. Paine consisting of part of Geauga County, Ohio..."
    }
  ]
}
Extract text from a document with tables
config.yaml
tasks:
  - name: ocr
    kind: local
    module: tigerflow_ml.text.ocr.local
    input_ext: .png
    output_ext: .txt
    params:
      model: stepfun-ai/GOT-OCR-2.0-hf
      allow_fetch: True

abstract.txt
STATISTICAL ABSTRACT OF THE UNITED STATES
1. AREA AND POPULATION
No. 1.—TERRITORIAL EXPANSION OF CONTINENTAL UNITED STATES AND
ACQUISITIONS OF OUTLYING TERRITORIES AND POSSESSIONS
ACCESSION Date Gross area, square miles
Aggregate (1930) 3,738,395
Continental United States 3,026,789
Territory in 1790 892,135
Louisiana Purchase 1803 827,987
Florida 1819 58,666
...
abstract.md
# STATISTICAL ABSTRACT OF THE UNITED STATES
## 1. AREA AND POPULATION
**No. 1.—Territorial Expansion of Continental United States and
Acquisitions of Outlying Territories and Possessions**
| ACCESSION | Date | Gross area, square miles |
|---|---|---|
| Aggregate (1930) | | 3,738,395 |
| Continental United States | | 3,026,789 |
| Territory in 1790 | | 892,135 |
| Louisiana Purchase | 1803 | 827,987 |
| Florida | 1819 | 58,666 |
| By treaty with Spain | 1819 | 13,435 |
| Texas | 1845 | 393,196 |
| Oregon | 1846 | 286,541 |
| Mexican Cession | 1848 | 529,189 |
| Gadsden Purchase | 1853 | 29,670 |
...
abstract.json
{
  "pages": [
    {
      "page": 1,
      "text": "STATISTICAL ABSTRACT OF THE UNITED STATES\n\n1. AREA AND POPULATION\n\nNo. 1.—TERRITORIAL EXPANSION OF CONTINENTAL UNITED STATES..."
    }
  ]
}
Extract text from a multi-page PDF
config.yaml
tasks:
  - name: ocr
    kind: local
    module: tigerflow_ml.text.ocr.local
    input_ext: .pdf
    output_ext: .txt
    params:
      model: stepfun-ai/GOT-OCR-2.0-hf
      allow_fetch: True

A multi-page PDF document, e.g. 2602.15607v1.pdf.
2602.15607v1.txt
[Page 1 text...]
␌
[Page 2 text...]
␌
...
Each page is separated by a form-feed character (\f, shown as ␌).
2602.15607v1.md
[Page 1 formatted text...]
␌
[Page 2 formatted text...]
␌
...
2602.15607v1.json
{
  "pages": [
    {
      "page": 1,
      "text": "..."
    },
    {
      "page": 2,
      "text": "..."
    }
  ]
}
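
The JSON layout shown here keeps each page as an object with a `page` number and its `text`. A minimal sketch for reading it back, assuming only those two fields; the function name `page_texts` and the inline sample document are illustrative and not part of tigerflow_ml.

```python
import json


def page_texts(document: dict) -> dict[int, str]:
    """Map page numbers to page text for a parsed OCR JSON document."""
    return {entry["page"]: entry["text"] for entry in document["pages"]}


# Inline stand-in for the contents of a .json output file.
sample = json.loads(
    '{"pages": [{"page": 1, "text": "first"}, {"page": 2, "text": "second"}]}'
)
texts = page_texts(sample)

# Rebuild the plain-text form, with pages joined by form-feed characters.
plain = "\f".join(texts[number] for number in sorted(texts))
print(texts[2])
```

Sorting by page number before joining keeps the page order stable even if the entries are not listed in order.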
Run on HPC with Slurm
For bulk OCR across large document collections, use the Slurm variant to distribute work across compute nodes:
config.yaml
tasks:
  - name: ocr
    kind: slurm
    module: tigerflow_ml.text.ocr.slurm
    input_ext: .pdf
    output_ext: .txt
    max_workers: 4
    worker_resources:
      cpus: 2
      gpus: 1
      memory: 16G
      time: 04:00:00
    params:
      model: stepfun-ai/GOT-OCR-2.0-hf
      cache_dir: ~/path/to/model/hub/