OCR

Extract text from images and PDFs using HuggingFace image-text-to-text models.

Parameters

Parameter	Default	Description
`--model`	`stepfun-ai/GOT-OCR-2.0-hf`	HuggingFace model repo ID
`--revision`	`main`	Model revision (branch, tag, or commit hash)
`--cache-dir`		HuggingFace cache directory for model files
`--device`	`auto`	Device to use (`cuda`, `cpu`, or `auto`)
`--output-format`	`text`	Output format: `text`, `markdown`, or `json`
`--max-length`	`4096`	Maximum number of tokens to generate per image
`--batch-size`	`4`	Number of images to process in parallel on GPU
`--prompt`	`Extract all text from this image.`	Prompt for image-text-to-text models
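The config examples on this page write `output_format` under `params`, which suggests that each flag in the table maps to a key with the leading dashes removed and the remaining dashes turned into underscores. The sketch below applies that convention to a few flags. The helper names are hypothetical, and the mapping for flags other than `--output-format` is an assumption inferred from that single example rather than confirmed behavior.

```python
def flag_to_key(flag: str) -> str:
    """Turn a CLI-style flag such as `--output-format` into `output_format`."""
    # Assumed convention: strip leading dashes, replace inner dashes with underscores.
    return flag.lstrip("-").replace("-", "_")


def params_from_flags(flags: dict[str, object]) -> dict[str, object]:
    """Build a `params` mapping from flag/value pairs (hypothetical helper)."""
    return {flag_to_key(flag): value for flag, value in flags.items()}


params = params_from_flags({"--output-format": "markdown", "--batch-size": 4})
print(params)
```

If the assumption holds, the resulting keys can be written under `params:` in `config.yaml`.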

Supported Input Formats

Image files (PNG, JPEG, TIFF, etc.)
PDF files (each page is rendered and processed separately)

Output Format

The output depends on `--output-format`:

`text` (default): Plain text. For multi-page inputs, pages are separated by form-feed characters (`\f`).
`markdown`: Formatted output that preserves tables, equations, and document structure as Markdown/LaTeX.
`json`: Structured JSON with per-page text: `{"pages": [{"page": 1, "text": "..."}, ...]}`.
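To illustrate how the `json` and `text` layouts relate, here is a minimal sketch that reads a document in the JSON shape described above and rebuilds the form-feed-separated text form. The function names are hypothetical and not part of the package, and the sample string is made up for illustration.

```python
import json


def pages_from_json(raw: str) -> dict[int, str]:
    """Map page number to text for a document in the JSON layout."""
    doc = json.loads(raw)
    return {entry["page"]: entry["text"] for entry in doc["pages"]}


def json_to_text(raw: str) -> str:
    """Join pages in page order with form feeds, mirroring the text layout."""
    pages = pages_from_json(raw)
    return "\f".join(pages[number] for number in sorted(pages))


# Made-up sample in the documented shape.
sample = '{"pages": [{"page": 1, "text": "first"}, {"page": 2, "text": "second"}]}'
print(pages_from_json(sample))
print(repr(json_to_text(sample)))
```

Keying pages by their `page` number rather than by list position keeps the lookup correct even if entries are not stored in order.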

Models

Any HuggingFace image-text-to-text model is supported. The GOT-OCR model is recommended for general-purpose document OCR.

Model	Params	Description	License
`stepfun-ai/GOT-OCR-2.0-hf` (default)	600M	Full-page OCR with format preservation	Apache 2.0

Tip

GOT-OCR supports plain text and formatted (markdown/LaTeX) output. Use `--output-format markdown` to preserve tables, equations, and document structure.

Examples

Extract text from a handwritten document


config.yaml

tasks:
  - name: ocr
    kind: local
    module: tigerflow_ml.text.ocr.local
    input_ext: .jpg
    output_ext: .txt  # or .md, .json
    params:
      # output_format: text       # (default) plain text
      # output_format: markdown   # formatted markdown/LaTeX
      # output_format: json       # structured JSON with pages

Input: 1820 handwritten census form

census-form.txt

The number of Persons within the Division taken by Charles C. Paine
consisting of part of Geauga County, Ohio, and also the number of
persons within the Division Allotted to Eleazer Paine consisting of
the residue of said County, appears in a schedule here unto annexed,
and by us subscribed this 3rd day of December in the year one
thousand eight hundred & twenty.
    Charles C. Paine    Assistants to the
    Eleazer Paine       Marshall of Ohio

Schedule of the whole number of Persons in the County of Geauga
...

census-form.md

The number of Persons within the Division taken by **Charles C. Paine**
consisting of part of **Geauga County, Ohio**, and also the number of
persons within the Division Allotted to **Eleazer Paine** consisting of
the residue of said County...

*Schedule of the whole number of Persons in the County of Geauga*

...

census-form.json

{
  "pages": [
    {
      "page": 1,
      "text": "The number of Persons within the Division taken by Charles C. Paine consisting of part of Geauga County, Ohio..."
    }
  ]
}

Extract text from a document with tables


config.yaml

tasks:
  - name: ocr
    kind: local
    module: tigerflow_ml.text.ocr.local
    input_ext: .png
    output_ext: .md
    params:
      output_format: markdown

Input: Statistical Abstract of the United States

abstract.txt

STATISTICAL ABSTRACT OF THE UNITED STATES

1. AREA AND POPULATION

No. 1.—TERRITORIAL EXPANSION OF CONTINENTAL UNITED STATES AND
ACQUISITIONS OF OUTLYING TERRITORIES AND POSSESSIONS

ACCESSION    Date    Gross area, square miles
Aggregate (1930)    3,738,395
Continental United States    3,026,789
Territory in 1790    892,135
Louisiana Purchase    1803    827,987
Florida    1819    58,666
...

abstract.md

# STATISTICAL ABSTRACT OF THE UNITED STATES

## 1. AREA AND POPULATION

**No. 1.—Territorial Expansion of Continental United States and
Acquisitions of Outlying Territories and Possessions**

| ACCESSION | Date | Gross area, square miles |
|---|---|---|
| Aggregate (1930) | | 3,738,395 |
| Continental United States | | 3,026,789 |
| Territory in 1790 | | 892,135 |
| Louisiana Purchase | 1803 | 827,987 |
| Florida | 1819 | 58,666 |
| By treaty with Spain | 1819 | 13,435 |
| Texas | 1845 | 393,196 |
| Oregon | 1846 | 286,541 |
| Mexican Cession | 1848 | 529,189 |
| Gadsden Purchase | 1853 | 29,670 |
...

abstract.json

{
  "pages": [
    {
      "page": 1,
      "text": "STATISTICAL ABSTRACT OF THE UNITED STATES\n\n1. AREA AND POPULATION\n\nNo. 1.—TERRITORIAL EXPANSION OF CONTINENTAL UNITED STATES..."
    }
  ]
}
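Once a table has been preserved as markdown, as in `abstract.md`, it can be turned back into structured rows. The following is a rough sketch, assuming the table uses the simple pipe layout shown in that output with no escaped `|` characters inside cells. `parse_markdown_table` is a hypothetical helper, not part of the package.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Return the rows of the first pipe table found in `markdown`."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if line.startswith("|") and line.endswith("|"):
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
        elif rows:
            break  # the first table has ended
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    return [dict(zip(header, row)) for row in body]


# Two rows copied from the markdown output shown in this example.
sample = """\
| ACCESSION | Date | Gross area, square miles |
|---|---|---|
| Aggregate (1930) | | 3,738,395 |
| Louisiana Purchase | 1803 | 827,987 |
"""
records = parse_markdown_table(sample)
print(records)
```

Cells are kept as strings, so a value such as `3,738,395` still needs its thousands separators removed before it can be converted to a number.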

Extract text from a multi-page PDF


config.yaml

tasks:
  - name: ocr
    kind: local
    module: tigerflow_ml.text.ocr.local
    input_ext: .pdf
    output_ext: .txt

Input: a multi-page PDF document, e.g. 2602.15607v1.pdf.

2602.15607v1.txt

[Page 1 text...]
␌
[Page 2 text...]
␌
...

Pages are separated by a form-feed character (`\f`, shown here as ␌).
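A short sketch of splitting such a `.txt` file back into individual pages. `split_pages` is a hypothetical helper. Whether a form feed also follows the final page is not stated here, so the sketch tolerates both cases by dropping an empty trailing entry.

```python
def split_pages(text: str) -> list[str]:
    """Split plain-text OCR output into pages on form-feed characters."""
    pages = text.split("\f")
    # Assumption: a form feed after the last page would leave an empty
    # trailing entry, so drop it if present.
    if pages and pages[-1].strip() == "":
        pages.pop()
    # Trim the newlines that surround each separator.
    return [page.strip("\n") for page in pages]


pages = split_pages("[Page 1 text...]\n\f\n[Page 2 text...]\n\f")
print(len(pages), pages)
```

In practice the input string would come from reading the output file, for example with `pathlib.Path(...).read_text()`.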

2602.15607v1.md

[Page 1 formatted text...]
␌
[Page 2 formatted text...]
␌
...

2602.15607v1.json

{
  "pages": [
    {
      "page": 1,
      "text": "..."
    },
    {
      "page": 2,
      "text": "..."
    }
  ]
}

Run on HPC with Slurm

For bulk OCR across large document collections, use the Slurm variant to distribute work across compute nodes:

config.yaml

tasks:
  - name: ocr
    kind: slurm
    module: tigerflow_ml.text.ocr.slurm
    input_ext: .pdf
    output_ext: .txt
    max_workers: 4
    worker_resources:
      cpus: 2
      gpus: 1
      memory: 16G
      time: 04:00:00
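When sizing a request like this, it can help to estimate the largest GPU allocation it could consume. The sketch below does that arithmetic under stated assumptions: that `max_workers` caps the number of concurrent workers, that `time` is a per-worker limit in `HH:MM:SS` form, and that each worker is submitted only once. None of these are confirmed here, and both function names are hypothetical.

```python
def parse_walltime(value: str) -> float:
    """Convert an HH:MM:SS time limit into hours."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours + minutes / 60 + seconds / 3600


def max_gpu_hours(max_workers: int, gpus: int, time: str) -> float:
    """Upper bound on GPU-hours if every worker runs for its full time limit."""
    return max_workers * gpus * parse_walltime(time)


# Values taken from the config above: 4 workers, 1 GPU each, 4-hour limit.
print(max_gpu_hours(max_workers=4, gpus=1, time="04:00:00"))
```

For the values in this config the bound works out to 4 × 1 × 4 = 16 GPU-hours. Actual usage would be lower if workers finish early.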