Skip to content

OCR

Extract text from images and PDFs using HuggingFace image-text-to-text models.

Parameters

Parameter Default Description
--model stepfun-ai/GOT-OCR-2.0-hf HuggingFace model repo ID
--revision main Model revision (branch, tag, or commit hash)
--cache-dir HuggingFace cache directory for model files
--device auto Device to use (cuda, cpu, or auto)
--output-format text Output format: text, markdown, or json
--max-length 4096 Maximum number of tokens to generate per image
--batch-size 4 Number of images to process in parallel on GPU
--prompt Extract all text from this image. Prompt for image-text-to-text models

Supported Input Formats

  • Image files (PNG, JPEG, TIFF, etc.)
  • PDF files (each page is rendered and processed separately)

Output Format

Depends on --output-format:

  • text (default) — Plain text. For multi-page inputs, pages are separated by form-feed characters (\f).
  • markdown — Formatted output preserving tables, equations, and document structure as markdown/LaTeX.
  • json — Structured JSON with per-page text: {"pages": [{"page": 1, "text": "..."}, ...]}.

Models

Any HuggingFace image-text-to-text model is supported. The GOT-OCR model is recommended for general-purpose document OCR.

Model Params Description License
stepfun-ai/GOT-OCR-2.0-hf (default) 600M Full-page OCR with format preservation Apache 2.0

Tip

GOT-OCR supports plain text and formatted (markdown/LaTeX) output. Use --output-format markdown to preserve tables, equations, and document structure.

Examples

Extract text from a handwritten document

config.yaml
tasks:
  - name: ocr
    kind: local
    module: tigerflow_ml.text.ocr.local
    input_ext: .jpg
    output_ext: .txt  # or .md, .json
    params:
      # output_format: text       # (default) plain text
      # output_format: markdown   # formatted markdown/LaTeX
      # output_format: json       # structured JSON with pages

1820 handwritten census form

census-form.txt
The number of Persons within the Division taken by Charles C. Paine
consisting of part of Geauga County, Ohio, and also the number of
persons within the Division Allotted to Eleazer Paine consisting of
the residue of said County, appears in a schedule here unto annexed,
and by us subscribed this 3rd day of December in the year one
thousand eight hundred & twenty.
    Charles C. Paine    Assistants to the
    Eleazer Paine       Marshall of Ohio

Schedule of the whole number of Persons in the County of Geauga
...
census-form.md
The number of Persons within the Division taken by **Charles C. Paine**
consisting of part of **Geauga County, Ohio**, and also the number of
persons within the Division Allotted to **Eleazer Paine** consisting of
the residue of said County...

*Schedule of the whole number of Persons in the County of Geauga*

...
census-form.json
{
  "pages": [
    {
      "page": 1,
      "text": "The number of Persons within the Division taken by Charles C. Paine consisting of part of Geauga County, Ohio..."
    }
  ]
}

Extract text from a document with tables

config.yaml
tasks:
  - name: ocr
    kind: local
    module: tigerflow_ml.text.ocr.local
    input_ext: .png
    output_ext: .md
    params:
      output_format: markdown

Statistical Abstract of the United States

abstract.txt
STATISTICAL ABSTRACT OF THE UNITED STATES

1. AREA AND POPULATION

No. 1.—TERRITORIAL EXPANSION OF CONTINENTAL UNITED STATES AND
ACQUISITIONS OF OUTLYING TERRITORIES AND POSSESSIONS

ACCESSION    Date    Gross area, square miles
Aggregate (1930)    3,738,395
Continental United States    3,026,789
Territory in 1790    892,135
Louisiana Purchase    1803    827,987
Florida    1819    58,666
...
abstract.md
# STATISTICAL ABSTRACT OF THE UNITED STATES

## 1. AREA AND POPULATION

**No. 1.—Territorial Expansion of Continental United States and
Acquisitions of Outlying Territories and Possessions**

| ACCESSION | Date | Gross area, square miles |
|---|---|---|
| Aggregate (1930) | | 3,738,395 |
| Continental United States | | 3,026,789 |
| Territory in 1790 | | 892,135 |
| Louisiana Purchase | 1803 | 827,987 |
| Florida | 1819 | 58,666 |
| By treaty with Spain | 1819 | 13,435 |
| Texas | 1845 | 393,196 |
| Oregon | 1846 | 286,541 |
| Mexican Cession | 1848 | 529,189 |
| Gadsden Purchase | 1853 | 29,670 |
...
abstract.json
{
  "pages": [
    {
      "page": 1,
      "text": "STATISTICAL ABSTRACT OF THE UNITED STATES\n\n1. AREA AND POPULATION\n\nNo. 1.—TERRITORIAL EXPANSION OF CONTINENTAL UNITED STATES..."
    }
  ]
}

Extract text from a multi-page PDF

config.yaml
tasks:
  - name: ocr
    kind: local
    module: tigerflow_ml.text.ocr.local
    input_ext: .pdf
    output_ext: .txt

A multi-page PDF document, e.g. 2602.15607v1.pdf.

2602.15607v1.txt
[Page 1 text...]
␌
[Page 2 text...]
␌
...

Each page is separated by a form-feed character (\f, shown as ).

2602.15607v1.md
[Page 1 formatted text...]
␌
[Page 2 formatted text...]
␌
...
2602.15607v1.json
{
  "pages": [
    {
      "page": 1,
      "text": "..."
    },
    {
      "page": 2,
      "text": "..."
    }
  ]
}

Run on HPC with Slurm

For bulk OCR across large document collections, use the Slurm variant to distribute work across compute nodes:

config.yaml
tasks:
  - name: ocr
    kind: slurm
    module: tigerflow_ml.text.ocr.slurm
    input_ext: .pdf
    output_ext: .txt
    max_workers: 4
    worker_resources:
      cpus: 2
      gpus: 1
      memory: 16G
      time: 04:00:00