Object Detection¶

Detect and locate objects in images and videos using HuggingFace detection models.

Supports both fixed-class models (e.g. RT-DETR) and open-vocabulary models (e.g. Grounding DINO). The pipeline type is resolved automatically from the model.

Parameters¶

Parameter	Default	Description
`--model`	`PekingU/rtdetr_r50vd`	HuggingFace model repo ID
`--revision`	`main`	Model revision (branch, tag, or commit hash)
`--cache-dir`		HuggingFace cache directory for model files
`--device`	`auto`	Device to use (`cuda`, `cpu`, or `auto`)
`--labels`		Comma-separated labels for zero-shot detection (e.g. `cat,dog,person`)
`--threshold`	`0.3`	Minimum confidence score for detections
`--batch-size`	`4`	Number of video frames to process in parallel on GPU
`--sample-fps`	`1.0`	Frames per second to sample from video (0 = every frame)

Supported Input Formats¶

Image files (JPEG, PNG, TIFF, etc.)
Video files (MP4, AVI, MOV, MKV, WebM, FLV, WMV)

Output Format¶

JSON. The output structure depends on the input type.

Images produce a flat array of detections:

[
  {"label": "cat", "score": 0.96, "box": {"xmin": 343, "ymin": 24, "xmax": 640, "ymax": 371}},
  {"label": "remote", "score": 0.95, "box": {"xmin": 40, "ymin": 73, "xmax": 175, "ymax": 118}}
]

Videos produce a frame-indexed array:

[
  {
    "frame": 0,
    "timestamp": 0.0,
    "detections": [
      {"label": "person", "score": 0.92, "box": {"xmin": 100, "ymin": 50, "xmax": 300, "ymax": 400}}
    ]
  },
  {
    "frame": 30,
    "timestamp": 1.0,
    "detections": []
  }
]

Models¶

Any HuggingFace object-detection or zero-shot-object-detection model is supported.

Fixed-class (COCO 80 classes)¶

These models detect a fixed set of object categories without needing --labels.

Model	Params	COCO AP	License
`PekingU/rtdetr_r50vd` (default)	42M	53.1	Apache 2.0
`facebook/detr-resnet-50`	41M	42.0	Apache 2.0

Zero-shot (open vocabulary)¶

These models detect any object described by text. Requires --labels.

Model	Params	License
`IDEA-Research/grounding-dino-tiny`	172M	Apache 2.0
`IDEA-Research/grounding-dino-base`	341M	Apache 2.0
`google/owlv2-base-patch16-ensemble`	200M	Apache 2.0

Examples¶

Detect objects in images¶

Uses the default RT-DETR model which recognizes 80 common object categories (COCO classes).

ConfigInputOutput

config.yaml

tasks:
  - name: detect
    kind: local
    module: tigerflow_ml.image.detect.local
    input_ext: .jpg
    output_ext: .json

Input image

Annotated output

photo.json

[
  {"label": "sofa", "score": 0.97, "box": {"xmin": 0, "ymin": 0, "xmax": 640, "ymax": 476}},
  {"label": "cat", "score": 0.96, "box": {"xmin": 343, "ymin": 24, "xmax": 640, "ymax": 371}},
  {"label": "cat", "score": 0.96, "box": {"xmin": 13, "ymin": 54, "xmax": 318, "ymax": 472}},
  {"label": "remote", "score": 0.95, "box": {"xmin": 40, "ymin": 73, "xmax": 175, "ymax": 118}},
  {"label": "remote", "score": 0.92, "box": {"xmin": 333, "ymin": 76, "xmax": 369, "ymax": 186}}
]

Detect custom objects with zero-shot¶

Use an open-vocabulary model to detect arbitrary objects described by text labels.

Note

Zero-shot models require the --labels parameter. The model will search for objects matching the provided text descriptions.

ConfigInputOutput

config.yaml

tasks:
  - name: detect
    kind: local
    module: tigerflow_ml.image.detect.local
    input_ext: .jpg
    output_ext: .json
    params:
      model: IDEA-Research/grounding-dino-base
      labels: "solar panel,wind turbine,power line"
      threshold: 0.2

A satellite or aerial image, e.g. site_survey.jpg.

site_survey.json

[
  {
    "label": "solar panel",
    "score": 0.84,
    "box": {"xmin": 120, "ymin": 200, "xmax": 350, "ymax": 310}
  },
  {
    "label": "power line",
    "score": 0.71,
    "box": {"xmin": 0, "ymin": 50, "xmax": 640, "ymax": 65}
  }
]

Detect objects in video¶

The task automatically extracts frames from video at the specified sample rate and runs detection on each frame.

ConfigInputOutput

config.yaml

tasks:
  - name: detect
    kind: local
    module: tigerflow_ml.image.detect.local
    input_ext: .mp4
    output_ext: .json
    params:
      sample_fps: 2.0
      batch_size: 8

A video file, e.g. traffic.mp4 (30 fps, 10 seconds).

traffic.json

[
  {
    "frame": 0,
    "timestamp": 0.0,
    "detections": [
      {"label": "car", "score": 0.94, "box": {"xmin": 100, "ymin": 200, "xmax": 300, "ymax": 350}},
      {"label": "person", "score": 0.91, "box": {"xmin": 400, "ymin": 150, "xmax": 450, "ymax": 320}}
    ]
  },
  {
    "frame": 15,
    "timestamp": 0.5,
    "detections": [
      {"label": "car", "score": 0.92, "box": {"xmin": 150, "ymin": 200, "xmax": 350, "ymax": 350}}
    ]
  }
]

At sample_fps: 2.0, the task samples 2 frames per second from the video (20 frames total for a 10-second clip). Increase --batch-size to process more frames in parallel on the GPU.

Run on HPC with Slurm¶

For bulk detection across large image or video collections, use the Slurm variant to distribute work across compute nodes:

config.yaml

tasks:
  - name: detect
    kind: slurm
    module: tigerflow_ml.image.detect.slurm
    input_ext: .mp4
    output_ext: .json
    max_workers: 4
    worker_resources:
      cpus: 2
      gpus: 1
      memory: 16G
      time: 04:00:00
    params:
      sample_fps: 1.0
      batch_size: 8