Live Video Captioning

Live Video Captioning provides AI-powered captioning of live video streams using Deep Learning Streamer (DL Streamer) and OpenVINO™ vision language models (VLMs). You can process RTSP streams, generate real-time captions, and monitor performance metrics on a dashboard.

The key features are:

Multi-Model Support: Switch between VLMs (InternVL2, Gemma-3, etc.) with automatic model discovery from ov_models/.
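How the application discovers models is not detailed here; the following is a minimal sketch of one plausible approach, assuming each model lives in its own subdirectory of ov_models/ and is stored as an OpenVINO IR (an .xml file with a matching .bin). The directory layout and function name are illustrative, not the application's actual API.

```python
from pathlib import Path

def discover_models(models_root: str = "ov_models") -> list[str]:
    """Return the names of subdirectories that contain an OpenVINO IR (.xml).

    Assumes a one-subdirectory-per-model layout under models_root; the
    application's own discovery logic may differ.
    """
    root = Path(models_root)
    if not root.is_dir():
        return []
    return sorted(
        d.name
        for d in root.iterdir()
        if d.is_dir() and any(d.glob("**/*.xml"))
    )
```

A dropdown in the frontend could then be populated directly from this list, so adding a model is just a matter of converting it and placing it under ov_models/.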

Real-time Streaming: Low-latency preview video delivery over WebRTC.

Performance Metrics: Live charts for CPU/GPU/RAM and inference metrics such as TTFT, TPOT, and throughput.
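For reference, TTFT (time to first token) measures how long the VLM takes to emit its first caption token after a request, TPOT (time per output token) is the average gap between subsequent tokens, and throughput is tokens generated per second. A small sketch of how these can be derived from per-token timestamps (all names here are illustrative, not the application's internal API):

```python
def inference_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute TTFT, TPOT, and throughput from per-token timestamps.

    request_start: moment the inference request was issued (seconds).
    token_times: monotonically increasing completion time of each token.
    """
    if not token_times:
        return {"ttft": None, "tpot": None, "throughput": 0.0}
    n = len(token_times)
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start
    # TPOT averages the gaps between consecutive tokens, excluding TTFT.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    throughput = n / total if total > 0 else 0.0
    return {"ttft": ttft, "tpot": tpot, "throughput": throughput}
```

Comparing these three numbers across models (together with CPU/GPU/RAM charts) is usually enough to decide which VLM fits a given latency budget.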

Modular Architecture: Containerized services with clearly separated backend, frontend, and pipeline configuration.

Alert Mode: Optional alert styling for binary classification prompts (“Yes”/“No”).
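The exact parsing the application applies to binary answers is not specified here; a minimal sketch of one way to map a free-form VLM answer to an alert state, tolerating case and trailing punctuation (the function name and logic are assumptions for illustration):

```python
from typing import Optional

def is_alert(caption: str) -> Optional[bool]:
    """Map a VLM answer to an alert state for a yes/no prompt.

    Returns True for "Yes", False for "No", and None when the answer
    is neither (illustrative; the application's parsing may differ).
    """
    word = caption.strip().strip(".!,").lower()
    if word.startswith("yes"):
        return True
    if word.startswith("no"):
        return False
    return None
```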

Object Detection Support: Optionally add a YOLO-based detection model to the pipeline to enable object detection and frame filtering.
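As a rough sketch of where detection slots into a DL Streamer pipeline, the following gst-launch line runs the standard gvadetect element on a decoded RTSP stream and overlays results with gvawatermark. The RTSP URL and model path are placeholders, and the sample's actual pipeline configuration may use additional elements and properties.

```shell
# Placeholder URL and model path; adjust device= for CPU/GPU/NPU as available.
gst-launch-1.0 \
  rtspsrc location=rtsp://<camera-ip>/stream ! decodebin ! videoconvert ! \
  gvadetect model=ov_models/yolo/FP16/yolo.xml device=CPU ! \
  gvawatermark ! videoconvert ! autovideosink
```

Filtering frames on detection results (for example, only sending frames containing a person to the VLM) keeps captioning load proportional to interesting activity rather than raw frame rate.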

Use Cases

Real-time Video Analytics: Monitor security cameras, industrial equipment, or public spaces with AI-powered scene understanding and automatic captioning.

Accessibility Enhancement: Generate live captions for video content, making streams accessible to users with hearing impairments.

Performance Benchmarking: Evaluate VLM performance on Intel® hardware by comparing throughput, latency, and resource utilization across different models and pipeline configurations.

Intelligent Surveillance: Deploy custom prompts (for example, “Is there a person in the frame?”) for security and safety monitoring workflows.