Live Video Captioning RAG#

Live Video Captioning RAG sample application uses the Retrieval-Augmentation Generation technique, which transforms live video captions into a knowledge base. The sample application ingests captions from the Live Video Captioning sample application, generates semantic embeddings, and uses LLMs optimized through the OpenVINO™ toolkit to deliver AI-powered chatbot responses grounded in the video context. The sample application builds searchable caption embeddings and interacts with the video content through natural language queries.

Key Features#

  • RAG-based Video Context: Converts caption text from video frames into embeddings and store them in a vector database for semantic search and retrieval.

  • OpenVINO toolkit-LLM Integration: Deploys large language models efficiently on Intel® hardware for context-aware response generation.

  • Interactive Chat Interface: Web-based dashboard for querying video content with streaming responses and an inline preview of retrieved frames and captions.

  • Multi-Model Support: Configurable embedding models and LLM models with flexible model switching for different use cases and performance requirements.

  • Multi-Device Support: CPU and GPU device options for embedding generation and LLM inference, optimized for Intel® platforms.

  • REST API Endpoints: Programmatic access to embedding ingestion (/api/embeddings) and chat queries (/api/chat) for integration with external systems.

  • Streaming Responses: Real-time chat responses with full caption context and visual frame references for enhanced user understanding.

  • Deployment through Docker Compose tool: Containerized stack for simplified setup and deployment across different environments.

Use Cases#

  • Video Content Search and Discovery: Build searchable knowledge bases from surveillance, educational, or archival videos to find relevant scenes (or frames) and information quickly using natural language queries.

  • Real-time Video Analytics with Q&A: Monitor live video feeds with the ability to ask questions about the video content and receive answers grounded in actual video captions and context.

  • Accessibility and Content Understanding: Generate and query video captions to make the video content more accessible, and enable users to understand the video content without watching the full stream.

  • Intelligent Security and Safety: Deploy RAG-backed chatbots for security monitoring workflows to answer questions about events, activities, and anomalies detected in surveillance video streams.