Live Video Captioning RAG
The Live Video Captioning RAG sample application uses the Retrieval-Augmented Generation (RAG) technique to transform live video captions into a knowledge base. The sample application ingests captions from the Live Video Captioning sample application, generates semantic embeddings, and uses LLMs optimized through the OpenVINO™ toolkit to deliver AI-powered chatbot responses grounded in the video context. It builds searchable caption embeddings and lets users interact with the video content through natural language queries.
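The retrieval flow described above (embed captions, store them, search by query, ground the prompt) can be sketched in a few lines. This is a minimal illustration only, not the sample application's implementation: the bag-of-words `embed` function stands in for a real embedding model, the in-memory `CaptionStore` stands in for a vector database, and all names (`embed`, `CaptionStore`, `build_prompt`, the captions, and the frame IDs) are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumption standing in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CaptionStore:
    """In-memory stand-in for a vector database (hypothetical helper)."""

    def __init__(self):
        self._items = []  # list of (vector, caption, frame_id)

    def add(self, caption, frame_id):
        self._items.append((embed(caption), caption, frame_id))

    def search(self, query, k=2):
        """Return the k captions most similar to the query, best first."""
        q = embed(query)
        ranked = sorted(self._items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [(caption, frame_id) for _, caption, frame_id in ranked[:k]]


def build_prompt(question, hits):
    """Assemble an LLM prompt grounded in the retrieved captions."""
    context = "\n".join(f"[frame {fid}] {cap}" for cap, fid in hits)
    return f"Answer using only these captions:\n{context}\n\nQuestion: {question}"


# Made-up captions and frame IDs, for illustration only.
store = CaptionStore()
store.add("a red car enters the parking lot", 101)
store.add("two people walk past the entrance", 102)
store.add("a delivery truck stops near the gate", 103)

hits = store.search("which car enters the parking lot", k=1)
prompt = build_prompt("Which vehicle entered the parking lot?", hits)
print(prompt)
```

In a real deployment the prompt would be passed to an LLM, and the similarity search would be delegated to the vector database rather than a Python list.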
Key Features
RAG-based Video Context: Converts caption text from video frames into embeddings and stores them in a vector database for semantic search and retrieval.
OpenVINO toolkit-LLM Integration: Deploys large language models efficiently on Intel® hardware for context-aware response generation.
Interactive Chat Interface: Web-based dashboard for querying video content with streaming responses and an inline preview of retrieved frames and captions.
Multi-Model Support: Configurable embedding models and LLM models with flexible model switching for different use cases and performance requirements.
Multi-Device Support: CPU and GPU device options for embedding generation and LLM inference, optimized for Intel® platforms.
REST API Endpoints: Programmatic access to embedding ingestion (/api/embeddings) and chat queries (/api/chat) for integration with external systems.
Streaming Responses: Real-time chat responses with full caption context and visual frame references for enhanced user understanding.
Deployment through the Docker Compose tool: Containerized stack for simplified setup and deployment across different environments.
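As a rough illustration of how an external system might call the two REST endpoints listed above, the sketch below builds JSON POST requests with the Python standard library. Only the paths /api/embeddings and /api/chat come from this page. The base URL, the use of POST with a JSON body, and every payload field (`caption`, `frame_id`, `question`) are assumptions made for the example, so check the application's API reference for the actual request schema.

```python
import json
import urllib.request

# Assumption: host and port are placeholders; use the address of your deployment.
BASE_URL = "http://localhost:8000"


def build_request(path, payload):
    """Build a JSON POST request for one of the application's endpoints."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Hypothetical payloads: the field names are illustrative, not the documented schema.
ingest_req = build_request(
    "/api/embeddings",
    {"caption": "a red car enters the parking lot", "frame_id": 101},
)
chat_req = build_request(
    "/api/chat",
    {"question": "Which vehicle entered the parking lot?"},
)

# With the stack running, the requests could be sent like this:
# print(send(ingest_req))
# print(send(chat_req))
```

Because chat responses are streamed, a real client would likely read the response incrementally instead of calling `resp.read()` once; the single read here only keeps the sketch short.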
Use Cases
Video Content Search and Discovery: Build searchable knowledge bases from surveillance, educational, or archival videos to find relevant scenes (or frames) and information quickly using natural language queries.
Real-time Video Analytics with Q&A: Monitor live video feeds with the ability to ask questions about the video content and receive answers grounded in actual video captions and context.
Accessibility and Content Understanding: Generate and query video captions to make the video content more accessible, and enable users to understand the video content without watching the full stream.
Intelligent Security and Safety: Deploy RAG-backed chatbots for security monitoring workflows to answer questions about events, activities, and anomalies detected in surveillance video streams.