# Order Accuracy: How It Works ## Table of Contents 1. [System Overview](#system-overview) 2. [Architecture Diagrams](#architecture-diagrams) 3. [Component Details](#component-details) 4. [Data Flow](#data-flow) 5. [Production Features](#production-features) ## System Overview The Order Accuracy platform is an enterprise AI vision system designed for real-time order validation in quick-service restaurant (QSR) environments. The system uses Vision Language Models (VLM) to analyze images or video feeds, automatically identifying items and validating them against order data. ### Key Features - **VLM-Powered Detection**: Uses Qwen2.5-VL-7B for accurate item identification - **Intel Hardware Optimization**: Optimized for Intel CPUs and GPUs via OpenVINO - **Dual Application Support**: Dine-In (image-based) and Take-Away (video stream-based) - **Semantic Matching**: Fuzzy matching for item name variations - **Real-time Processing**: Sub-15-second validation for operational efficiency - **Containerized Deployment**: Docker-based deployment with microservices architecture ## Architecture Diagrams ### Platform Architecture ```mermaid graph TB subgraph "Order Accuracy Platform" subgraph "Dine-In Application" DUI[Gradio UI :7861] DAPI[FastAPI API :8083] DVLM[VLM Client] DSEM[Semantic Client] end subgraph "Take-Away Application" TUI[Gradio UI :7860] TAPI[FastAPI API :8080] TSW[Station Workers] TVS[VLM Scheduler] TFS[Frame Selector] end subgraph "Shared Services" OVMS[OVMS VLM
Qwen2.5-VL-7B] SEM[Semantic Service] MINIO[MinIO Storage] end end DUI --> DAPI DAPI --> DVLM DAPI --> DSEM DVLM --> OVMS DSEM --> SEM TUI --> TAPI TAPI --> TSW TSW --> TFS TSW --> TVS TVS --> OVMS TFS --> MINIO ``` ### Dine-In Architecture ```mermaid flowchart TB subgraph DINEIN["Dine-In Order Accuracy"] DUI["Gradio UI
(Port 7861)"] --> DAPI["FastAPI API
(Port 8083)"] DAPI --> DVLM["VLM Client
(Circuit Breaker)"] DAPI --> DSEM["Semantic Client
(Circuit Breaker)"] DAPI --> DVS["Validation Service"] DVS --> DMET["Metrics Collector"] end DVLM --> OVMS["OVMS VLM
(Qwen2.5-VL)"] DSEM --> SEM["Semantic Service"] ``` ### Take-Away Architecture ```mermaid flowchart TB subgraph SYS["Take-Away Order Accuracy"] direction LR rtsp["RTSP Video Streams
(GStreamer)"] --> oas["Order Accuracy Service
(EasyOCR)"] oas --> minio["MinIO
(Frame Storage)"] minio --> selector["Frame Selector
(YOLO11n-CPU)"] selector -->|top 3 frames| scheduler["VLM Scheduler
(ThreadPool)"] selector -->|top 3 frames| validation["Validation Agent"] scheduler --> ovms["OVMS VLM
(Qwen2.5-VL, GPU-INT8)"] validation --> semantic["Semantic Service"] end ovms --> ui["Gradio UI
(Interface)"] semantic --> ui ``` ## Component Details ### Core Components #### 1. VLM Backend (OVMS) OpenVINO Model Server hosting Qwen2.5-VL-7B for vision-language inference. **Features:** - OpenAI-compatible API (`/v3/chat/completions`) - INT8 quantization for optimized performance - GPU acceleration via Intel/NVIDIA hardware - Shared model instance for both applications **API Usage:** ```python response = requests.post( f"{OVMS_ENDPOINT}/v3/chat/completions", json={ "model": "Qwen/Qwen2.5-VL-7B-Instruct", "messages": [ { "role": "user", "content": [ {"type": "text", "text": prompt}, {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}} ] } ] } ) ``` #### 2. Semantic Comparison Service AI-powered semantic matching microservice for intelligent item comparison. **Matching Strategies:** - **Exact**: Direct string comparison - **Semantic**: Vector similarity using sentence-transformers - **Hybrid**: Exact first, then semantic fallback **Example Matches:** - "Big Mac" ↔ "Maharaja Mac" (regional name variant) - "green apple" ↔ "apple" (partial match) - "large fries" ↔ "french fries large" (word reordering) #### 3. Frame Selector Service (Take-Away) YOLO-based intelligent frame selection for optimal VLM input. **Process:** 1. Receive raw video frames from GStreamer pipeline 2. Run YOLO object detection on each frame 3. Score frames by item visibility and clarity 4. Select top K frames per order 5. Store selected frames in MinIO #### 4. VLM Scheduler (Take-Away) Request batching scheduler optimizing OVMS throughput. **Batching Strategy:** - Time Window: 50-100ms collection period - Max Batch Size: Configurable (default: 16) - Fair Scheduling: Round-robin across workers - Response Routing: Match responses to original requesters ### Docker Services #### Dine-In Services | Container | Ports | Description | | ------------------------- | ---------- | ----------------------------------- | | `dinein_app` | 7861, 8083 | Main application (Gradio + FastAPI) | | `dinein_ovms_vlm` | 8002 | Vision-Language Model server | | `dinein_semantic_service` | 8081 | Semantic text matching | | `metrics-collector` | 8084 | System metrics aggregation | #### Take-Away Services | Container | Ports | Description | | ------------------ | ---------- | ----------------------------------- | | `takeaway_app` | 7860, 8080 | Main application (Gradio + FastAPI) | | `ovms-vlm` | 8001 | Vision-Language Model server | | `frame-selector` | 8085 | YOLO-based frame selection | | `semantic-service` | 8081 | Semantic text matching | | `minio` | 9000, 9001 | S3-compatible storage | | `rtsp-streamer` | 8554 | RTSP stream simulator (testing) | ## Data Flow ### Dine-In Validation Pipeline 1. **Image Processing**: Raw Image → Auto-Orient → Resize (672px) → Enhance → Sharpen → JPEG Compress (82%) → Base64 Encode 2. **VLM Inference**: Prompt: "Analyze this food plate image..." + Inventory list for context → OVMS POST `/v3/chat/completions` → Parse JSON response for detected items 3. **Semantic Matching**: For each expected item: - Find best match in detected items (similarity > 0.7) - Track: matched, missing, extra, quantity mismatches 4. **Result Aggregation**: ```text { "order_complete": true/false, "accuracy_score": 0.0-1.0, "missing_items": [...], "extra_items": [...], "metrics": { "latency": [...], "tps": [...], "utilization": [...] } } ``` ### Take-Away Processing Pipeline 1. **Video Capture**: RTSP Camera → GStreamer Pipeline → Frame Buffer 2. **Frame Selection**: - Frame Selector (YOLO): - Object detection on raw frames - Score frames by item visibility - Select top K frames per order - Store selected frames in MinIO 3. **VLM Processing**: - VLM Scheduler → OVMS (Qwen2.5-VL): - Batch frames by time window - Send to OVMS with detection prompt - Parse structured item response 4. **Order Validation**: - Validation Agent: - Compare detected items with expected order - Exact match → Semantic match → Flag mismatch - Generate validation result 5. **Result Output**: - { "matched": [...], "missing": [...], "extra": [...] } ## Production Features ### Circuit Breaker Pattern Prevents cascading failures when external services are unhealthy. ```mermaid flowchart LR CLOSED["CLOSED"] OPEN["OPEN"] HALFOPEN["HALF-OPEN"] CLOSED -- "5 consecutive failures" --> OPEN OPEN -- "30s timeout" --> HALFOPEN HALFOPEN -- "2 successes" --> CLOSED HALFOPEN -- "1 failure" --> OPEN ``` **Configuration:** - VLM Client: 5 failures → OPEN, 30s recovery → HALF-OPEN - Semantic Client: 15s recovery timeout (faster than VLM) ### Connection Pooling ```python # VLM Client Pool Configuration limits = httpx.Limits( max_keepalive_connections=20, max_connections=50, keepalive_expiry=30.0 ) timeout = httpx.Timeout( connect=10.0, read=300.0, # Extended for VLM inference write=10.0, pool=10.0 ) ``` ### Bounded Cache (LRU) Thread-safe LRU cache with automatic eviction to prevent memory exhaustion: - Maximum 10,000 entries - Automatic eviction of oldest entries when full - Thread-safe operations with locking ### Station Worker Reliability (Take-Away) | Feature | Implementation | | ------------------- | ------------------------------------ | | GStreamer Pipeline | RTSP → H.264 decode → Frame capture | | Circuit Breaker | 5 failures in 5 min → 30s cooldown | | Exponential Backoff | 2s → 4s → 8s → ... → 60s max | | Stall Detection | No frames for 5 min triggers restart | | Health Monitoring | Frame rate, pipeline state tracking | ### Performance Characteristics | Metric | Dine-In | Take-Away | | ---------------------- | ------------ | --------------------------- | | **End-to-End Latency** | 8-15 seconds | Real-time stream | | **VLM Inference** | 5-10 seconds | 5-10 seconds (batched) | | **Semantic Matching** | 50-200ms | 50-200ms | | **Throughput** | ~4-6 req/min | Multiple concurrent streams | | **GPU Utilization** | 60-80% | 70-90% (parallel mode) |