Order Accuracy: How It Works#

Table of Contents#

System Overview
Architecture Diagrams
Component Details
Data Flow
Production Features

System Overview#

The Order Accuracy platform is an enterprise AI vision system for real-time order validation in quick-service restaurant (QSR) environments. It uses a Vision Language Model (VLM) to analyze images or video feeds, identify the items present, and validate them against the order data.

Key Features#

VLM-Powered Detection: Uses Qwen2.5-VL-7B for accurate item identification
Intel Hardware Optimization: Optimized for Intel CPUs and GPUs via OpenVINO
Dual Application Support: Dine-In (image-based) and Take-Away (video stream-based)
Semantic Matching: Fuzzy matching for item name variations
Real-time Processing: Validation typically completes within 15 seconds (8-15 seconds end to end for Dine-In)
Containerized Deployment: Docker-based deployment with microservices architecture

Architecture Diagrams#

Platform Architecture#

    graph TB
    subgraph "Order Accuracy Platform"
        subgraph "Dine-In Application"
            DUI[Gradio UI :7861]
            DAPI[FastAPI API :8083]
            DVLM[VLM Client]
            DSEM[Semantic Client]
        end

        subgraph "Take-Away Application"
            TUI[Gradio UI :7860]
            TAPI[FastAPI API :8080]
            TSW[Station Workers]
            TVS[VLM Scheduler]
            TFS[Frame Selector]
        end

        subgraph "Shared Services"
            OVMS[OVMS VLM<br>Qwen2.5-VL-7B]
            SEM[Semantic Service]
            MINIO[MinIO Storage]
        end
    end

    DUI --> DAPI
    DAPI --> DVLM
    DAPI --> DSEM
    DVLM --> OVMS
    DSEM --> SEM

    TUI --> TAPI
    TAPI --> TSW
    TSW --> TFS
    TSW --> TVS
    TVS --> OVMS
    TFS --> MINIO

Dine-In Architecture#

    flowchart TB
    subgraph DINEIN["Dine-In Order Accuracy"]
        DUI["Gradio UI<br/>(Port 7861)"] --> DAPI["FastAPI API<br/>(Port 8083)"]
        DAPI --> DVLM["VLM Client<br/>(Circuit Breaker)"]
        DAPI --> DSEM["Semantic Client<br/>(Circuit Breaker)"]
        DAPI --> DVS["Validation Service"]
        DVS --> DMET["Metrics Collector"]
    end

    DVLM --> OVMS["OVMS VLM<br/>(Qwen2.5-VL)"]
    DSEM --> SEM["Semantic Service"]

Take-Away Architecture#

  flowchart TB
  subgraph SYS["Take-Away Order Accuracy"]
    direction LR
    rtsp["RTSP Video Streams<br/>(GStreamer)"] --> oas["Order Accuracy Service<br/>(EasyOCR)"]
    oas --> minio["MinIO<br/>(Frame Storage)"]
    minio --> selector["Frame Selector<br/>(YOLO11n-CPU)"]
    selector -->|top 3 frames| scheduler["VLM Scheduler<br/>(ThreadPool)"]
    selector -->|top 3 frames| validation["Validation Agent"]
    scheduler --> ovms["OVMS VLM<br/>(Qwen2.5-VL, GPU-INT8)"]
    validation --> semantic["Semantic Service"]
  end

  ovms --> ui["Gradio UI<br/>(Interface)"]
  semantic --> ui

Component Details#

Core Components#

1. VLM Backend (OVMS)#

OpenVINO Model Server hosting Qwen2.5-VL-7B for vision-language inference.

Features:

OpenAI-compatible API (/v3/chat/completions)
INT8 quantization for optimized performance
GPU acceleration via Intel/NVIDIA hardware
Shared model instance for both applications

API Usage:

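import requests

# OVMS_ENDPOINT, prompt, and img_b64 (base64-encoded JPEG bytes) are supplied by the caller.
response = requests.post(
    f"{OVMS_ENDPOINT}/v3/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    # The image is sent inline as a base64 data URL
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}}
                ]
            }
        ]
    },
    timeout=300,  # VLM inference can take several seconds
)
response.raise_for_status()  # surface HTTP errors instead of parsing an error body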

2. Semantic Comparison Service#

AI-powered semantic matching microservice for intelligent item comparison.

Matching Strategies:

Exact: Direct string comparison
Semantic: Vector similarity using sentence-transformers
Hybrid: Exact first, then semantic fallback

Example Matches:

“Big Mac” ↔ “Maharaja Mac” (regional name variant)
“green apple” ↔ “apple” (partial match)
“large fries” ↔ “french fries large” (word reordering)
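
The Hybrid strategy can be illustrated with a minimal sketch. It is not the service's implementation: `bow_cosine` is a hypothetical stand-in that computes cosine similarity over word counts instead of sentence-transformers embeddings, the exact step is assumed to be case-insensitive, and the 0.7 threshold is borrowed from the Dine-In pipeline description.

```python
import math
from collections import Counter
from typing import Callable, Optional


def bow_cosine(a: str, b: str) -> float:
    """Stand-in similarity: cosine over lowercase word counts (not real embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def hybrid_match(
    expected: str,
    detected: list[str],
    similarity: Callable[[str, str], float] = bow_cosine,
    threshold: float = 0.7,
) -> tuple[Optional[str], float, str]:
    """Return (best_match, score, strategy): exact first, then similarity fallback."""
    wanted = expected.strip().lower()
    for name in detected:
        if name.strip().lower() == wanted:
            return name, 1.0, "exact"
    best, best_score = None, 0.0
    for name in detected:
        score = similarity(expected, name)
        if score > best_score:
            best, best_score = name, score
    if best is not None and best_score > threshold:
        return best, best_score, "semantic"
    return None, best_score, "none"
```

With this stand-in, "large fries" matches "french fries large" and "green apple" matches "apple", but "Big Mac" does not reach the threshold against "Maharaja Mac". Regional variants like that one need a real embedding model, which can be supplied through the `similarity` parameter.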

3. Frame Selector Service (Take-Away)#

YOLO-based intelligent frame selection for optimal VLM input.

Process:

Receive raw video frames from GStreamer pipeline
Run YOLO object detection on each frame
Score frames by item visibility and clarity
Select top K frames per order (K = 3 in the Take-Away architecture diagram)
Store selected frames in MinIO
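
The scoring and top-K steps above can be sketched as follows. The `Detection` and `Frame` types and the scoring formula (confidence weighted by bounding-box area) are assumptions made for illustration; the source does not specify how visibility and clarity are scored, and the YOLO inference and MinIO upload are left out.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    confidence: float  # detector confidence in [0, 1]
    area_frac: float   # bounding-box area as a fraction of the frame


@dataclass
class Frame:
    frame_id: str
    detections: list[Detection]


def score_frame(frame: Frame) -> float:
    """Hypothetical score: large, confidently detected items score higher."""
    return sum(d.confidence * d.area_frac for d in frame.detections)


def select_top_k(frames: list[Frame], k: int = 3) -> list[Frame]:
    """Return the k highest-scoring frames, best first."""
    return heapq.nlargest(k, frames, key=score_frame)
```

`heapq.nlargest` avoids sorting the whole frame buffer when K is small relative to the number of captured frames.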

4. VLM Scheduler (Take-Away)#

Request batching scheduler optimizing OVMS throughput.

Batching Strategy:

Time Window: 50-100ms collection period
Max Batch Size: Configurable (default: 16)
Fair Scheduling: Round-robin across workers
Response Routing: Match responses to original requesters
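
A minimal sketch of the time-window batching described above, assuming requests arrive on a standard `queue.Queue` as `(requester_id, payload)` tuples so that responses can be routed back. The function name and the 50 ms default are illustrative, and round-robin fairness across workers is not modeled here.

```python
import queue
import time


def collect_batch(pending: queue.Queue, max_batch: int = 16,
                  window_s: float = 0.05) -> list:
    """Wait for one request, then gather more until the window closes or the batch is full."""
    batch = [pending.get()]  # block until at least one request is available
    deadline = time.monotonic() + window_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(pending.get(timeout=remaining))
        except queue.Empty:
            break  # window elapsed with no further requests
    return batch
```

Starting the window only after the first request arrives keeps an idle scheduler from spinning, while a busy one still flushes as soon as `max_batch` requests are collected.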

Docker Services#

Dine-In Services#

Container	Ports	Description
`dinein_app`	7861, 8083	Main application (Gradio + FastAPI)
`dinein_ovms_vlm`	8002	Vision-Language Model server
`dinein_semantic_service`	8081	Semantic text matching
`metrics-collector`	8084	System metrics aggregation

Take-Away Services#

Container	Ports	Description
`takeaway_app`	7860, 8080	Main application (Gradio + FastAPI)
`ovms-vlm`	8001	Vision-Language Model server
`frame-selector`	8085	YOLO-based frame selection
`semantic-service`	8081	Semantic text matching
`minio`	9000, 9001	S3-compatible storage
`rtsp-streamer`	8554	RTSP stream simulator (testing)

Data Flow#

Dine-In Validation Pipeline#

Image Processing: Raw Image → Auto-Orient → Resize (672px) → Enhance → Sharpen → JPEG Compress (82%) → Base64 Encode
VLM Inference: Prompt: “Analyze this food plate image…” + Inventory list for context → OVMS POST /v3/chat/completions → Parse JSON response for detected items
Semantic Matching: For each expected item:
- Find best match in detected items (similarity > 0.7)
- Track: matched, missing, extra, quantity mismatches
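
The "Parse JSON response" step of VLM Inference could look like the sketch below. It assumes the OpenAI-compatible response shape (`choices[0].message.content`) and a hypothetical item schema with `name` and `quantity` fields; the actual prompt and output schema are not given in this document.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker that models often wrap JSON in
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_detected_items(completion: dict) -> list[dict]:
    """Extract [{"name": ..., "quantity": ...}] from a chat-completion response."""
    text = completion["choices"][0]["message"]["content"]
    fenced = _FENCED.search(text)
    if fenced:
        text = fenced.group(1)  # keep only the JSON inside the fence
    data = json.loads(text)
    # Accept either a bare list or an object with an "items" key (assumed schema).
    items = data["items"] if isinstance(data, dict) else data
    return [
        {"name": str(item["name"]).strip().lower(),
         "quantity": int(item.get("quantity", 1))}
        for item in items
    ]
```

Normalizing names to lowercase at this stage keeps the later exact-match comparison insensitive to the model's capitalization.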

Result Aggregation:

{
  "order_complete": true/false,
  "accuracy_score": 0.0-1.0,
  "missing_items": [...],
  "extra_items": [...],
  "metrics": { "latency": [...], "tps": [...], "utilization": [...] }
}
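
One way such a result could be assembled is sketched here. Several parts are assumptions rather than the documented behavior: `difflib.SequenceMatcher` stands in for the Semantic Service, `accuracy_score` is taken to be the fraction of expected items matched with the right quantity, `quantity_mismatches` is a hypothetical extra field, and the `metrics` block is omitted.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in for the Semantic Service: character-level ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def validate_order(expected: dict[str, int], detected: dict[str, int],
                   threshold: float = 0.7) -> dict:
    """Compare {item_name: quantity} maps and build the result payload."""
    unclaimed = dict(detected)  # detected items not yet paired with an expected item
    matched, missing, mismatches = 0, [], []
    for name, qty in expected.items():
        best = max(unclaimed, key=lambda d: similarity(name, d), default=None)
        if best is None or similarity(name, best) <= threshold:
            missing.append(name)
            continue
        detected_qty = unclaimed.pop(best)  # each detection is claimed at most once
        if detected_qty == qty:
            matched += 1
        else:
            mismatches.append(name)
    extra = list(unclaimed)
    return {
        "order_complete": not (missing or extra or mismatches),
        "accuracy_score": matched / len(expected) if expected else 1.0,
        "missing_items": missing,
        "extra_items": extra,
        "quantity_mismatches": mismatches,  # hypothetical field name
    }
```

For an expected order of one cheeseburger, two fries, and one cola against a detection of one "cheese burger", one fries, and one salad, this reports cola as missing, salad as extra, and fries as a quantity mismatch.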

Take-Away Processing Pipeline#

Video Capture: RTSP Camera → GStreamer Pipeline → Frame Buffer
Frame Selection:
- Frame Selector (YOLO):
  - Object detection on raw frames
  - Score frames by item visibility
  - Select top K frames per order
  - Store selected frames in MinIO
VLM Processing:
- VLM Scheduler → OVMS (Qwen2.5-VL):
  - Batch frames by time window
  - Send to OVMS with detection prompt
  - Parse structured item response
Order Validation:
- Validation Agent:
  - Compare detected items with expected order
  - Exact match → Semantic match → Flag mismatch
  - Generate validation result
Result Output:
- { "matched": [...], "missing": [...], "extra": [...] }

Production Features#

Circuit Breaker Pattern#

Prevents cascading failures when external services are unhealthy.

    flowchart LR
    CLOSED["CLOSED"]
    OPEN["OPEN"]
    HALFOPEN["HALF-OPEN"]

    CLOSED -- "5 consecutive failures" --> OPEN
    OPEN -- "30s timeout" --> HALFOPEN
    HALFOPEN -- "2 successes" --> CLOSED
    HALFOPEN -- "1 failure" --> OPEN

Configuration:

VLM Client: 5 failures → OPEN, 30s recovery → HALF-OPEN
Semantic Client: 15s recovery timeout (shorter than the VLM client's 30s)
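
The state machine above can be sketched in a few lines. This is an illustrative, single-threaded version that uses the VLM client defaults (5 failures, 30s recovery, 2 successes); the class and method names are hypothetical, and the injectable `clock` exists only to make the timing testable.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self._clock = clock
        self.state = "CLOSED"
        self._failures = 0    # consecutive failures while CLOSED
        self._successes = 0   # successes while HALF_OPEN
        self._opened_at = 0.0

    def allow_request(self) -> bool:
        """Reject calls while OPEN; move to HALF_OPEN once the timeout has passed."""
        if self.state == "OPEN":
            if self._clock() - self._opened_at < self.recovery_timeout:
                return False
            self.state = "HALF_OPEN"
            self._successes = 0
        return True

    def record_success(self) -> None:
        if self.state == "HALF_OPEN":
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = "CLOSED"
                self._failures = 0
        else:
            self._failures = 0  # a success breaks the consecutive-failure streak

    def record_failure(self) -> None:
        if self.state == "HALF_OPEN":
            self._trip()  # a single failure during probing re-opens the circuit
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = "OPEN"
        self._opened_at = self._clock()
        self._failures = 0
```

A caller would check `allow_request()` before each outbound call and report the outcome with `record_success()` or `record_failure()`. A production version would also need a lock around the state transitions.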

Connection Pooling#

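import httpx

# VLM Client Pool Configuration
limits = httpx.Limits(
    max_keepalive_connections=20,  # idle connections kept open for reuse
    max_connections=50,            # hard cap on concurrent connections
    keepalive_expiry=30.0          # seconds before an idle connection is closed
)
timeout = httpx.Timeout(
    connect=10.0,
    read=300.0,   # Extended for VLM inference
    write=10.0,
    pool=10.0     # max wait for a free connection from the pool
)
# Assumed usage: share one client so the pool is reused across requests
client = httpx.Client(limits=limits, timeout=timeout)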

Bounded Cache (LRU)#

Thread-safe LRU cache with automatic eviction to prevent memory exhaustion:

Maximum 10,000 entries
Automatic eviction of oldest entries when full
Thread-safe operations with locking
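
A cache with these properties can be built on `collections.OrderedDict` and a lock. The sketch below is one possible shape, not the platform's code; the class name and the `get`/`put` interface are assumptions.

```python
import threading
from collections import OrderedDict


class BoundedLRUCache:
    def __init__(self, max_entries: int = 10_000):
        self._max = max_entries
        self._data = OrderedDict()  # ordered from least to most recently used
        self._lock = threading.Lock()

    def get(self, key, default=None):
        with self._lock:
            if key not in self._data:
                return default
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]

    def put(self, key, value) -> None:
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
            while len(self._data) > self._max:
                self._data.popitem(last=False)  # evict the least recently used entry

    def __len__(self) -> int:
        with self._lock:
            return len(self._data)
```

Because reads also reorder entries, `get` takes the same lock as `put`; a plain dictionary read without the lock would race with eviction.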

Station Worker Reliability (Take-Away)#

Feature	Implementation
GStreamer Pipeline	RTSP → H.264 decode → Frame capture
Circuit Breaker	5 failures in 5 min → 30s cooldown
Exponential Backoff	2s → 4s → 8s → … → 60s max
Stall Detection	No frames for 5 min triggers restart
Health Monitoring	Frame rate, pipeline state tracking
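
The backoff schedule in the table (2s → 4s → 8s → … → 60s max) can be expressed as a small generator. The function name and parameters are illustrative, and whether the real workers add jitter is not stated in this document.

```python
from itertools import islice
from typing import Iterator


def backoff_delays(base: float = 2.0, factor: float = 2.0,
                   cap: float = 60.0) -> Iterator[float]:
    """Yield retry delays in seconds: base, base*factor, ... capped at `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


# First seven delays before a station worker would retry its pipeline
print(list(islice(backoff_delays(), 7)))  # [2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

A worker would typically sleep for the next delay after each failed reconnect and create a fresh generator once the pipeline is healthy again.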

Performance Characteristics#

Metric	Dine-In	Take-Away
End-to-End Latency	8-15 seconds	Real-time stream
VLM Inference	5-10 seconds	5-10 seconds (batched)
Semantic Matching	50-200ms	50-200ms
Throughput	~4-6 req/min	Multiple concurrent streams
GPU Utilization	60-80%	70-90% (parallel mode)