Order Accuracy: How It Works#
System Overview#
The Order Accuracy platform is an enterprise AI vision system for real-time order validation in quick-service restaurant (QSR) environments. It uses a Vision Language Model (VLM) to analyze images or video feeds, identify the items in them, and validate those items against order data.
Key Features#
VLM-Powered Detection: Uses Qwen2.5-VL-7B for accurate item identification
Intel Hardware Optimization: Optimized for Intel CPUs and GPUs via OpenVINO
Dual Application Support: Dine-In (image-based) and Take-Away (video stream-based)
Semantic Matching: Fuzzy matching for item name variations
Real-time Processing: Sub-15-second validation for operational efficiency
Containerized Deployment: Docker-based deployment with microservices architecture
Architecture Diagrams#
Platform Architecture#
graph TB
subgraph "Order Accuracy Platform"
subgraph "Dine-In Application"
DUI[Gradio UI :7861]
DAPI[FastAPI API :8083]
DVLM[VLM Client]
DSEM[Semantic Client]
end
subgraph "Take-Away Application"
TUI[Gradio UI :7860]
TAPI[FastAPI API :8080]
TSW[Station Workers]
TVS[VLM Scheduler]
TFS[Frame Selector]
end
subgraph "Shared Services"
OVMS[OVMS VLM<br>Qwen2.5-VL-7B]
SEM[Semantic Service]
MINIO[MinIO Storage]
end
end
DUI --> DAPI
DAPI --> DVLM
DAPI --> DSEM
DVLM --> OVMS
DSEM --> SEM
TUI --> TAPI
TAPI --> TSW
TSW --> TFS
TSW --> TVS
TVS --> OVMS
TFS --> MINIO
Dine-In Architecture#
flowchart TB
subgraph DINEIN["Dine-In Order Accuracy"]
DUI["Gradio UI<br/>(Port 7861)"] --> DAPI["FastAPI API<br/>(Port 8083)"]
DAPI --> DVLM["VLM Client<br/>(Circuit Breaker)"]
DAPI --> DSEM["Semantic Client<br/>(Circuit Breaker)"]
DAPI --> DVS["Validation Service"]
DVS --> DMET["Metrics Collector"]
end
DVLM --> OVMS["OVMS VLM<br/>(Qwen2.5-VL)"]
DSEM --> SEM["Semantic Service"]
Take-Away Architecture#
flowchart TB
subgraph SYS["Take-Away Order Accuracy"]
direction LR
rtsp["RTSP Video Streams<br/>(GStreamer)"] --> oas["Order Accuracy Service<br/>(EasyOCR)"]
oas --> minio["MinIO<br/>(Frame Storage)"]
minio --> selector["Frame Selector<br/>(YOLO11n-CPU)"]
selector -->|top 3 frames| scheduler["VLM Scheduler<br/>(ThreadPool)"]
selector -->|top 3 frames| validation["Validation Agent"]
scheduler --> ovms["OVMS VLM<br/>(Qwen2.5-VL, GPU-INT8)"]
validation --> semantic["Semantic Service"]
end
ovms --> ui["Gradio UI<br/>(Interface)"]
semantic --> ui
Component Details#
Core Components#
1. VLM Backend (OVMS)#
OpenVINO Model Server hosting Qwen2.5-VL-7B for vision-language inference.
Features:
OpenAI-compatible API (/v3/chat/completions)
INT8 quantization for optimized performance
GPU acceleration via Intel/NVIDIA hardware
Shared model instance for both applications
API Usage:
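import requests

# OVMS_ENDPOINT, prompt, and img_b64 (a base64-encoded JPEG) are assumed
# to be defined by the caller.
response = requests.post(
    f"{OVMS_ENDPOINT}/v3/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    # The text prompt and the image travel in one user message
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
                ],
            }
        ],
    },
)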
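The request above assumes `img_b64` already exists and does not show how the reply is read. The sketch below fills in both ends under stated assumptions: `build_payload` and `extract_text` are hypothetical helper names, and the response is assumed to follow the usual OpenAI chat-completions shape (`choices[0].message.content`), which an OpenAI-compatible endpoint would be expected to return.

```python
import base64


def build_payload(prompt: str, image_bytes: bytes,
                  model: str = "Qwen/Qwen2.5-VL-7B-Instruct") -> dict:
    """Build a chat-completions body with one text part and one inline JPEG."""
    # Encode the raw JPEG bytes so they can be embedded in a data URL.
    img_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"},
                    },
                ],
            }
        ],
    }


def extract_text(response_json: dict) -> str:
    """Return the assistant text from an OpenAI-style chat-completions reply.

    Assumption: the server answers with the standard `choices` list.
    """
    return response_json["choices"][0]["message"]["content"]
```

The dictionary returned by `build_payload` can be passed as the `json=` argument of `requests.post`, and `extract_text(response.json())` yields the model output that the pipeline then parses for detected items.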
2. Semantic Comparison Service#
AI-powered semantic matching microservice for intelligent item comparison.
Matching Strategies:
Exact: Direct string comparison
Semantic: Vector similarity using sentence-transformers
Hybrid: Exact first, then semantic fallback
Example Matches:
“Big Mac” ↔ “Maharaja Mac” (regional name variant)
“green apple” ↔ “apple” (partial match)
“large fries” ↔ “french fries large” (word reordering)
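The hybrid strategy can be sketched as follows. This is a minimal illustration, not the service's implementation: the `similarity` function here is a token-overlap stand-in for the sentence-transformers vector similarity, chosen only so the example runs without a model download, and the 0.7 threshold is borrowed from the Dine-In pipeline description. `hybrid_match` is a hypothetical name.

```python
from typing import List, Optional, Tuple


def similarity(a: str, b: str) -> float:
    """Stand-in similarity: token overlap coefficient in [0, 1].

    Assumption: the real service would compare sentence embeddings instead.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


def hybrid_match(expected: str, detected: List[str],
                 threshold: float = 0.7) -> Tuple[Optional[str], str]:
    """Try an exact (case-insensitive) match first, then fall back to similarity."""
    normalized = expected.strip().lower()
    for candidate in detected:
        if candidate.strip().lower() == normalized:
            return candidate, "exact"
    # Fallback: take the most similar candidate if it clears the threshold.
    best = max(detected, key=lambda c: similarity(expected, c), default=None)
    if best is not None and similarity(expected, best) > threshold:
        return best, "semantic"
    return None, "none"
```

With this stand-in, "large fries" matches "french fries large" and "apple" matches "green apple", but "Big Mac" does not match "Maharaja Mac" because the two names share only one token. Regional variants of that kind are where an embedding-based similarity is needed.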
3. Frame Selector Service (Take-Away)#
YOLO-based intelligent frame selection for optimal VLM input.
Process:
Receive raw video frames from GStreamer pipeline
Run YOLO object detection on each frame
Score frames by item visibility and clarity
Select top K frames per order
Store selected frames in MinIO
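The scoring and top-K steps can be sketched like this. The `Detection` and `Frame` types and the scoring rule (sum of detection confidences) are assumptions made for illustration; the source does not specify how visibility and clarity are scored. The default of 3 follows the "top 3 frames" label in the Take-Away diagram.

```python
import heapq
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Detection:
    label: str
    confidence: float  # detector confidence in [0, 1]


@dataclass(frozen=True)
class Frame:
    frame_id: str
    detections: Tuple[Detection, ...] = ()


def frame_score(frame: Frame) -> float:
    """Hypothetical score: more confident detections means a more useful frame."""
    return sum(d.confidence for d in frame.detections)


def select_top_k(frames: List[Frame], k: int = 3) -> List[Frame]:
    """Return the k highest-scoring frames, best first."""
    return heapq.nlargest(k, frames, key=frame_score)
```

`heapq.nlargest` avoids sorting the whole frame buffer when only a few frames are kept per order.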
4. VLM Scheduler (Take-Away)#
Request batching scheduler optimizing OVMS throughput.
Batching Strategy:
Time Window: 50-100ms collection period
Max Batch Size: Configurable (default: 16)
Fair Scheduling: Round-robin across workers
Response Routing: Match responses to original requesters
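The time-window and max-batch-size rules above can be sketched with a standard-library queue. This is an illustrative sketch only: `collect_batch` is a hypothetical name, and round-robin fairness and response routing are left out.

```python
import queue
import time
from typing import Any, List


def collect_batch(requests: "queue.Queue[Any]", max_batch: int = 16,
                  window_s: float = 0.05) -> List[Any]:
    """Wait for one request, then gather more until the window closes or the batch is full."""
    batch = [requests.get()]  # block until there is at least one request
    deadline = time.monotonic() + window_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # collection window has elapsed
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived within the window
    return batch
```

Starting the window only after the first request arrives means an idle scheduler does not emit empty batches, while a busy one still caps the extra latency per request at roughly the window length.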
Docker Services#
Dine-In Services#
| Container | Ports | Description |
|---|---|---|
| | 7861, 8083 | Main application (Gradio + FastAPI) |
| | 8002 | Vision-Language Model server |
| | 8081 | Semantic text matching |
| | 8084 | System metrics aggregation |
Take-Away Services#
| Container | Ports | Description |
|---|---|---|
| | 7860, 8080 | Main application (Gradio + FastAPI) |
| | 8001 | Vision-Language Model server |
| | 8085 | YOLO-based frame selection |
| | 8081 | Semantic text matching |
| | 9000, 9001 | S3-compatible storage |
| | 8554 | RTSP stream simulator (testing) |
Data Flow#
Dine-In Validation Pipeline#
Image Processing: Raw Image → Auto-Orient → Resize (672px) → Enhance → Sharpen → JPEG Compress (82%) → Base64 Encode
VLM Inference: Prompt ("Analyze this food plate image…") + inventory list for context → OVMS POST /v3/chat/completions → Parse JSON response for detected items
Semantic Matching: For each expected item:
Find best match in detected items (similarity > 0.7)
Track: matched, missing, extra, quantity mismatches
Result Aggregation:
{
  "order_complete": true/false,
  "accuracy_score": 0.0-1.0,
  "missing_items": [...],
  "extra_items": [...],
  "metrics": { "latency": [...], "tps": [...], "utilization": [...] }
}
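The aggregation step can be sketched as below. The source does not define how `accuracy_score` or `order_complete` are computed, so both rules here are assumptions: the score is the fraction of expected units that were detected, and an order counts as complete only when nothing is missing and nothing is extra. Item names are assumed to have been normalized by the matching step already, and the `metrics` field is omitted.

```python
from collections import Counter
from typing import Mapping


def aggregate(expected: Mapping[str, int], detected: Mapping[str, int]) -> dict:
    """Compare expected and detected item quantities and summarize the result."""
    exp, det = Counter(expected), Counter(detected)
    missing = exp - det  # units ordered but not seen (Counter drops non-positive counts)
    extra = det - exp    # units seen but not ordered
    total = sum(exp.values())
    matched = total - sum(missing.values())
    return {
        "order_complete": not missing and not extra,
        # Assumed definition: share of expected units that were detected.
        "accuracy_score": matched / total if total else 1.0,
        "missing_items": sorted(missing.elements()),
        "extra_items": sorted(extra.elements()),
    }
```

Using `Counter` subtraction handles quantity mismatches directly: ordering two fries and detecting one reports a single missing "fries".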
Take-Away Processing Pipeline#
Video Capture: RTSP Camera → GStreamer Pipeline → Frame Buffer
Frame Selection:
Frame Selector (YOLO):
Object detection on raw frames
Score frames by item visibility
Select top K frames per order
Store selected frames in MinIO
VLM Processing:
VLM Scheduler → OVMS (Qwen2.5-VL):
Batch frames by time window
Send to OVMS with detection prompt
Parse structured item response
Order Validation:
Validation Agent:
Compare detected items with expected order
Exact match → Semantic match → Flag mismatch
Generate validation result
Result Output:
{ "matched": [...], "missing": [...], "extra": [...] }
Production Features#
Circuit Breaker Pattern#
Prevents cascading failures when external services are unhealthy.
flowchart LR
CLOSED["CLOSED"]
OPEN["OPEN"]
HALFOPEN["HALF-OPEN"]
CLOSED -- "5 consecutive failures" --> OPEN
OPEN -- "30s timeout" --> HALFOPEN
HALFOPEN -- "2 successes" --> CLOSED
HALFOPEN -- "1 failure" --> OPEN
Configuration:
VLM Client: 5 failures → OPEN, 30s recovery → HALF-OPEN
Semantic Client: 15s recovery timeout (faster than VLM)
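The state machine in the diagram can be sketched as a small class. This is a minimal, single-threaded illustration under assumptions: the class and method names are hypothetical, and the clock is injectable only so the behavior can be exercised without waiting 30 seconds.

```python
import time
from typing import Callable


class CircuitBreaker:
    """CLOSED -> OPEN after repeated failures, OPEN -> HALF_OPEN after a timeout."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0,
                 success_threshold: int = 2,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self._clock = clock
        self.state = "CLOSED"
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def allow_request(self) -> bool:
        """Return True if a call may proceed; moves OPEN to HALF_OPEN after the timeout."""
        if self.state == "OPEN":
            if self._clock() - self._opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"
                self._successes = 0
                return True
            return False
        return True

    def record_success(self) -> None:
        if self.state == "HALF_OPEN":
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = "CLOSED"
                self._failures = 0
        else:
            self._failures = 0  # a success breaks the run of consecutive failures

    def record_failure(self) -> None:
        if self.state == "HALF_OPEN":
            self._trip()  # a single failure while probing reopens the circuit
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = "OPEN"
        self._opened_at = self._clock()
        self._failures = 0
```

A caller checks `allow_request()` before each outbound call and reports the outcome with `record_success()` or `record_failure()`. Under this sketch, the Semantic Client configuration would correspond to `CircuitBreaker(recovery_timeout=15.0)`.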
Connection Pooling#
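import httpx

# VLM Client Pool Configuration
limits = httpx.Limits(
    max_keepalive_connections=20,  # idle connections kept open for reuse
    max_connections=50,            # hard cap on concurrent connections
    keepalive_expiry=30.0,         # seconds before an idle connection is closed
)
timeout = httpx.Timeout(
    connect=10.0,
    read=300.0,  # Extended for VLM inference
    write=10.0,
    pool=10.0,   # max wait to acquire a connection from the pool
)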
Bounded Cache (LRU)#
Thread-safe LRU cache with automatic eviction to prevent memory exhaustion:
Maximum 10,000 entries
Automatic eviction of oldest entries when full
Thread-safe operations with locking
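A cache with these three properties can be sketched with `OrderedDict` and a lock. The class name and method names are hypothetical, and the platform's actual cache may differ.

```python
import threading
from collections import OrderedDict
from typing import Any, Hashable


class BoundedLRUCache:
    """Thread-safe LRU cache that evicts the least recently used entry when full."""

    def __init__(self, max_entries: int = 10_000) -> None:
        self._max_entries = max_entries
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()
        self._lock = threading.Lock()

    def get(self, key: Hashable, default: Any = None) -> Any:
        with self._lock:
            if key not in self._data:
                return default
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]

    def put(self, key: Hashable, value: Any) -> None:
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
            # Evict from the least recently used end until within bounds.
            while len(self._data) > self._max_entries:
                self._data.popitem(last=False)

    def __len__(self) -> int:
        with self._lock:
            return len(self._data)
```

Holding one lock around every operation keeps the recency order and the size bound consistent when several worker threads read and write at once.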
Station Worker Reliability (Take-Away)#
| Feature | Implementation |
|---|---|
| GStreamer Pipeline | RTSP → H.264 decode → Frame capture |
| Circuit Breaker | 5 failures in 5 min → 30s cooldown |
| Exponential Backoff | 2s → 4s → 8s → … → 60s max |
| Stall Detection | No frames for 5 min triggers restart |
| Health Monitoring | Frame rate, pipeline state tracking |
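The backoff schedule in the table (doubling from 2s and capping at 60s) can be sketched as a generator. `backoff_delays` is a hypothetical name, and jitter, which many reconnect loops add, is not shown because the source does not mention it.

```python
from itertools import islice
from typing import Iterator


def backoff_delays(base: float = 2.0, factor: float = 2.0,
                   cap: float = 60.0) -> Iterator[float]:
    """Yield reconnect delays in seconds: base, base*factor, ... capped at `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


# First seven delays a station worker would wait between restart attempts.
first_seven = list(islice(backoff_delays(), 7))
# → [2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

A worker would create a fresh generator after each successful reconnect so that the next outage starts again from the 2s delay.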
Performance Characteristics#
| Metric | Dine-In | Take-Away |
|---|---|---|
| End-to-End Latency | 8-15 seconds | Real-time stream |
| VLM Inference | 5-10 seconds | 5-10 seconds (batched) |
| Semantic Matching | 50-200ms | 50-200ms |
| Throughput | ~4-6 req/min | Multiple concurrent streams |
| GPU Utilization | 60-80% | 70-90% (parallel mode) |