# Performance Testing & Benchmarking

Test your Order Accuracy pipeline performance on various hardware configurations. This guide covers everything from quick performance checks to comprehensive system capacity testing.

## Quick Start (5 minutes)

**Goal**: Run a basic performance test to verify your system works correctly

### 1. Initialize Performance Tools

```bash
make update-submodules
```

### 2. Run Quick Benchmark

::::{tab-set}
:::{tab-item} **Dine-In**
:sync: dine-in 

```bash
cd dine-in
make benchmark
```

:::
:::{tab-item} **Take-Away**
:sync: take-away 

```bash
cd take-away
make benchmark
```

:::
::::

**What this does:**

- Tests GPU/CPU performance for order validation
- Measures end-to-end latency
- Generates performance metrics
- Outputs results to `results/` directory

## Understanding Benchmark Types

::::::{tab-set}
:::::{tab-item} **Dine-In Benchmarks**
:sync: dine-in 

::::{tab-set}
:::{tab-item} **Single Request Benchmark**
:sync: single

```bash
make benchmark
```

Tests single image validation latency:

- Image preprocessing time
- VLM inference time
- Semantic matching time
- Total end-to-end latency

:::
:::{tab-item} **Stream Density Benchmark**
:sync: density 

```bash
make benchmark-density
```

Finds maximum concurrent requests the system can handle under latency constraints:

- Target latency threshold (configurable)
- Progressive load increase
- Identifies performance ceiling

:::
::::
:::::
:::::{tab-item} **Take-Away Benchmarks**
:sync: take-away 

::::{tab-set}
:::{tab-item} **Single Video Benchmark**
:sync: single

```bash
make benchmark
```

Tests end-to-end latency for single order validation:

- Video upload time
- Frame extraction time
- VLM inference latency
- Validation time
- Total processing time

:::
:::{tab-item} **Fixed Workers Benchmark**

```bash
make benchmark-oa BENCHMARK_WORKERS=4 BENCHMARK_DURATION=300
```

Tests system with fixed number of concurrent workers:

- Throughput (orders/minute)
- Latency percentiles (P50, P95, P99)
- GPU utilization
- Memory usage

:::
:::{tab-item} **Stream Density Benchmark**
:sync: density 

```bash
make benchmark-stream-density
```

Finds maximum sustainable worker count under latency constraints:

- Maximum concurrent workers
- Latency at each worker count
- Point of degradation
- Resource utilization at capacity


:::
::::
:::::
::::::

## Environment Variables Reference

::::{tab-set}
:::{tab-item} **Dine-In Configuration**
:sync: dine-in 

| Variable            | Default                 | Description                          |
| ------------------- | ----------------------- | ------------------------------------ |
| `TARGET_LATENCY_MS` | 15000                   | Target latency threshold (ms)        |
| `LATENCY_METRIC`    | avg                     | 'avg', 'p95', or 'max'               |
| `DENSITY_INCREMENT` | 1                       | Concurrent images per iteration      |
| `INIT_DURATION`     | 60                      | Warmup time (seconds)                |
| `MIN_REQUESTS`      | 3                       | Min requests before measuring        |
| `REQUEST_TIMEOUT`   | 300                     | Individual request timeout (seconds) |
| `API_ENDPOINT`      | `http://localhost:8083` | API endpoint URL                     |
| `RESULTS_DIR`       | `./results`             | Results output directory             |

:::
:::{tab-item} **Take-Away Configuration**
:sync: take-away 

| Variable             | Default | Description                       |
| -------------------- | ------- | --------------------------------- |
| `TARGET_LATENCY_MS`  | 25000   | Target latency threshold (ms)     |
| `LATENCY_METRIC`     | avg     | 'avg', 'p95', or 'max'            |
| `WORKER_INCREMENT`   | 1       | Workers added per iteration       |
| `INIT_DURATION`      | 10      | Warmup time (seconds)             |
| `MIN_TRANSACTIONS`   | 3       | Min transactions before measuring |
| `MAX_ITERATIONS`     | 50      | Max scaling iterations            |
| `MAX_WAIT_SEC`       | 600     | Max wait per iteration (seconds)  |
| `BENCHMARK_WORKERS`  | 1       | Number of workers (fixed mode)    |
| `BENCHMARK_DURATION` | 60      | Test duration (seconds)           |

:::
::::

## Hardware Testing Commands

### GPU Performance Testing

::::{tab-set}
:::{tab-item} **Dine-In**
:sync: dine-in 

```bash
# Ensure GPU device is configured in .env
# OPENVINO_DEVICE=GPU
make benchmark
```

:::
:::{tab-item} **Take-Away**
:sync: take-away 

```bash
# Configure GPU in .env
# OPENVINO_DEVICE=GPU
make benchmark-oa BENCHMARK_WORKERS=4
```

:::
::::

### Multi-Worker Stress Testing (Take-Away)

```bash
# Test with 2 parallel workers
make up-parallel WORKERS=2
make benchmark-oa BENCHMARK_WORKERS=2

# High stress test with 8 workers
make up-parallel WORKERS=8
make benchmark-oa BENCHMARK_WORKERS=8
```

### Progressive Load Testing

```bash
# Automatically find maximum sustainable workers
make benchmark-stream-density \
  BENCHMARK_TARGET_LATENCY_MS=25000 \
  BENCHMARK_WORKER_INCREMENT=1 \
  BENCHMARK_MAX_ITERATIONS=20
```

## Viewing Results

::::{tab-set}
:::{tab-item} **Dine-In Results**
:sync: dine-in 

```bash
# View density benchmark results
make benchmark-density-results

# View raw results
cat results/benchmark_results.json
ls -la results/
```

:::
:::{tab-item} **Take-Away Results**
:sync: take-away 

```bash
# View benchmark results
make benchmark-oa-results

# View density results
cat results/stream_density_results.json
ls -la results/
```

:::
::::

### Consolidate Metrics

```bash
make consolidate-metrics
cat results/metrics_summary.csv
```

## Expected Performance

### Typical Latency Ranges

| Operation               | Dine-In   | Take-Away       |
| ----------------------- | --------- | --------------- |
| **Image Preprocessing** | 100-500ms | N/A             |
| **Frame Selection**     | N/A       | 200-500ms       |
| **VLM Inference**       | 5-10s     | 5-10s           |
| **Semantic Matching**   | 50-200ms  | 50-200ms        |
| **Total End-to-End**    | 8-15s     | 8-15s per order |

### Hardware Impact

| Configuration      | Typical Performance   |
| ------------------ | --------------------- |
| **CPU Only**       | 15-25s per validation |
| **Intel iGPU**     | 8-15s per validation  |
| **Intel Arc dGPU** | 5-10s per validation  |
| **NVIDIA RTX**     | 4-8s per validation   |

### Throughput Expectations

| Mode                               | Expected Throughput |
| ---------------------------------- | ------------------- |
| **Dine-In Single**                 | 4-6 orders/minute   |
| **Take-Away Single**               | 4-6 orders/minute   |
| **Take-Away Parallel (4 workers)** | 16-24 orders/minute |
| **Take-Away Parallel (8 workers)** | 30-40 orders/minute |

## Optimization Tips

### GPU Utilization

- Monitor GPU usage with `nvidia-smi -l 1` or `intel_gpu_top`
- Target 70-90% GPU utilization for optimal throughput
- If GPU is underutilized, increase worker count

### Memory Management

- Monitor container memory with `docker stats`
- VLM models require 8-16GB GPU memory
- Reduce batch size if out-of-memory errors occur

### Network Optimization (Take-Away)

- Use wired connections for RTSP streams
- Ensure 1Gbps+ network bandwidth per camera
- Consider local video storage for testing

### Latency Reduction

- Use INT8 model quantization
- Enable HTTP/2 for API connections
- Pre-warm VLM model before benchmarking

## Troubleshooting Performance Issues

### Low FPS / High Latency

- Check GPU driver installation
- Verify OPENVINO_DEVICE setting in .env
- Reduce image resolution or batch size
- Check for thermal throttling

### VLM Timeout Errors

- Increase API_TIMEOUT in .env
- Check GPU memory availability
- Consider using smaller model precision

### Memory Exhaustion

- Reduce number of parallel workers
- Lower batch size settings
- Monitor with `docker stats`

### Inconsistent Results

- Increase warmup duration (INIT_DURATION)
- Increase minimum transactions (MIN_TRANSACTIONS)
- Run multiple benchmark iterations