# Troubleshooting

This guide covers common issues and solutions.

## Connection and Startup

### Connection Refused on Port 9090

**Symptom**: `curl: (7) Failed to connect to localhost port 9090: Connection refused`

**Check:**

1. Container is running: `docker ps | grep metrics-manager`
2. Port is bound: `docker port metrics-manager`
3. Service is healthy: `docker logs metrics-manager | tail -20`

**Solution:**

```bash
# Check if container is running
docker ps

# If not running, start it
docker compose up -d

# Check logs for errors
docker logs metrics-manager
```

---

## Metrics Not Appearing

### Custom Metric Not Appearing on `/api/v1/metrics`

**Symptom**: Metric accepted (201 response) but doesn't appear in query.

**Check:**

```bash
# Verify metric was accepted
curl -X POST http://localhost:9090/api/v1/metrics/simple \
  -H "Content-Type: application/json" \
  -d '{"name": "test_metric", "value": 42}'

# Query immediately
curl http://localhost:9090/api/v1/metrics | jq '.metrics | keys'
```

**Causes and Solutions:**

| Cause                         | Solution                                                                  |
| ----------------------------- | ------------------------------------------------------------------------- |
| Metric expired (default 300s) | Set `METRICS_RETENTION_SECONDS=3600` to keep longer                       |
| Memory limit reached          | Set `MAX_METRICS_IN_MEMORY=500000` to increase limit                      |
| Telegraf :8186 unreachable    | Check Telegraf is running: `docker logs metrics-manager \| grep telegraf` |
| Invalid metric format         | Check request format matches one of the four supported formats            |

---

### Custom Metric Not Appearing in Prometheus (`:9273/metrics`)

**Symptom**: Custom metric appears in `/api/v1/metrics` but not in `:9273/metrics`.

**Root cause**: Metric hasn't been persisted to Telegraf yet (debounced 100ms by default).

**Solution:**

```bash
# Wait a moment and check again
sleep 1
curl http://localhost:9273/metrics | grep my_metric
```

Or reduce debounce delay:

```bash
FILE_PERSIST_DEBOUNCE_MS=10 docker compose up
```

---

### Custom Metric Not Appearing in SSE Stream

**Symptom**: Metric in Prometheus endpoint but not in `/metrics/stream`.

**Root cause**: SSE client not polling frequently enough (default 500ms).

**Solution:**

1. Reduce polling interval:

   ```bash
   PROMETHEUS_POLLER_INTERVAL_MS=100 docker compose up
   ```

2. Wait for next polling cycle:
   ```bash
   # Default 500ms polling interval
   sleep 1
   curl -N -H "Accept: text/event-stream" http://localhost:9090/metrics/stream
   ```

---

## GPU and NPU Metrics

### No GPU Metrics

**Symptom**: No `gpu_*` or `engine_usage_*` metrics in `:9273/metrics`.

**Check:**

```bash
# Verify Intel GPU is present
lspci | grep -i intel | grep -i graphics

# Check qmassa FIFO exists
docker exec metrics-manager ls -la /app/qmassa.fifo

# Check if qmassa is running
docker exec metrics-manager supervisorctl -c /etc/supervisor/supervisord.conf status qmassa

# View qmassa logs
docker logs metrics-manager | grep qmassa
```

**Solutions:**

| Issue                     | Fix                                                                                   |
| ------------------------- | ------------------------------------------------------------------------------------- |
| No Intel GPU              | Expected. qmassa logs `No DRM devices found` and exits. Other metrics continue.       |
| `/dev/dri` not accessible | Ensure `--device /dev/dri` in `docker run` or `devices:` in `compose.yaml`            |
| Old GPU drivers           | Update GPU drivers: `sudo apt update && sudo apt install intel-media-driver` (Ubuntu) |
| qmassa process crashed    | Check logs: `docker logs metrics-manager \| grep -i qmassa`                           |

---

### No NPU Metrics

**Symptom**: No `npu_power`, `npu_frequency`, etc. in `:9273/metrics`.

**Check:**

```bash
# Verify Intel NPU driver is loaded
ls /sys/bus/pci/drivers/intel_vpu/

# Check PMT sysfs interface
ls /sys/class/intel_pmt/

# Verify privileged mode
docker inspect metrics-manager | grep Privileged

# Check npu_reader logs
docker exec metrics-manager cat /app/npu_reader_trace.log

# View supervisord status
docker exec metrics-manager supervisorctl -c /etc/supervisor/supervisord.conf status
```

**Solutions:**

| Issue                         | Fix                                                                           |
| ----------------------------- | ----------------------------------------------------------------------------- |
| `intel_vpu` driver not loaded | Load it: `sudo modprobe intel_vpu` (host)                                     |
| Container not privileged      | Run with `privileged: true` (Docker) or `docker run --privileged`             |
| `/sys` not accessible         | Mount with `--privileged` or `-v /sys:/sys:ro`                                |
| Old hardware (pre-PTL)        | `npu_memory_mb` returns `-1` — expected on MTL/ARL                            |
| No NPU hardware               | Expected. Reader logs warning, then enters idle mode. Other metrics continue. |

---

## Telegraf Issues

### Telegraf Metrics Empty

**Symptom**: `:9273/metrics` returns 404 or empty response.

**Check:**

```bash
# Verify Telegraf is running
docker exec metrics-manager supervisorctl -c /etc/supervisor/supervisord.conf status telegraf

# Check Telegraf logs
docker logs metrics-manager | grep -i telegraf

# Check Telegraf config syntax
docker exec metrics-manager telegraf -config /etc/telegraf/telegraf.conf -test
```

**Common issues:**

| Issue                            | Solution                                                   |
| -------------------------------- | ---------------------------------------------------------- |
| Telegraf configuration error     | Check logs: `docker logs metrics-manager \| grep -i error` |
| CPU metrics disabled             | Verify `[[inputs.cpu]]` block in `telegraf.conf`           |
| Prometheus output not configured | Verify `[[outputs.prometheus_client]]` in `telegraf.conf`  |

---

## Rate Limiting

### Rate Limited (429 Too Many Requests)

**Symptom**: `HTTP/1.1 429 Too Many Requests`

**Check:**

```bash
# Current rate limit config
curl http://localhost:9090/api/v1/stats | jq '.requests_total, .errors_total'
```

**Solution:**

Increase rate limits in `.env`:

```bash
RATE_LIMIT_REQUESTS_PER_MINUTE=5000
RATE_LIMIT_BURST=500
```

Then restart:

```bash
docker compose down && docker compose up -d
```

---

## Logging and Debugging

### Find Correlation IDs in Logs

```bash
# Search logs by correlation ID
docker logs metrics-manager 2>&1 | grep "correlation_id.*abc-123"

# Or use jq for JSON logs
docker logs metrics-manager 2>&1 | jq 'select(.correlation_id == "abc-123")'
```

### Enable Debug Logging

```bash
LOG_LEVEL=DEBUG docker compose up
```

This logs all requests, responses, and internal state.

### Check Service Health

```bash
# Basic health
curl http://localhost:9090/health | jq .

# Internal stats
curl http://localhost:9090/api/v1/stats | jq .

# Detailed health with store info
curl http://localhost:9090/api/health | jq .
```

---

## Graceful Shutdown

Shutdown is handled by uvicorn/FastAPI on SIGTERM and SIGINT:

```bash
# Graceful shutdown (waits up to 60s for in-flight requests)
docker compose down

# Force kill (less safe)
docker compose kill
```

---

## Memory Protection

Automatic eviction prevents memory exhaustion:

- Default limit: 100,000 metrics in memory
- Oldest metrics evicted when limit reached
- Configure via `MAX_METRICS_IN_MEMORY`

**To see memory usage:**

```bash
# Docker stats
docker stats metrics-manager

# Inside container
docker exec metrics-manager ps aux | grep uvicorn
```

---

## Docker-Specific Issues

### Port Already in Use

**Symptom**: `docker: Error response from daemon: driver failed programming external connectivity on endpoint metrics-manager`

**Solution:**

1. Change ports in `.env`:

   ```bash
   HOST_METRICS_PORT=19090
   HOST_TELEGRAF_PORT=19273
   HOST_TELEGRAF_HTTP_PORT=18186
   ```

2. Or find and stop the process using the port:

   ```bash
   lsof -i :9090  # Find process on port 9090
   kill <PID>     # Kill the process
   ```

### Insufficient Disk Space for Build

**Symptom**: `docker build: Build failed — no space left on device`

**Solution:**

```bash
# Clean up Docker build cache
docker builder prune

# Or remove all unused images
docker system prune -a
```

---

## Testing All Endpoints

Use this script to smoke-test all major endpoints:

```bash
#!/bin/bash
set -e

echo "=== 1. Health ==="
curl -s http://localhost:9090/health | jq .

echo -e "\n=== 2. Push Simple Metric ==="
curl -s -X POST http://localhost:9090/api/v1/metrics/simple \
  -H "Content-Type: application/json" \
  -d '{"name": "test_metric", "value": 123.45}' | jq .

echo -e "\n=== 3. Query Metrics ==="
curl -s http://localhost:9090/api/v1/metrics/latest | jq '.metrics | keys'

echo -e "\n=== 4. Prometheus Format ==="
curl -s http://localhost:9090/metrics | head

echo -e "\n=== 5. Telegraf Endpoint ==="
curl -s http://localhost:9273/metrics | head

echo -e "\n=== 6. SSE Stream (first event) ==="
timeout 2 curl -N -H "Accept: text/event-stream" http://localhost:9090/metrics/stream || true

echo -e "\n=== All tests passed! ==="
```

---

## Getting Help

1. **Check logs**: `docker logs metrics-manager`
2. **Check service health**: `curl http://localhost:9090/api/health`
3. **Increase log level**: `LOG_LEVEL=DEBUG docker compose up`
4. **Search GitHub issues**: [Edge AI Libraries Issues Page](https://github.com/open-edge-platform/edge-ai-libraries/issues) (use `metrics-manager` label)
5. **Manual endpoint testing**: Use curl commands from [API Reference](./api-reference.md)

## Supporting Resources

- [API Reference](./api-reference.md)
- [Environment Variables](./get-started/environment-variables.md)
- [How It Works](./how-it-works.md)
- [Get Started Guide](./get-started.md)

## License

Copyright (C) 2025-2026 Intel Corporation

SPDX-License-Identifier: Apache-2.0