Troubleshooting#

This guide covers common issues and solutions.

Connection and Startup#

Connection Refused on Port 9090#

Symptom: curl: (7) Failed to connect to localhost port 9090: Connection refused

Check:

Container is running: docker ps | grep metrics-manager
Port is bound: docker port metrics-manager
Service is healthy: docker logs metrics-manager | tail -20

Solution:

# Check if container is running
docker ps

# If not running, start it
docker compose up -d

# Check logs for errors
docker logs metrics-manager

Capabilities API#

Capabilities Endpoint Not Reachable#

Symptom: curl: (7) Failed to connect or non-200 response for /api/v1/capabilities.

Check:

curl -s http://localhost:9090/health | jq .
curl -s "http://localhost:9090/api/v1/capabilities?profile=minimal" | jq .

Solution:

Ensure the service is running and healthy.
Ensure host port mapping targets container port 9090.

Capabilities Response Missing Hardware Enrichment#

Symptom: Capabilities response is present, but model/vendor/memory-type/storage details are limited or unknown.

Check:

# Confirm basic Linux host visibility in container
docker exec metrics-manager ls /sys /proc

# Confirm capabilities support tools exist
docker exec metrics-manager dmidecode --version
docker exec metrics-manager lspci -nn | head

# Compare minimal vs expanded profile output
curl -s "http://localhost:9090/api/v1/capabilities?profile=minimal" | jq .platform
curl -s "http://localhost:9090/api/v1/capabilities?profile=expanded" | jq .devices

Common causes and fixes:

Cause	Fix
Host sysfs/proc not visible	Run with required mounts (`/sys`, `/run`) and host PID mode as documented in get-started guide
GPU devices not visible	Add `--device /dev/dri` (or compose `devices:`)
NPU sysfs path unavailable	Run privileged mode for NPU telemetry scenarios
Missing runtime tools (`dmidecode`, `lspci`)	Ensure image includes `dmidecode` and `pciutils`

Metrics Not Appearing#

Custom Metric Not Appearing on `/api/v1/metrics`#

Symptom: Metric accepted (201 response) but doesn’t appear in query.

Check:

# Verify metric was accepted
curl -X POST http://localhost:9090/api/v1/metrics/simple \
  -H "Content-Type: application/json" \
  -d '{"name": "test_metric", "value": 42}'

# Query immediately
curl http://localhost:9090/api/v1/metrics | jq '.metrics | keys'

Causes and Solutions:

Cause	Solution
Metric expired (default 300s)	Set `METRICS_RETENTION_SECONDS=3600` to keep longer
Memory limit reached	Set `MAX_METRICS_IN_MEMORY=500000` to increase limit
Telegraf :8186 unreachable	Check Telegraf is running: `docker logs metrics-manager \| grep telegraf`
Invalid metric format	Check request format matches one of the four supported formats

Custom Metric Not Appearing in Prometheus (`:9273/metrics`)#

Symptom: Custom metric appears in /api/v1/metrics but not in :9273/metrics.

Root cause: Metric hasn’t been persisted to Telegraf yet (debounced 100ms by default).

Solution:

# Wait a moment and check again
sleep 1
curl http://localhost:9273/metrics | grep my_metric

Or reduce debounce delay:

FILE_PERSIST_DEBOUNCE_MS=10 docker compose up

Custom Metric Not Appearing in SSE Stream#

Symptom: Metric in Prometheus endpoint but not in /metrics/stream.

Root cause: SSE client not polling frequently enough (default 500ms).

Solution:

Reduce polling interval:

PROMETHEUS_POLLER_INTERVAL_MS=100 docker compose up

Wait for next polling cycle:

# Default 500ms polling interval
sleep 1
curl -N -H "Accept: text/event-stream" http://localhost:9090/metrics/stream

GPU and NPU Metrics#

No GPU Metrics#

Symptom: No gpu_* or engine_usage_* metrics in :9273/metrics.

Check:

# Verify Intel GPU is present
lspci | grep -i intel | grep -i graphics

# Check qmassa FIFO exists
docker exec metrics-manager ls -la /app/qmassa.fifo

# Check if qmassa is running
docker exec metrics-manager supervisorctl -c /etc/supervisor/supervisord.conf status qmassa

# View qmassa logs
docker logs metrics-manager | grep qmassa

Solutions:

Issue	Fix
No Intel GPU	Expected. qmassa logs `No DRM devices found` and exits. Other metrics continue.
`/dev/dri` not accessible	Ensure `--device /dev/dri` in `docker run` or `devices:` in `compose.yaml`
Old GPU drivers	Update GPU drivers: `sudo apt update && sudo apt install intel-media-driver` (Ubuntu)
qmassa process crashed	Check logs: `docker logs metrics-manager \| grep -i qmassa`

No NPU Metrics#

Symptom: No npu_power, npu_frequency, etc. in :9273/metrics.

Check:

# Verify Intel NPU driver is loaded
ls /sys/bus/pci/drivers/intel_vpu/

# Check PMT sysfs interface
ls /sys/class/intel_pmt/

# Verify privileged mode
docker inspect metrics-manager | grep Privileged

# Check npu_reader logs
docker exec metrics-manager cat /app/npu_reader_trace.log

# View supervisord status
docker exec metrics-manager supervisorctl -c /etc/supervisor/supervisord.conf status

Solutions:

Issue	Fix
`intel_vpu` driver not loaded	Load it: `sudo modprobe intel_vpu` (host)
Container not privileged	Run with `privileged: true` (Docker) or `docker run --privileged`
`/sys` not accessible	Mount with `--privileged` or `-v /sys:/sys:ro`
Old hardware (pre-PTL)	`npu_memory_mb` returns `-1` — expected on MTL/ARL
No NPU hardware	Expected. Reader logs warning, then enters idle mode. Other metrics continue.

Telegraf Issues#

Telegraf Metrics Empty#

Symptom: :9273/metrics returns 404 or empty response.

Check:

# Verify Telegraf is running
docker exec metrics-manager supervisorctl -c /etc/supervisor/supervisord.conf status telegraf

# Check Telegraf logs
docker logs metrics-manager | grep -i telegraf

# Check Telegraf config syntax
docker exec metrics-manager telegraf -config /etc/telegraf/telegraf.conf -test

Common issues:

Issue	Solution
Telegraf configuration error	Check logs: `docker logs metrics-manager \| grep -i error`
CPU metrics disabled	Verify `[[inputs.cpu]]` block in `telegraf.conf`
Prometheus output not configured	Verify `[[outputs.prometheus_client]]` in `telegraf.conf`

Rate Limiting#

Rate Limited (429 Too Many Requests)#

Symptom: HTTP/1.1 429 Too Many Requests

Check:

# Current rate limit config
curl http://localhost:9090/api/v1/stats | jq '.requests_total, .errors_total'

Solution:

Increase rate limits in .env:

RATE_LIMIT_REQUESTS_PER_MINUTE=5000
RATE_LIMIT_BURST=500

Then restart:

docker compose down && docker compose up -d

Logging and Debugging#

Find Correlation IDs in Logs#

# Search logs by correlation ID
docker logs metrics-manager 2>&1 | grep "correlation_id.*abc-123"

# Or use jq for JSON logs
docker logs metrics-manager 2>&1 | jq 'select(.correlation_id == "abc-123")'

Enable Debug Logging#

LOG_LEVEL=DEBUG docker compose up

This logs all requests, responses, and internal state.

Check Service Health#

# Basic health
curl http://localhost:9090/health | jq .

# Internal stats
curl http://localhost:9090/api/v1/stats | jq .

# Detailed health with store info
curl http://localhost:9090/api/health | jq .

Graceful Shutdown#

Shutdown is handled by uvicorn/FastAPI on SIGTERM and SIGINT:

# Graceful shutdown (waits up to 60s for in-flight requests)
docker compose down

# Force kill (less safe)
docker compose kill

Memory Protection#

Automatic eviction prevents memory exhaustion:

Default limit: 100,000 metrics in memory
Oldest metrics evicted when limit reached
Configure via MAX_METRICS_IN_MEMORY

To see memory usage:

# Docker stats
docker stats metrics-manager

# Inside container
docker exec metrics-manager ps aux | grep uvicorn

Docker-Specific Issues#

Port Already in Use#

Symptom: docker: Error response from daemon: driver failed programming external connectivity on endpoint metrics-manager

Solution:

Change ports in .env:

HOST_METRICS_PORT=19090
HOST_TELEGRAF_PORT=19273
HOST_TELEGRAF_HTTP_PORT=18186

Or find and stop the process using the port:

lsof -i :9090  # Find process on port 9090
kill <PID>     # Kill the process

Insufficient Disk Space for Build#

Symptom: docker build: Build failed — no space left on device

Solution:

# Clean up Docker build cache
docker builder prune

# Or remove all unused images
docker system prune -a

Testing All Endpoints#

Use this script to smoke-test all major endpoints:

#!/bin/bash
set -e

echo "=== 1. Health ==="
curl -s http://localhost:9090/health | jq .

echo -e "\n=== 2. Push Simple Metric ==="
curl -s -X POST http://localhost:9090/api/v1/metrics/simple \
  -H "Content-Type: application/json" \
  -d '{"name": "test_metric", "value": 123.45}' | jq .

echo -e "\n=== 3. Query Metrics ==="
curl -s http://localhost:9090/api/v1/metrics/latest | jq '.metrics | keys'

echo -e "\n=== 4. Prometheus Format ==="
curl -s http://localhost:9090/metrics | head

echo -e "\n=== 5. Telegraf Endpoint ==="
curl -s http://localhost:9273/metrics | head

echo -e "\n=== 6. SSE Stream (first event) ==="
timeout 2 curl -N -H "Accept: text/event-stream" http://localhost:9090/metrics/stream || true

echo -e "\n=== All tests passed! ==="

Getting Help#

Check logs: docker logs metrics-manager
Check service health: curl http://localhost:9090/api/health
Increase log level: LOG_LEVEL=DEBUG docker compose up
Search GitHub issues: Edge AI Libraries Issues Page (use metrics-manager label)
Manual endpoint testing: Use curl commands from API Reference

Supporting Resources#

License#

SPDX-License-Identifier: Apache-2.0

Troubleshooting#

Connection and Startup#

Connection Refused on Port 9090#

Capabilities API#

Capabilities Endpoint Not Reachable#

Capabilities Response Missing Hardware Enrichment#

Metrics Not Appearing#

Custom Metric Not Appearing on /api/v1/metrics#

Custom Metric Not Appearing in Prometheus (:9273/metrics)#

Custom Metric Not Appearing in SSE Stream#

GPU and NPU Metrics#

No GPU Metrics#

No NPU Metrics#

Telegraf Issues#

Telegraf Metrics Empty#

Rate Limiting#

Rate Limited (429 Too Many Requests)#

Logging and Debugging#

Find Correlation IDs in Logs#

Enable Debug Logging#

Check Service Health#

Graceful Shutdown#

Memory Protection#

Docker-Specific Issues#

Port Already in Use#

Insufficient Disk Space for Build#

Testing All Endpoints#

Getting Help#

Supporting Resources#

License#

This Page

Custom Metric Not Appearing on `/api/v1/metrics`#

Custom Metric Not Appearing in Prometheus (`:9273/metrics`)#