Troubleshooting#
This guide covers common issues and solutions.
Connection and Startup#
Connection Refused on Port 9090#
Symptom: curl: (7) Failed to connect to localhost port 9090: Connection refused
Check:
Container is running:
docker ps | grep metrics-manager
Port is bound:
docker port metrics-manager
Service is healthy:
docker logs metrics-manager | tail -20
Solution:
# Check if container is running
docker ps
# If not running, start it
docker compose up -d
# Check logs for errors
docker logs metrics-manager
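After `docker compose up -d`, the service may need a few seconds before port 9090 accepts connections, so an immediate `curl` can still report "Connection refused". A minimal sketch of a wait loop follows. It assumes the `/health` endpoint used elsewhere in this guide, and `wait_for_url` is a hypothetical helper name, not something shipped with metrics-manager.

```shell
# Hypothetical helper (not shipped with metrics-manager):
# poll a URL until curl succeeds or the attempts run out.
# Usage: wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_url() {
  _wfu_url="$1"
  _wfu_attempts="${2:-30}"
  _wfu_delay="${3:-1}"
  _wfu_i=0
  while [ "$_wfu_i" -lt "$_wfu_attempts" ]; do
    # -f makes curl fail on HTTP error statuses; -s and -o hide the output.
    if curl -fs -o /dev/null "$_wfu_url"; then
      return 0
    fi
    _wfu_i=$((_wfu_i + 1))
    sleep "$_wfu_delay"
  done
  return 1
}
```

Example use: `wait_for_url http://localhost:9090/health 30 1 || docker logs metrics-manager | tail -20` prints the recent logs only if the service never became reachable.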
Metrics Not Appearing#
Custom Metric Not Appearing on /api/v1/metrics#
Symptom: The metric is accepted (201 response) but does not appear in query results.
Check:
# Verify metric was accepted
curl -X POST http://localhost:9090/api/v1/metrics/simple \
-H "Content-Type: application/json" \
-d '{"name": "test_metric", "value": 42}'
# Query immediately
curl http://localhost:9090/api/v1/metrics | jq '.metrics | keys'
Causes and Solutions:
| Cause | Solution |
|---|---|
| Metric expired (default 300s) | Set |
| Memory limit reached | Set |
| Telegraf :8186 unreachable | Check Telegraf is running: |
| Invalid metric format | Check request format matches one of the four supported formats |
Custom Metric Not Appearing in Prometheus (:9273/metrics)#
Symptom: Custom metric appears in /api/v1/metrics but not in :9273/metrics.
Root cause: The metric has not been persisted to Telegraf yet (persistence is debounced by 100 ms by default).
Solution:
# Wait a moment and check again
sleep 1
curl http://localhost:9273/metrics | grep my_metric
Or reduce debounce delay:
FILE_PERSIST_DEBOUNCE_MS=10 docker compose up
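Instead of a fixed `sleep`, the Prometheus endpoint can be polled until the metric shows up. The sketch below is one way to do that. `wait_for_metric` is a hypothetical helper name, and the matching assumes the standard Prometheus text format, where a sample line starts with the metric name followed by a space or `{`.

```shell
# Hypothetical helper: poll a Prometheus text endpoint until a sample
# line for the given metric name appears.
# Usage: wait_for_metric URL METRIC_NAME [ATTEMPTS] [DELAY_SECONDS]
wait_for_metric() {
  _wfm_url="$1"
  _wfm_name="$2"
  _wfm_attempts="${3:-10}"
  _wfm_delay="${4:-1}"
  _wfm_i=0
  while [ "$_wfm_i" -lt "$_wfm_attempts" ]; do
    # Sample lines look like `name value` or `name{labels} value`.
    # Comment lines (# HELP, # TYPE) start with '#', so they never match.
    if curl -s "$_wfm_url" | grep -Eq "^${_wfm_name}([{ ]|\$)"; then
      return 0
    fi
    _wfm_i=$((_wfm_i + 1))
    sleep "$_wfm_delay"
  done
  return 1
}
```

Example use: `wait_for_metric http://localhost:9273/metrics my_metric 10 1`. Passing a fractional delay such as `0.2` relies on a `sleep` that accepts fractions (GNU coreutils does).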
Custom Metric Not Appearing in SSE Stream#
Symptom: Metric in Prometheus endpoint but not in /metrics/stream.
Root cause: SSE client not polling frequently enough (default 500ms).
Solution:
Reduce polling interval:
PROMETHEUS_POLLER_INTERVAL_MS=100 docker compose up
Wait for next polling cycle:
# Default 500ms polling interval
sleep 1
curl -N -H "Accept: text/event-stream" http://localhost:9090/metrics/stream
GPU and NPU Metrics#
No GPU Metrics#
Symptom: No gpu_* or engine_usage_* metrics in :9273/metrics.
Check:
# Verify Intel GPU is present
lspci | grep -i intel | grep -i graphics
# Check qmassa FIFO exists
docker exec metrics-manager ls -la /app/qmassa.fifo
# Check if qmassa is running
docker exec metrics-manager supervisorctl -c /etc/supervisor/supervisord.conf status qmassa
# View qmassa logs
docker logs metrics-manager | grep qmassa
Solutions:
| Issue | Fix |
|---|---|
| No Intel GPU | Expected. qmassa logs |
| | Ensure |
| Old GPU drivers | Update GPU drivers: |
| qmassa process crashed | Check logs: |
No NPU Metrics#
Symptom: No npu_power, npu_frequency, etc. in :9273/metrics.
Check:
# Verify Intel NPU driver is loaded
ls /sys/bus/pci/drivers/intel_vpu/
# Check PMT sysfs interface
ls /sys/class/intel_pmt/
# Verify privileged mode
docker inspect metrics-manager | grep Privileged
# Check npu_reader logs
docker exec metrics-manager cat /app/npu_reader_trace.log
# View supervisord status
docker exec metrics-manager supervisorctl -c /etc/supervisor/supervisord.conf status
Solutions:
| Issue | Fix |
|---|---|
| | Load it: |
| Container not privileged | Run with |
| | Mount with |
| Old hardware (pre-PTL) | |
| No NPU hardware | Expected. Reader logs warning, then enters idle mode. Other metrics continue. |
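The two host-side path checks from this section can be bundled into one preflight that reports what is missing. This is only a sketch: `npu_preflight` is a hypothetical name, and the optional ROOT argument exists purely so the function can be tried against a scratch directory.

```shell
# Hypothetical preflight: report which of the host paths checked in this
# section exist. Leave ROOT empty on a real host; it is only a path prefix
# for trying the function against a scratch directory.
# Usage: npu_preflight [ROOT]
npu_preflight() {
  _np_root="${1:-}"
  _np_missing=0
  for _np_path in /sys/bus/pci/drivers/intel_vpu /sys/class/intel_pmt; do
    if [ -d "${_np_root}${_np_path}" ]; then
      echo "ok       ${_np_path}"
    else
      echo "missing  ${_np_path}"
      _np_missing=$((_np_missing + 1))
    fi
  done
  # Exit status is non-zero when at least one path is absent.
  [ "$_np_missing" -eq 0 ]
}
```

Run `npu_preflight` on the host, not inside the container. A "missing" line points at the driver or PMT interface rather than at the container configuration.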
Telegraf Issues#
Telegraf Metrics Empty#
Symptom: :9273/metrics returns 404 or an empty response.
Check:
# Verify Telegraf is running
docker exec metrics-manager supervisorctl -c /etc/supervisor/supervisord.conf status telegraf
# Check Telegraf logs
docker logs metrics-manager | grep -i telegraf
# Check Telegraf config syntax
docker exec metrics-manager telegraf -config /etc/telegraf/telegraf.conf -test
Common issues:
| Issue | Solution |
|---|---|
| Telegraf configuration error | Check logs: |
| CPU metrics disabled | Verify |
| Prometheus output not configured | Verify |
Rate Limiting#
Rate Limited (429 Too Many Requests)#
Symptom: HTTP/1.1 429 Too Many Requests
Check:
# Request and error counters from the stats endpoint
curl http://localhost:9090/api/v1/stats | jq '.requests_total, .errors_total'
Solution:
Increase rate limits in .env:
RATE_LIMIT_REQUESTS_PER_MINUTE=5000
RATE_LIMIT_BURST=500
Then restart:
docker compose down && docker compose up -d
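Raising the limits is a server-side fix. A client that pushes metrics in bursts can also back off when it receives a 429. The sketch below retries a POST with exponential backoff. `post_with_backoff` is a hypothetical helper, not part of metrics-manager, and the delay must be a whole number of seconds because shell arithmetic is integer-only.

```shell
# Hypothetical client-side helper: POST a JSON body and retry with
# exponential backoff while the server answers 429.
# Prints the final HTTP status code; succeeds only on a 2xx status.
# Usage: post_with_backoff URL JSON_BODY [MAX_TRIES] [FIRST_DELAY_SECONDS]
post_with_backoff() {
  _pb_url="$1"
  _pb_body="$2"
  _pb_tries="${3:-5}"
  _pb_delay="${4:-1}"
  _pb_i=1
  while :; do
    # -w '%{http_code}' prints only the status code; the body is discarded.
    _pb_code=$(curl -s -o /dev/null -w '%{http_code}' -X POST "$_pb_url" \
      -H "Content-Type: application/json" -d "$_pb_body")
    if [ "$_pb_code" != "429" ]; then
      echo "$_pb_code"
      case "$_pb_code" in 2??) return 0 ;; *) return 1 ;; esac
    fi
    if [ "$_pb_i" -ge "$_pb_tries" ]; then
      echo "$_pb_code"
      return 1
    fi
    sleep "$_pb_delay"
    _pb_delay=$((_pb_delay * 2))   # double the wait before the next try
    _pb_i=$((_pb_i + 1))
  done
}
```

Example use: `post_with_backoff http://localhost:9090/api/v1/metrics/simple '{"name": "test_metric", "value": 42}'`.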
Logging and Debugging#
Find Correlation IDs in Logs#
# Search logs by correlation ID
docker logs metrics-manager 2>&1 | grep "correlation_id.*abc-123"
# Or use jq for JSON logs
docker logs metrics-manager 2>&1 | jq 'select(.correlation_id == "abc-123")'
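The plain `jq` filter stops at the first line that is not valid JSON. If the container output mixes JSON logs with plain-text lines (an assumption about this deployment, for example output from other supervised processes), a more tolerant filter reads each line as raw text and skips the ones that fail to parse. `filter_by_correlation_id` is a hypothetical helper name.

```shell
# Hypothetical filter: read log lines on stdin, keep only JSON objects whose
# correlation_id equals the first argument, and skip lines that are not JSON.
# Usage: ... | filter_by_correlation_id ID
filter_by_correlation_id() {
  # -R reads each line as a raw string; `fromjson?` drops unparsable lines.
  jq -R -c --arg id "$1" \
    'fromjson? | select(type == "object" and .correlation_id == $id)'
}
```

Example use: `docker logs metrics-manager 2>&1 | filter_by_correlation_id abc-123`.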
Enable Debug Logging#
LOG_LEVEL=DEBUG docker compose up
This logs all requests, responses, and internal state.
Check Service Health#
# Basic health
curl http://localhost:9090/health | jq .
# Internal stats
curl http://localhost:9090/api/v1/stats | jq .
# Detailed health with store info
curl http://localhost:9090/api/health | jq .
Graceful Shutdown#
Shutdown is handled by uvicorn/FastAPI on SIGTERM and SIGINT:
# Graceful shutdown (waits up to 60s for in-flight requests)
docker compose down
# Force kill (less safe)
docker compose kill
Memory Protection#
Automatic eviction prevents memory exhaustion:
Default limit: 100,000 metrics in memory
Oldest metrics evicted when limit reached
Configure via MAX_METRICS_IN_MEMORY
To see memory usage:
# Docker stats
docker stats metrics-manager
# Inside container
docker exec metrics-manager ps aux | grep uvicorn
Docker-Specific Issues#
Port Already in Use#
Symptom: docker: Error response from daemon: driver failed programming external connectivity on endpoint metrics-manager
Solution:
Change ports in .env:
HOST_METRICS_PORT=19090
HOST_TELEGRAF_PORT=19273
HOST_TELEGRAF_HTTP_PORT=18186
Or find and stop the process using the port:
lsof -i :9090 # Find process on port 9090
kill <PID> # Kill the process
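To see which ports are already taken before starting the stack, the `lsof` lookup can be wrapped in a small loop. This is a sketch: `ports_in_use` is a hypothetical helper, and the port list you pass is up to you. The ports 9090, 9273 and 8186 used in the example are the ones mentioned in this guide.

```shell
# Hypothetical pre-start check: report which of the given TCP ports
# already have a listener. Requires lsof; sockets owned by other users
# may only be visible when run as root.
# Usage: ports_in_use PORT [PORT...]
ports_in_use() {
  _piu_busy=0
  for _piu_port in "$@"; do
    # -t prints only PIDs; -sTCP:LISTEN limits the match to listening sockets.
    if lsof -t -i "TCP:${_piu_port}" -sTCP:LISTEN >/dev/null 2>&1; then
      echo "port ${_piu_port} is in use"
      _piu_busy=1
    fi
  done
  # Exit status is zero when at least one port is taken.
  [ "$_piu_busy" -eq 1 ]
}
```

Example use: `ports_in_use 9090 9273 8186 && echo "change the HOST_* ports in .env or stop the listed listeners"`.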
Insufficient Disk Space for Build#
Symptom: docker build: Build failed — no space left on device
Solution:
# Clean up Docker build cache
docker builder prune
# Or remove all unused images
docker system prune -a
Testing All Endpoints#
Use this script to smoke-test all major endpoints:
#!/bin/bash
set -e
echo "=== 1. Health ==="
curl -s http://localhost:9090/health | jq .
echo -e "\n=== 2. Push Simple Metric ==="
curl -s -X POST http://localhost:9090/api/v1/metrics/simple \
-H "Content-Type: application/json" \
-d '{"name": "test_metric", "value": 123.45}' | jq .
echo -e "\n=== 3. Query Metrics ==="
curl -s http://localhost:9090/api/v1/metrics/latest | jq '.metrics | keys'
echo -e "\n=== 4. Prometheus Format ==="
curl -s http://localhost:9090/metrics | head
echo -e "\n=== 5. Telegraf Endpoint ==="
curl -s http://localhost:9273/metrics | head
echo -e "\n=== 6. SSE Stream (first event) ==="
timeout 2 curl -N -H "Accept: text/event-stream" http://localhost:9090/metrics/stream || true
echo -e "\n=== All tests passed! ==="
Getting Help#
Check logs:
docker logs metrics-manager
Check service health:
curl http://localhost:9090/api/health
Increase log level:
LOG_LEVEL=DEBUG docker compose up
Search GitHub issues: Edge AI Libraries Issues Page (use metrics-manager label)
Manual endpoint testing: Use curl commands from API Reference
Supporting Resources#
License#
Copyright (C) 2025-2026 Intel Corporation
SPDX-License-Identifier: Apache-2.0