Metrics Manager#


Metrics Manager is an open-source, container-ready service for unified collection, ingestion, and real-time relay of system and application metrics on edge and cloud nodes. It bundles Telegraf-based hardware telemetry (CPU, memory, temperature, Intel® GPU, Intel® NPU) with a FastAPI REST surface that accepts custom metrics in JSON, InfluxDB Line Protocol, and OpenTelemetry formats, and exposes them through a Prometheus-compatible endpoint as well as a Server-Sent Events (SSE) stream suitable for live dashboards.

Key Benefits#

Multi-format Ingestion: Accepts metrics from any source — JSON, InfluxDB Line Protocol, OpenTelemetry (OTLP), or simple single-metric endpoints
Real-time Streaming: Live SSE stream for dashboards without polling overhead; browser-friendly HTML UI included
Hardware Telemetry: Automatic collection from CPU, RAM, temperature sensors, Intel® Arc GPU (qmassa), and Intel® NPU (via PMT)
Container-Ready: Single Docker image with all dependencies; runs on Kubernetes via Helm chart
Low Latency: In-memory metrics with configurable retention; debounced persistence to avoid bottlenecks

Features#

Four REST API formats for ingestion: JSON Batch, Simple JSON, InfluxDB Line Protocol, OpenTelemetry (OTLP)
Prometheus-compatible output (/metrics and /metrics/latest endpoints)
Server-Sent Events (SSE) streaming (/metrics/stream) with automatic HTML UI in browsers
Custom metrics scripts: executables dropped into /app/custom-metrics/ are run every 10 s via Telegraf
Rate limiting — token bucket per IP, configurable burst
Structured JSON logging with correlation IDs for distributed tracing
Health checks — basic and detailed endpoints with service statistics
Memory protection: automatic eviction of the oldest metrics when the limit is reached (default 100k metrics)
Flexible configuration: 30+ environment variables for tuning behavior such as retention, rate limiting, and memory limits
Docker Compose & Kubernetes — production-ready compose.yaml and Helm chart included
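
As an illustration of the per-IP token bucket mentioned in the feature list, here is a minimal sketch. It is not the service's actual implementation: the class name `PerIpRateLimiter` and the `rate`, `burst`, and `clock` parameters are assumptions made for this example.

```python
import time
from typing import Callable, Dict, Tuple


class PerIpRateLimiter:
    """Token bucket per client IP. All names here are illustrative."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate      # tokens refilled per second
        self.burst = burst    # bucket capacity, i.e. the largest allowed burst
        self.clock = clock    # injectable clock, handy for deterministic tests
        # ip -> (tokens currently available, time of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, ip: str) -> bool:
        """Return True if the request from `ip` may proceed."""
        now = self.clock()
        # A client seen for the first time starts with a full bucket.
        tokens, last = self._buckets.get(ip, (float(self.burst), now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0  # each accepted request consumes one token
        self._buckets[ip] = (tokens, now)
        return allowed
```

Each client address gets its own bucket, so one noisy producer cannot exhaust the budget of another. A real deployment would also need to expire idle buckets, which this sketch omits.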

Use Cases#

Edge AI Inference: Monitor model latency, throughput, and GPU/NPU utilization in real time
System Monitoring: Collect CPU, RAM, temperature from heterogeneous edge nodes (Intel Arc GPU, NPU)
Live Dashboards: Stream metrics to ViPPET, Grafana, or a custom WebUI without polling
Multi-source Aggregation: Ingest metrics from Telegraf agents, OpenTelemetry collectors, and custom applications in one place
Telemetry Integration: Accept metrics from any framework (OTLP) or protocol (InfluxDB Line Protocol) without code changes

Key Metrics Collected#

System Metrics (via Telegraf, every 1 second):

CPU: per-core usage (user, system, idle), frequency, temperature
Memory: used and available percentage, total and used bytes
Intel Arc GPU: engine usage (compute, render, copy, video), frequency, power
Intel NPU: power, frequency, temperature, utilization, bandwidth, tile configuration, memory
Temperature: CPU package temperature via coretemp

Custom Metrics:

Any supported format (JSON, InfluxDB Line Protocol, OTLP) is accepted via the REST API
Automatic tag support (e.g., {"source": "camera1", "model": "yolov8"})
Configurable retention (default 300 seconds)
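
To make the retention and eviction behavior concrete, the following sketch shows one way an in-memory store for tagged samples could work. The `Sample` and `MetricStore` names, their fields, and the assumption that samples arrive in timestamp order are all hypothetical; only the defaults (300 seconds, 100k metrics) come from the text above.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Optional


@dataclass
class Sample:
    name: str
    value: float
    timestamp: float                       # seconds since the epoch
    tags: Dict[str, str] = field(default_factory=dict)


class MetricStore:
    """Illustrative in-memory store; assumes samples arrive in time order."""

    def __init__(self, retention_s: float = 300.0,
                 max_samples: int = 100_000) -> None:
        self.retention_s = retention_s
        self.max_samples = max_samples
        self._samples: Deque[Sample] = deque()

    def __len__(self) -> int:
        return len(self._samples)

    def add(self, sample: Sample) -> None:
        self._samples.append(sample)
        # Memory protection: evict the oldest samples once over the cap.
        while len(self._samples) > self.max_samples:
            self._samples.popleft()

    def prune(self, now: float) -> None:
        # Retention: drop samples older than the retention window.
        while self._samples and now - self._samples[0].timestamp > self.retention_s:
            self._samples.popleft()

    def latest(self, name: str) -> Optional[Sample]:
        # Scan from the newest end for the most recent sample with this name.
        for sample in reversed(self._samples):
            if sample.name == name:
                return sample
        return None
```

A deque keeps both eviction paths cheap because removal always happens at the oldest end.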

Architecture Overview#

Figure: Metrics Manager microservice architecture

Metrics flow through three main channels:

System Metrics: Telegraf agents collect CPU, memory, GPU, and NPU data every 1 second and expose them on :9273/metrics in Prometheus format
Custom Metrics: Applications push metrics via REST API (/api/v1/metrics/*) → stored in-memory → debounced persistence to Telegraf :8186/write → appear in Prometheus endpoint
Real-time Streaming: SSE clients connect to /metrics/stream → poller queries Telegraf :9273 every 500 ms → broadcasts metrics as Server-Sent Events
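
The custom-metrics channel above forwards data to a Telegraf write listener. Assuming that write uses InfluxDB Line Protocol (the text does not state this explicitly), a sample could be serialized roughly as follows. The helper name `to_line_protocol` and the example measurement are hypothetical, and the sketch does not handle special float values such as NaN.

```python
from typing import Dict, Optional, Union

FieldValue = Union[bool, int, float, str]


def _escape(text: str, chars: str) -> str:
    """Backslash-escape each character in `chars` that occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value: FieldValue) -> str:
    if isinstance(value, bool):   # test bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"        # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted, with backslashes and quotes escaped.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def to_line_protocol(measurement: str,
                     fields: Dict[str, FieldValue],
                     tags: Optional[Dict[str, str]] = None,
                     timestamp_ns: Optional[int] = None) -> str:
    """Build one line: measurement[,tag=value...] field=value[,...] [timestamp]."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape(measurement, ", ")
    tags = tags or {}
    for key in sorted(tags):      # sorted tag keys give a stable output
        head += "," + _escape(key, ",= ") + "=" + _escape(tags[key], ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field(value)
        for key, value in fields.items()
    )
    line = head + " " + body
    if timestamp_ns is not None:
        line += " " + str(timestamp_ns)
    return line
```

For example, a latency sample tagged with `source` and `model` becomes a single text line that could be sent as the body of a write request.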

Metrics are exposed on three endpoints, which differ in format and scope:

GET /metrics — Prometheus text format (custom metrics only)
GET /api/v1/metrics/latest — JSON format with latest values
GET /metrics/stream — SSE stream for live dashboards (system + custom)
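
A dashboard client that reads the SSE stream has to split it into events. The sketch below parses the standard Server-Sent Events framing (field lines, comment lines starting with a colon, and a blank line ending each event). The function name `parse_sse` is an assumption for this example, and the JSON payload in the usage note is hypothetical, since the actual event schema is not described here.

```python
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per SSE event from an iterable of text lines."""
    event: Dict[str, str] = {}
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the event; dispatch only if data was received.
            if data:
                event["data"] = "\n".join(data)
                yield event
            event, data = {}, []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if name == "data":
            data.append(value)  # multiple data lines are joined with newlines
        elif name in ("event", "id"):
            event[name] = value
```

Feeding it the lines `event: metrics`, `data: {"cpu": 12.5}`, and an empty line would yield a single event whose `data` field holds the JSON text, which the client can then decode.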

For details, see How It Works.

Supporting Resources#

License#

SPDX-License-Identifier: Apache-2.0