Metrics Manager#
Metrics Manager is an open-source, container-ready service for unified collection, ingestion, and real-time relay of system and application metrics on edge and cloud nodes. It bundles Telegraf-based hardware telemetry (CPU, memory, temperature, Intel® GPU, Intel® NPU) with a FastAPI REST surface that accepts custom metrics in JSON, InfluxDB Line Protocol, and OpenTelemetry formats, and exposes them through a Prometheus-compatible endpoint as well as a Server-Sent Events (SSE) stream suitable for live dashboards.
Key Benefits#
Multi-format Ingestion: Accepts metrics from any source — JSON, InfluxDB Line Protocol, OpenTelemetry (OTLP), or simple single-metric endpoints
Real-time Streaming: Live SSE stream for dashboards without polling overhead; browser-friendly HTML UI included
Hardware Telemetry: Automatic collection from CPU, RAM, temperature sensors, Intel® Arc GPU (qmassa), and Intel® NPU (via PMT)
Container-Ready: Single Docker image with all dependencies; runs on Kubernetes via Helm chart
Low Latency: In-memory metrics with configurable retention; debounced persistence to avoid bottlenecks
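As an illustration of the debounced persistence mentioned above, here is a minimal sketch of the general technique. It is not the service's actual code: the class name `DebouncedWriter`, the `wait_s` parameter, and the explicit `now` argument (used so the behavior is reproducible) are all assumptions.

```python
class DebouncedWriter:
    """Buffer items and flush them once no new item arrived for wait_s seconds."""

    def __init__(self, flush, wait_s=2.0):
        self.flush = flush        # callable that receives the list of pending items
        self.wait_s = wait_s      # hypothetical quiet period, in seconds
        self.pending = []
        self.last_add = None

    def add(self, item, now):
        self.pending.append(item)
        self.last_add = now       # every new item restarts the quiet period

    def poll(self, now):
        """Flush if the buffer has been quiet long enough; return True if flushed."""
        if self.pending and now - self.last_add >= self.wait_s:
            batch, self.pending = self.pending, []
            self.flush(batch)
            return True
        return False
```

A burst of writes therefore results in one persistence call rather than one per metric, which is the usual reason to debounce.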
Features#
Four REST API formats for ingestion: JSON Batch, Simple JSON, InfluxDB Line Protocol, OpenTelemetry (OTLP)
Prometheus-compatible output (/metrics and /metrics/latest endpoints)
Server-Sent Events (SSE) streaming (/metrics/stream) with automatic HTML UI in browsers
Custom metrics scripts: accepts executables dropped into /app/custom-metrics/ and runs them every 10s via Telegraf
Rate limiting: token bucket per IP, configurable burst
Structured JSON logging with correlation IDs for distributed tracing
Health checks: basic and detailed endpoints with service statistics
Memory protection: automatic eviction of oldest metrics when limit reached (default 100k metrics)
Flexible configuration: 30+ environment variables for tuning a wide range of aspects
Docker Compose & Kubernetes: production-ready compose.yaml and Helm chart included
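The per-IP token bucket named in the feature list can be sketched as follows. This shows the general algorithm only and is not taken from the service's source: the class names, the `rate` and `burst` parameters, and the explicit `now` argument are assumptions made for illustration.

```python
import time


class TokenBucket:
    """One bucket: refills at `rate` tokens per second, holds at most `burst`."""

    def __init__(self, rate, burst, now=None):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0      # each request costs one token
            return True
        return False


class PerIpLimiter:
    """Keeps an independent bucket for every client IP."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.buckets = {}

    def allow(self, ip, now=None):
        bucket = self.buckets.get(ip)
        if bucket is None:
            bucket = self.buckets[ip] = TokenBucket(self.rate, self.burst, now)
        return bucket.allow(now)
```

With `rate=1.0` and `burst=2`, a client may send two requests back to back, is then rejected, and regains one request per second; other IPs are unaffected.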
Use Cases#
Edge AI Inference: Monitor model latency, throughput, GPU/NPU utilization in real-time
System Monitoring: Collect CPU, RAM, temperature from heterogeneous edge nodes (Intel Arc GPU, NPU)
Live Dashboards: Stream metrics to ViPPET, Grafana, or a custom WebUI without polling
Multi-source Aggregation: Ingest metrics from Telegraf agents, OpenTelemetry collectors, and custom applications in one place
Telemetry Integration: Accept metrics from any framework (OTLP) or protocol (InfluxDB Line Protocol) without code changes
Key Metrics Collected#
System Metrics (via Telegraf, every 1 second):
CPU: per-core usage (user, system, idle), frequency, temperature
Memory: used/available percentage, total, used bytes
Intel Arc GPU: engine usage (compute, render, copy, video), frequency, power
Intel NPU: power, frequency, temperature, utilization, bandwidth, tile configuration, memory
Temperature: CPU package temperature via coretemp
Custom Metrics:
Accept any metric format (JSON, Influx, OTLP) via REST API
Automatic tag support (e.g., {"source": "camera1", "model": "yolov8"})
Configurable retention (default 300 seconds)
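To make the interplay of tags, retention, and the memory limit concrete, here is a small sketch of an in-memory store. It is an illustration under stated assumptions, not the service's implementation: `MetricStore`, its method names, and the keying scheme are hypothetical. Only the defaults (300 seconds, 100k metrics) and the tag example come from this page. The sketch also assumes timestamps never decrease, so insertion order equals age order.

```python
import time
from collections import OrderedDict


class MetricStore:
    def __init__(self, retention_s=300.0, max_metrics=100_000):
        self.retention_s = retention_s
        self.max_metrics = max_metrics
        self._items = OrderedDict()  # key -> (value, timestamp), oldest first

    @staticmethod
    def _key(name, tags):
        # Tags are part of the identity: same name, different tags = different series.
        return (name, tuple(sorted((tags or {}).items())))

    def put(self, name, value, tags=None, now=None):
        now = time.time() if now is None else now
        key = self._key(name, tags)
        self._items.pop(key, None)           # re-insert so the entry becomes the newest
        self._items[key] = (value, now)
        while len(self._items) > self.max_metrics:
            self._items.popitem(last=False)  # evict the oldest entry

    def latest(self, now=None):
        now = time.time() if now is None else now
        # Drop entries older than the retention window, then return the rest.
        while self._items:
            _, (_, ts) = next(iter(self._items.items()))
            if now - ts <= self.retention_s:
                break
            self._items.popitem(last=False)
        return {key: value for key, (value, _) in self._items.items()}
```

For example, `store.put("fps", 30.0, {"source": "camera1", "model": "yolov8"})` and the same call with `"camera2"` produce two separate entries, each of which disappears once it is older than the retention window.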
Architecture Overview#
Metrics flow through three main channels:
System Metrics: Telegraf agents collect CPU, memory, GPU, NPU data every 1 second and expose them on :9273/metrics in Prometheus format
Custom Metrics: Applications push metrics via REST API (/api/v1/metrics/*) → stored in-memory → debounced persistence to Telegraf :8186/write → appear in Prometheus endpoint
Real-time Streaming: SSE clients connect to /metrics/stream → poller queries Telegraf :9273 every 500ms → broadcasts metrics as Server-Sent Events
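The custom-metrics channel involves InfluxDB Line Protocol, both as an accepted input format and, presumably, as the body written to the Telegraf listener. The sketch below encodes one metric into a line following the public Line Protocol rules (escaping, `i` suffix for integers, quoted strings). The function names and the metric names in the usage note are made up for illustration. How the service itself serializes metrics is not described here.

```python
def _escape(text, chars):
    # Backslash-escape each special character for this part of the line.
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _field_value(value):
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"        # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'         # strings are double-quoted


def to_line_protocol(measurement, fields, tags=None, timestamp_ns=None):
    if not fields:
        raise ValueError("line protocol requires at least one field")
    line = _escape(measurement, ", ")
    for key, value in sorted((tags or {}).items()):
        line += "," + _escape(key, ",= ") + "=" + _escape(str(value), ",= ")
    line += " " + ",".join(
        _escape(key, ",= ") + "=" + _field_value(value)
        for key, value in fields.items()
    )
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"  # optional timestamp in nanoseconds
    return line
```

Calling `to_line_protocol("inference", {"latency_ms": 12.5}, {"source": "camera1", "model": "yolov8"})` returns `inference,model=yolov8,source=camera1 latency_ms=12.5`, a string that a Line Protocol endpoint should accept as a request body.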
All metrics are available on three endpoints:
GET /metrics: Prometheus text format (custom metrics only)
GET /api/v1/metrics/latest: JSON format with latest values
GET /metrics/stream: SSE stream for live dashboards (system + custom)
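A client of the SSE endpoint has to split the stream into events. The sketch below is a minimal parser for the standard Server-Sent Events framing (blank line ends an event, `data:` lines are joined, lines starting with `:` are comments). The function name, the `metrics` event name, and the sample payload used in the usage note are illustrative assumptions. The actual event names and payload layout of /metrics/stream are not specified on this page.

```python
def parse_sse(lines):
    """Yield one dict per event from an iterable of decoded text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:  # a blank line dispatches the event collected so far
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment or keep-alive line
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if name == "data":
            data.append(value)
        elif name == "event":
            event = value
        # Other fields (id, retry) are ignored in this sketch.
```

Fed with the lines `event: metrics`, `data: {"cpu": 12.5}`, and a blank line, the generator yields one event whose `data` string can then be decoded by the caller. Against a running instance, the lines could come from any HTTP client that streams the response of /metrics/stream line by line.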
For details, see How It Works.
Supporting Resources#
License#
Copyright (C) 2025-2026 Intel Corporation
SPDX-License-Identifier: Apache-2.0