Metrics Manager#

Metrics Manager is an open-source, container-ready service for unified collection, ingestion, and real-time relay of system and application metrics on edge and cloud nodes. It bundles Telegraf-based hardware telemetry (CPU, memory, temperature, Intel® GPU, Intel® NPU) with a FastAPI REST surface that accepts custom metrics in JSON, InfluxDB Line Protocol, and OpenTelemetry formats, and exposes them through a Prometheus-compatible endpoint as well as a Server-Sent Events (SSE) stream suitable for live dashboards.

Key Benefits#

  • Multi-format Ingestion: Accepts metrics from any source as batched JSON, single-metric (simple) JSON, InfluxDB Line Protocol, or OpenTelemetry (OTLP)

  • Real-time Streaming: Live SSE stream for dashboards without polling overhead; browser-friendly HTML UI included

  • Hardware Telemetry: Automatic collection from CPU, RAM, temperature sensors, Intel® Arc GPU (via qmassa), and Intel® NPU (via PMT)

  • Container-Ready: Single Docker image with all dependencies; runs on Kubernetes via Helm chart

  • Low Latency: In-memory metrics with configurable retention; debounced persistence to avoid bottlenecks

Features#

  • Four REST API formats for ingestion: JSON Batch, Simple JSON, InfluxDB Line Protocol, OpenTelemetry (OTLP)

  • Prometheus-compatible output (/metrics and /metrics/latest endpoints)

  • Server-Sent Events (SSE) streaming (/metrics/stream) with automatic HTML UI in browsers

  • Custom metrics scripts: executables dropped into /app/custom-metrics/ are picked up and run every 10 seconds via Telegraf

  • Rate limiting — token bucket per IP, configurable burst

  • Structured JSON logging with correlation IDs for distributed tracing

  • Health checks — basic and detailed endpoints with service statistics

  • Memory protection — automatic eviction of oldest metrics when limit reached (default 100k metrics)

  • Flexible configuration: more than 30 environment variables for tuning service behavior

  • Docker Compose & Kubernetes — production-ready compose.yaml and Helm chart included
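
The per-IP token bucket mentioned in the feature list can be pictured with a minimal sketch. This is an illustration of the general technique, not the service's actual code: the class name `PerIpRateLimiter`, the `rate` parameter, and the injectable `clock` are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class PerIpRateLimiter:
    """Illustrative token bucket keyed by client IP (hypothetical names)."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate      # tokens refilled per second (assumed parameter)
        self.burst = burst    # bucket capacity, i.e. the allowed burst size
        self.clock = clock    # injectable clock so the logic can be tested
        self._buckets: Dict[str, Tuple[float, float]] = {}  # ip -> (tokens, last update)

    def allow(self, ip: str) -> bool:
        now = self.clock()
        # A client seen for the first time starts with a full bucket.
        tokens, updated = self._buckets.get(ip, (float(self.burst), now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - updated) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0     # each accepted request spends one token
        self._buckets[ip] = (tokens, now)
        return allowed
```

With `rate=1.0` and `burst=2`, a client can send two requests back to back, is rejected on the third, and regains one request per second afterwards. Other clients are unaffected because each IP has its own bucket.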

Use Cases#

  • Edge AI Inference: Monitor model latency, throughput, GPU/NPU utilization in real-time

  • System Monitoring: Collect CPU, RAM, and temperature data from heterogeneous edge nodes, including nodes with Intel Arc GPUs and NPUs

  • Live Dashboards: Stream metrics to ViPPET, Grafana, or a custom WebUI without polling

  • Multi-source Aggregation: Ingest metrics from Telegraf agents, OpenTelemetry collectors, and custom applications in one place

  • Telemetry Integration: Accept metrics from any framework (OTLP) or protocol (InfluxDB Line Protocol) without code changes

Key Metrics Collected#

System Metrics (via Telegraf, every 1 second):

  • CPU: per-core usage (user, system, idle), frequency, temperature

  • Memory: used and available percentages, total and used bytes

  • Intel Arc GPU: engine usage (compute, render, copy, video), frequency, power

  • Intel NPU: power, frequency, temperature, utilization, bandwidth, tile configuration, memory

  • Temperature: CPU package temperature via coretemp

Custom Metrics:

  • Accepted in any supported format (JSON, InfluxDB Line Protocol, OTLP) via the REST API

  • Automatic tag support (e.g., {"source": "camera1", "model": "yolov8"})

  • Configurable retention (default 300 seconds)
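
How time-based retention and a hard cap on stored metrics can work together is easiest to see in code. The sketch below assumes both limits apply to one in-memory store. `MetricStore` and its method names are hypothetical; the defaults mirror the values quoted in this document (300 seconds, 100k metrics).

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, Optional, Tuple

# (timestamp, name, value, tags)
Sample = Tuple[float, str, float, Dict[str, str]]


class MetricStore:
    """Illustrative in-memory store with retention and oldest-first eviction."""

    def __init__(self, retention_s: float = 300.0, max_metrics: int = 100_000,
                 clock: Callable[[], float] = time.time) -> None:
        self.retention_s = retention_s
        self.max_metrics = max_metrics
        self.clock = clock
        self._samples: Deque[Sample] = deque()  # oldest sample on the left

    def __len__(self) -> int:
        return len(self._samples)

    def _expire(self, now: float) -> None:
        # Drop samples older than the retention window.
        cutoff = now - self.retention_s
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def add(self, name: str, value: float,
            tags: Optional[Dict[str, str]] = None) -> None:
        now = self.clock()
        self._expire(now)
        self._samples.append((now, name, value, dict(tags or {})))
        # Memory protection: evict the oldest samples once the cap is exceeded.
        while len(self._samples) > self.max_metrics:
            self._samples.popleft()

    def latest(self) -> Dict[str, float]:
        """Most recent value per metric name (tags ignored for brevity)."""
        self._expire(self.clock())
        result: Dict[str, float] = {}
        for _, name, value, _ in self._samples:
            result[name] = value
        return result
```

A deque keeps both eviction paths cheap because expired and surplus samples are always removed from the same end.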

Architecture Overview#

Figure: Metrics Manager microservice architecture

Metrics flow through three main channels:

  1. System Metrics: Telegraf agents collect CPU, memory, GPU, NPU data every 1 second and expose them on :9273/metrics in Prometheus format

  2. Custom Metrics: Applications push metrics via REST API (/api/v1/metrics/*) → stored in-memory → debounced persistence to Telegraf :8186/write → appear in Prometheus endpoint

  3. Real-time Streaming: SSE clients connect to /metrics/stream → poller queries Telegraf :9273 every 500 ms → broadcasts metrics as Server-Sent Events
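
The second channel forwards custom metrics to Telegraf, and InfluxDB Line Protocol is one of the accepted ingestion formats, so a small encoder may help picture what such a payload looks like. This is a sketch of the general line protocol syntax only. The function name `to_line_protocol` and the field name `value` are hypothetical, and the exact payload the service writes to :8186/write is not specified here.

```python
from typing import Mapping, Optional, Union

FieldValue = Union[float, int, bool, str]


def _escape(text: str, chars: str) -> str:
    # Backslash-escape each special character in a key, tag value, or measurement.
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value: FieldValue) -> str:
    if isinstance(value, bool):          # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"               # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'                # strings are double-quoted


def to_line_protocol(measurement: str,
                     fields: Mapping[str, FieldValue],
                     tags: Optional[Mapping[str, str]] = None,
                     timestamp_ns: Optional[int] = None) -> str:
    if not fields:
        raise ValueError("line protocol requires at least one field")
    tags = tags or {}
    head = _escape(measurement, ", ")
    for key in sorted(tags):             # sorted tag keys give a stable output
        head += "," + _escape(key, ",= ") + "=" + _escape(tags[key], ",= ")
    body = ",".join(_escape(k, ",= ") + "=" + _format_field(v)
                    for k, v in fields.items())
    line = head + " " + body
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"       # optional timestamp in nanoseconds
    return line
```

For example, a latency sample tagged with `source=camera1` and `model=yolov8` becomes a single line with the measurement name, the sorted tags, the field, and an optional timestamp.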

Metrics are exposed on three endpoints, each with its own format and scope:

  • GET /metrics — Prometheus text format (custom metrics only)

  • GET /api/v1/metrics/latest — JSON format with latest values

  • GET /metrics/stream — SSE stream for live dashboards (system + custom)
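
A client of the SSE stream reads a text stream in which each event is a group of `field: value` lines ended by a blank line. The sketch below parses that framing from any iterable of lines. The function name `parse_sse` is hypothetical, and the JSON schema of the events sent by /metrics/stream is not assumed here.

```python
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per Server-Sent Event found in a stream of text lines."""
    event = "message"            # default event type when no "event:" line is sent
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue             # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]    # a single leading space after the colon is dropped
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)   # multiple data lines are joined with newlines
```

In practice the lines would come from a streaming HTTP response. The parser itself only needs an iterable of strings, which keeps it independent of any HTTP library.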

For details, see How It Works.

Supporting Resources#

License#

Copyright (C) 2025-2026 Intel Corporation

SPDX-License-Identifier: Apache-2.0