Environment Variables#
Configuration is managed via environment variables. All variables map directly to Pydantic Settings field names (case-insensitive). The .env file is loaded automatically by docker compose up.
Core Settings#
| Variable | Default | Description |
|---|---|---|
|  |  | API server bind address |
|  |  | Metrics Manager API port |
|  |  | Service name used in logs and health checks |
|  |  | Service version reported in health endpoints |
|  |  | Deployment environment: |
|  |  | Logging level: |
|  |  | Log format: |
|  |  | Include timestamp field in log entries |
|  |  | Allowed CORS origins (comma-separated or JSON array). Set to |
|  |  | Allow credentials in CORS requests (cookies, etc.) |
|  | (unset) | Override the |
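The CORS origins setting accepts either a comma-separated list or a JSON array. The following is a minimal sketch of how such a value could be normalized into a list. `parse_cors_origins` is a hypothetical helper written for illustration, not the application's actual parser.

```python
import json


def parse_cors_origins(raw: str) -> list[str]:
    """Normalize a CORS origins value into a list of origin strings.

    Hypothetical helper: accepts a JSON array or a comma-separated list.
    """
    text = raw.strip()
    if text.startswith("["):
        # JSON array form, e.g. '["https://a.example.com"]'
        return [str(item).strip() for item in json.loads(text)]
    # Comma-separated form; empty segments are dropped.
    return [part.strip() for part in text.split(",") if part.strip()]
```

Both `CORS_ORIGINS=https://a.example.com,https://b.example.com` and the equivalent JSON array would yield the same list under this sketch.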
Metrics Storage#
| Variable | Default | Description |
|---|---|---|
|  |  | How long to keep metrics in memory (seconds). After this duration, metrics are expired and removed on next store access |
|  |  | Maximum metrics per single batch request ( |
|  |  | Maximum metrics in memory. When exceeded, oldest entries are evicted automatically |
|  |  | Directory where custom metric scripts are executed (via Telegraf |
|  |  | Debounce interval for Telegraf HTTP persistence (milliseconds). Higher values reduce HTTP calls but increase latency. Range: 10–5000 ms |
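The retention and capacity settings describe two behaviors: entries expire on the next store access once they are older than the retention window, and the oldest entries are evicted when the in-memory cap is exceeded. The sketch below illustrates that combination under stated assumptions. `MetricStore` and its parameter names are hypothetical and do not reflect the application's internal data structures.

```python
import time
from collections import OrderedDict


class MetricStore:
    """In-memory store with lazy expiry and a size cap (illustrative only)."""

    def __init__(self, retention_seconds: float, max_metrics: int, clock=time.monotonic):
        self.retention_seconds = retention_seconds
        self.max_metrics = max_metrics
        self._clock = clock
        self._items = OrderedDict()  # key -> (stored_at, value), oldest first

    def _expire(self) -> None:
        # Drop entries older than the retention window; runs on every access.
        cutoff = self._clock() - self.retention_seconds
        while self._items:
            key, (stored_at, _) = next(iter(self._items.items()))
            if stored_at > cutoff:
                break
            del self._items[key]

    def put(self, key: str, value: float) -> None:
        self._expire()
        self._items.pop(key, None)  # re-inserting moves the key to the newest slot
        self._items[key] = (self._clock(), value)
        while len(self._items) > self.max_metrics:
            self._items.popitem(last=False)  # evict the oldest entry

    def get(self, key: str):
        self._expire()
        entry = self._items.get(key)
        return None if entry is None else entry[1]

    def __len__(self) -> int:
        return len(self._items)
```

Because expiry only runs inside `put` and `get`, an idle store keeps stale entries until the next access, which matches the "removed on next store access" wording above.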
Rate Limiting#
| Variable | Default | Description |
|---|---|---|
|  |  | Enable rate limiting by client IP |
|  |  | Maximum requests per minute per client IP |
|  |  | Burst allowance (tokens available before rate limit kicks in) |
Exempt paths: /health, /api/v1/stats, and SSE endpoints are NOT rate limited.
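The "tokens" wording suggests a token-bucket scheme, so the sketch below shows how a per-IP token bucket with a per-minute refill rate, a burst capacity, and exempt paths could fit together. This is an assumption about the general technique, not the application's implementation. `TokenBucket`, `RateLimiter`, and `EXEMPT_PATHS` are hypothetical names.

```python
import time

# Paths the text lists as exempt; "/metrics/stream" is the SSE endpoint named on this page.
EXEMPT_PATHS = {"/health", "/api/v1/stats", "/metrics/stream"}


class TokenBucket:
    """Bucket that starts full and refills continuously (illustrative only)."""

    def __init__(self, requests_per_minute: int, burst: int, clock=time.monotonic):
        self.refill_per_second = requests_per_minute / 60.0
        self.capacity = float(burst)
        self.tokens = float(burst)
        self._clock = clock
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        refilled = self.tokens + (now - self._last) * self.refill_per_second
        self.tokens = min(self.capacity, refilled)
        self._last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """Keeps one bucket per client IP and skips exempt paths."""

    def __init__(self, requests_per_minute: int, burst: int, clock=time.monotonic):
        self._rpm = requests_per_minute
        self._burst = burst
        self._clock = clock
        self._buckets = {}

    def check(self, client_ip: str, path: str) -> bool:
        if path in EXEMPT_PATHS:
            return True
        bucket = self._buckets.get(client_ip)
        if bucket is None:
            bucket = TokenBucket(self._rpm, self._burst, self._clock)
            self._buckets[client_ip] = bucket
        return bucket.allow()
```

Under this model, a client can send up to the burst size back to back, after which requests are admitted at the per-minute rate.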
Performance#
| Variable | Default | Description |
|---|---|---|
|  |  | Enable gzip compression for HTTP responses >1 KB. Reduces bandwidth but increases CPU usage |
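To make the size threshold concrete, here is a minimal sketch of threshold-based response compression. It assumes ">1 KB" means more than 1024 bytes and that compression is only applied when the client advertises gzip support. `maybe_compress` and `MIN_COMPRESS_BYTES` are hypothetical names, not part of the application.

```python
import gzip

MIN_COMPRESS_BYTES = 1024  # assumption: ">1 KB" means more than 1024 bytes


def maybe_compress(body: bytes, accept_encoding: str, enabled: bool = True):
    """Return (payload, extra_headers), compressing only large bodies."""
    wants_gzip = "gzip" in accept_encoding.lower()
    if enabled and wants_gzip and len(body) > MIN_COMPRESS_BYTES:
        return gzip.compress(body), {"Content-Encoding": "gzip"}
    # Small bodies are sent as-is: the gzip header overhead and CPU cost
    # are unlikely to pay off below the threshold.
    return body, {}
```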
Security#
| Variable | Default | Description |
|---|---|---|
|  |  | Honor |
Warning: Setting this to true without a reverse proxy allows clients to spoof their IP and bypass rate limiting.
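The risk in the warning can be illustrated with a short sketch of client IP resolution. It assumes the forwarded header in question is `X-Forwarded-For` and that the first entry is taken as the client address. Both are assumptions for illustration, and `client_ip` is a hypothetical function, not the application's code.

```python
def client_ip(peer_ip: str, headers: dict, trust_forwarded_headers: bool) -> str:
    """Pick the IP used for rate limiting (illustrative only)."""
    if trust_forwarded_headers:
        # Assumption: the proxy sets X-Forwarded-For as "client, proxy1, proxy2".
        forwarded = headers.get("X-Forwarded-For", "")
        first_hop = forwarded.split(",")[0].strip()
        if first_hop:
            return first_hop
    # Without trust, only the TCP peer address is used.
    return peer_ip


# A client connecting directly can put any value in the header.
spoofed = {"X-Forwarded-For": "203.0.113.9"}
print(client_ip("198.51.100.7", spoofed, trust_forwarded_headers=False))
print(client_ip("198.51.100.7", spoofed, trust_forwarded_headers=True))
```

With trust enabled and no reverse proxy overwriting the header, every request can claim a fresh address and therefore a fresh rate-limit budget.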
Telegraf Settings (Application Configuration)#
These variables configure the Telegraf locations for the application to use:
| Variable | Default | Description |
|---|---|---|
|  |  | Path to Telegraf config inside the container (informational, not loaded by the app) |
|  |  | Telegraf Prometheus endpoint port (where SSE poller fetches system metrics) |
|  |  | Telegraf HTTP listener endpoint used to persist custom metrics in InfluxDB Line Protocol. Must be accessible from the app container |
SSE Poller Settings#
The SSE endpoint (/metrics/stream) polls the Telegraf Prometheus endpoint for each connected client independently. There is no shared queue.
| Variable | Default | Description |
|---|---|---|
|  |  | Polling interval in milliseconds (100–5000). Lower values = more frequent SSE events but higher CPU usage per client. Recommended: 500 ms |
|  |  | Telegraf Prometheus endpoint polled by SSE clients. Must be accessible from the app container |
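Because each connected client gets its own polling loop, the per-client cost scales with the polling interval. The sketch below shows the general shape of such a loop as an async generator. `sse_events` and the injected `fetch` coroutine are hypothetical and stand in for whatever the application uses to read the Telegraf Prometheus endpoint.

```python
import asyncio


async def sse_events(fetch, interval_ms: int, max_events=None):
    """Yield SSE frames for one client by polling `fetch` on a fixed interval."""
    if not 100 <= interval_ms <= 5000:
        raise ValueError("interval_ms must be between 100 and 5000")
    sent = 0
    while True:
        payload = await fetch()
        # An SSE frame is a run of "data:" lines terminated by a blank line.
        frame = "".join(f"data: {line}\n" for line in payload.splitlines()) + "\n"
        yield frame
        sent += 1
        if max_events is not None and sent >= max_events:
            break
        await asyncio.sleep(interval_ms / 1000)
```

Ten clients at a 200 ms interval would issue roughly fifty fetches per second against the endpoint in this model, which is why lower intervals raise CPU usage per client.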
Docker Compose Variables#
These variables are used ONLY by compose.yaml and are NOT read by the application:
| Variable | Default | Description |
|---|---|---|
|  |  | Host path to Telegraf config file (mounted into container) |
|  |  | Host path to additional Telegraf configs directory (mounted into container) |
|  |  | Host port mapping for Metrics Manager API |
|  |  | Host port mapping for Telegraf Prometheus endpoint |
|  |  | Host port mapping for Telegraf HTTP listener |
Example Configurations#
Development Setup#
# .env
ENVIRONMENT=development
LOG_LEVEL=DEBUG
LOG_FORMAT=text
CORS_ORIGINS=*
RATE_LIMIT_ENABLED=false
High-Throughput Scenario#
# .env
RATE_LIMIT_REQUESTS_PER_MINUTE=5000
RATE_LIMIT_BURST=500
MAX_METRICS_IN_MEMORY=500000
PROMETHEUS_POLLER_INTERVAL_MS=200
ENABLE_GZIP_COMPRESSION=true
Production with Reverse Proxy#
# .env
ENVIRONMENT=production
LOG_LEVEL=WARNING
LOG_FORMAT=json
CORS_ORIGINS=https://my-dashboard.example.com,https://grafana.example.com
TRUST_FORWARDED_HEADERS=true
METRICS_MANAGER_HOSTNAME=production-node-01
Behind Corporate Proxy#
# .env
http_proxy=http://proxy.example.com:8080
HTTP_PROXY=http://proxy.example.com:8080
https_proxy=http://proxy.example.com:8080
HTTPS_PROXY=http://proxy.example.com:8080
no_proxy=localhost,127.0.0.1
NO_PROXY=localhost,127.0.0.1
Custom Metrics Scripts#
The image ships with a Telegraf inputs.exec block that, every 10 seconds, runs every executable *.sh and *.py file it finds in /app/custom-metrics/ and feeds the stdout straight into the Prometheus endpoint on :9273. For details, see Custom Metrics Scripts.
Simply dropping a script into the directory is the easiest way to publish a metric the service does not collect by default.
How the Directory is Wired#
The directory is created by the Dockerfile (/app/custom-metrics)
In compose.yaml it is mounted as a named volume custom-metrics:, so scripts survive container restarts
telegraf.conf ships this block (do not edit unless you know what you are doing):
[[inputs.exec]]
commands = ["/bin/sh -c 'for f in /app/custom-metrics/*.sh /app/custom-metrics/*.py; do [ -f \"$f\" ] && [ -x \"$f\" ] && \"$f\"; done 2>/dev/null; true'"]
timeout = "5s"
data_format = "influx"
interval = "10s"
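The shell loop in the block above only runs files that match `*.sh` or `*.py`, are regular files, and carry the executable bit. As a pre-flight check, the same selection rule can be sketched in Python. `runnable_scripts` is a hypothetical helper, not something shipped in the image.

```python
import os
from pathlib import Path


def runnable_scripts(directory: str) -> list[Path]:
    """List the *.sh and *.py files that the exec loop would actually run."""
    root = Path(directory)
    # Same order as the shell loop: all *.sh matches first, then all *.py matches.
    candidates = sorted(root.glob("*.sh")) + sorted(root.glob("*.py"))
    # Mirrors the `[ -f "$f" ] && [ -x "$f" ]` test in the command.
    return [p for p in candidates if p.is_file() and os.access(p, os.X_OK)]
```

A script that is missing from this list, for example because `chmod +x` was skipped, is silently ignored by the loop.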
Script Requirements#
Each script must:
Be executable (chmod +x)
Print InfluxDB Line Protocol on stdout, one metric per line
Finish within 5 seconds (Telegraf kills longer runs)
Produce clean output (no debug prints, banners, or stderr)
Handle errors gracefully (non-zero exit codes do not crash Telegraf)
Example: Fan RPM Metric#
See Custom Metrics Scripts for a complete end-to-end example.
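For orientation, the sketch below shows the general shape of a Python script that meets the requirements: it prints one InfluxDB Line Protocol line on stdout, writes nothing else, and exits cleanly if the reading fails. The measurement name `custom_load`, the tag, and the helper names are hypothetical, and the 1-minute load average is used only as a stand-in for a real sensor reading.

```python
#!/usr/bin/env python3
import os


def _escape(value: str) -> str:
    # Commas, equals signs, and spaces are backslash-escaped in tag and field keys.
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def _format_field(value) -> str:
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix in line protocol
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement: str, tags: dict, fields: dict) -> str:
    """Build one line: measurement[,tag=value...] field=value[,field=value...]"""
    tag_part = "".join(
        f",{_escape(k)}={_escape(str(v))}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(f"{_escape(k)}={_format_field(v)}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part}"


def main() -> int:
    try:
        load1 = os.getloadavg()[0]  # stand-in reading; replace with a real sensor
    except OSError:
        return 0  # emit nothing rather than garbage on failure
    print(to_line("custom_load", {"source": "example"}, {"load1": load1}))
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```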
Optional Components#
The Metrics Manager image includes optional components that are bundled but not active by default:
qmmd (Prometheus GPU Exporter)#
What it is: A lightweight Prometheus exporter for Intel Arc GPUs. It reads GPU metrics from sysfs and exposes them in Prometheus format.
Current Status: Bundled in the image but NOT started by default.
Why not enabled: The default Metrics Manager already collects GPU metrics via qmassa_reader.py and Telegraf’s inputs.execd. Using both would be redundant.
When to enable: If you want a dedicated GPU metrics exporter that outputs to a separate Prometheus port (typically :9100 or similar) without going through Telegraf.
How to enable: Edit supervisord.conf or extend it in your downstream image:
[program:qmmd]
command=/usr/local/bin/qmmd
autostart=true
autorestart=true
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stderr
stderr_logfile_maxbytes=0
priority=40
License: Prometheus exporter for Intel® GPUs, published on crates.io under the MIT license. See https://crates.io/crates/qmmd.
Optional Services & Extending supervisord#
The container’s bundled supervisord.conf contains an [include] section:
[include]
files=/etc/supervisor/conf.d/*.conf
Any *.conf file dropped into /etc/supervisor/conf.d/ is picked up automatically at supervisord start, so you do not need to fork or edit supervisord.conf to add your own programs.
Pattern: Add a Program in Your Downstream Image#
FROM intel/metrics-manager:2026.1.0
# Drop additional supervisord program units into the include directory.
COPY my-extra-service.conf /etc/supervisor/conf.d/my-extra-service.conf
my-extra-service.conf:
[program:my-extra-service]
command=/usr/local/bin/my-extra-service --flag
autostart=true
autorestart=true
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stderr
stderr_logfile_maxbytes=0
priority=40 ; >30 = starts after metrics-manager (priority 20)
Verify the Extra Unit is Running#
docker exec metrics-manager supervisorctl -c /etc/supervisor/supervisord.conf status
# my-extra-service RUNNING pid 47, uptime 0:01:23
Custom Telegraf Configuration#
Mount Custom Config#
# compose.yaml
services:
metrics-manager:
volumes:
- ./my-telegraf.conf:/etc/telegraf/telegraf.conf:ro
Or via environment:
TELEGRAF_CONFIG=./my-telegraf.conf docker compose up
Additional Config Directory#
Drop additional .conf files in telegraf.d/:
mkdir telegraf.d
echo '[[inputs.exec]]
commands = ["my-custom-script.sh"]
interval = "10s"
data_format = "json"
' > telegraf.d/custom-input.conf
Example: Disable GPU or NPU Metrics#
The default telegraf.conf registers both GPU (qmassa) and NPU readers as [[inputs.execd]] blocks. To disable:
# my-telegraf.conf (omit GPU/NPU inputs)
[agent]
interval = "1s"
[[outputs.prometheus_client]]
listen = ":9273"
[[inputs.cpu]]
[[inputs.mem]]
# GPU and NPU inputs omitted
Structured Logging#
Logs are output in JSON format by default for easy parsing by log aggregators:
{
"timestamp": "2026-03-04T10:15:30.123456Z",
"level": "INFO",
"logger": "app.routes",
"message": "Accepted metrics via batch",
"correlation_id": "abc-123-def",
"extra": {"count": 10}
}
Switch to human-readable format for development:
LOG_FORMAT=text docker compose up
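A log entry of the shape shown above can be produced with the standard `logging` module. The formatter below is a minimal sketch that emits the same field names. `JsonFormatter` is a hypothetical class, and reading `correlation_id` and `extra` from record attributes is an assumption about how such values might be attached.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object (illustrative only)."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="microseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional fields are included only when the caller attached them.
        for key in ("correlation_id", "extra"):
            value = getattr(record, key, None)
            if value is not None:
                entry[key] = value
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app.routes")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info(
    "Accepted metrics via batch",
    extra={"correlation_id": "abc-123-def", "extra": {"count": 10}},
)
```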
Correlation IDs#
Every request is assigned a correlation ID for distributed tracing. Pass your own via header or receive an auto-generated UUID:
curl -H "X-Correlation-ID: my-trace-123" http://localhost:9090/api/v1/metrics
# Response header: X-Correlation-ID: my-trace-123
Correlation IDs appear in all log entries for request tracing.
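The pass-through-or-generate behavior can be sketched in a few lines. `resolve_correlation_id` is a hypothetical function that assumes header names are matched case-insensitively, as is usual for HTTP. It is not the application's middleware.

```python
import uuid

CORRELATION_HEADER = "X-Correlation-ID"


def resolve_correlation_id(request_headers: dict) -> str:
    """Return the caller's correlation ID, or a fresh UUID if none was sent."""
    for name, value in request_headers.items():
        if name.lower() == CORRELATION_HEADER.lower() and value.strip():
            return value.strip()
    return str(uuid.uuid4())
```

The returned value would then be echoed in the response header and attached to every log entry for that request.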
Supporting Resources#
License#
Copyright (C) 2025-2026 Intel Corporation
SPDX-License-Identifier: Apache-2.0