Environment Variables#

Configuration is managed via environment variables. All variables map directly to Pydantic Settings field names (case-insensitive). The .env file is loaded automatically by docker compose up.

Core Settings#

Variable	Default	Description
`HOST`	`0.0.0.0`	API server bind address
`METRICS_PORT`	`9090`	Metrics Manager API port
`SERVICE_NAME`	`metrics-manager`	Service name used in logs and health checks
`SERVICE_VERSION`	`2026.1.0`	Service version reported in health endpoints
`ENVIRONMENT`	`production`	Deployment environment: `development`, `staging`, `production`. Production disables `/docs` and `/redoc` Swagger endpoints
`LOG_LEVEL`	`INFO`	Logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`
`LOG_FORMAT`	`json`	Log format: `json` (structured, for log aggregators) or `text` (human-readable)
`LOG_INCLUDE_TIMESTAMP`	`true`	Include timestamp field in log entries
`CORS_ORIGINS`	`*`	Allowed CORS origins (comma-separated or JSON array). Set to `http://localhost:3000,http://my-dashboard:3000` to restrict
`CORS_ALLOW_CREDENTIALS`	`false`	Allow credentials in CORS requests (cookies, etc.)
`METRICS_MANAGER_HOSTNAME`	(unset)	Override the `host=` tag stamped on every metric (Telegraf, qmassa_reader.py, npu_reader.py). Unset = use kernel hostname. Set to a stable value (e.g., `lab-node-42`) to keep Grafana dashboards stable across reboots

Metrics Storage#

Variable	Default	Description
`METRICS_RETENTION_SECONDS`	`300`	How long to keep metrics in memory (seconds). After this duration, metrics are expired and removed on next store access
`MAX_METRICS_BATCH_SIZE`	`1000`	Maximum metrics per single batch request (`POST /api/v1/metrics`)
`MAX_METRICS_IN_MEMORY`	`100000`	Maximum metrics in memory. When exceeded, oldest entries are evicted automatically
`CUSTOM_METRICS_DIR`	`/app/custom-metrics`	Directory where custom metric scripts are executed (via Telegraf `inputs.exec`). Do not change inside container
`FILE_PERSIST_DEBOUNCE_MS`	`100`	Debounce interval for Telegraf HTTP persistence (milliseconds). Higher values reduce HTTP calls but increase latency. Range: 10–5000 ms

Rate Limiting#

Variable	Default	Description
`RATE_LIMIT_ENABLED`	`true`	Enable rate limiting by client IP
`RATE_LIMIT_REQUESTS_PER_MINUTE`	`1000`	Maximum requests per minute per client IP
`RATE_LIMIT_BURST`	`100`	Burst allowance (tokens available before rate limit kicks in)

Exempt paths: /health, /api/v1/stats, and SSE endpoints are NOT rate limited.

Performance#

Variable	Default	Description
`ENABLE_GZIP_COMPRESSION`	`true`	Enable gzip compression for HTTP responses >1 KB. Reduces bandwidth but increases CPU usage

Security#

Variable	Default	Description
`TRUST_FORWARDED_HEADERS`	`false`	Honor `X-Forwarded-For` / `X-Real-IP` headers for client IP detection. Set to `true` ONLY when running behind a trusted reverse proxy (Nginx, Traefik, etc.)

Warning: Setting this to true without a reverse proxy allows clients to spoof their IP and bypass rate limiting.

Telegraf Settings (Application Configuration)#

These variables configure the Telegraf locations for the application to use:

Variable	Default	Description
`TELEGRAF_CONFIG_PATH`	`/etc/telegraf/telegraf.conf`	Path to Telegraf config inside the container (informational, not loaded by the app)
`TELEGRAF_PORT`	`9273`	Telegraf Prometheus endpoint port (where SSE poller fetches system metrics)
`TELEGRAF_HTTP_ENDPOINT`	`http://localhost:8186/write`	Telegraf HTTP listener endpoint used to persist custom metrics in InfluxDB Line Protocol. Must be accessible from the app container

SSE Poller Settings#

The SSE endpoint (/metrics/stream) polls the Telegraf Prometheus endpoint for each connected client independently. There is no shared queue.

Variable	Default	Description
`PROMETHEUS_POLLER_INTERVAL_MS`	`500`	Polling interval in milliseconds (100–5000). Lower values = more frequent SSE events but higher CPU usage per client. Recommended: 500 ms
`PROMETHEUS_TELEGRAF_ENDPOINT`	`http://localhost:9273`	Telegraf Prometheus endpoint polled by SSE clients. Must be accessible from the app container

Docker Compose Variables#

These variables are used ONLY by compose.yaml and are NOT read by the application:

Variable	Default	Description
`TELEGRAF_CONFIG`	`./telegraf.conf`	Host path to Telegraf config file (mounted into container)
`TELEGRAF_CONFIG_DIR`	`./telegraf.d`	Host path to additional Telegraf configs directory (mounted into container)
`HOST_METRICS_PORT`	`9090`	Host port mapping for Metrics Manager API
`HOST_TELEGRAF_PORT`	`9273`	Host port mapping for Telegraf Prometheus endpoint
`HOST_TELEGRAF_HTTP_PORT`	`8186`	Host port mapping for Telegraf HTTP listener

Example Configurations#

Development Setup#

# .env
ENVIRONMENT=development
LOG_LEVEL=DEBUG
LOG_FORMAT=text
CORS_ORIGINS=*
RATE_LIMIT_ENABLED=false

High-Throughput Scenario#

# .env
RATE_LIMIT_REQUESTS_PER_MINUTE=5000
RATE_LIMIT_BURST=500
MAX_METRICS_IN_MEMORY=500000
PROMETHEUS_POLLER_INTERVAL_MS=200
ENABLE_GZIP_COMPRESSION=true

Production with Reverse Proxy#

# .env
ENVIRONMENT=production
LOG_LEVEL=WARNING
LOG_FORMAT=json
CORS_ORIGINS=https://my-dashboard.example.com,https://grafana.example.com
TRUST_FORWARDED_HEADERS=true
METRICS_MANAGER_HOSTNAME=production-node-01

Behind Corporate Proxy#

# .env
http_proxy=http://proxy.example.com:8080
HTTP_PROXY=http://proxy.example.com:8080
https_proxy=http://proxy.example.com:8080
HTTPS_PROXY=http://proxy.example.com:8080
no_proxy=localhost,127.0.0.1
NO_PROXY=localhost,127.0.0.1

Custom Metrics Scripts#

The image ships with a Telegraf inputs.exec block that, every 10 seconds, runs every executable *.sh and *.py file it finds in /app/custom-metrics/ and feeds the stdout straight into the Prometheus endpoint on :9273. For details, see Custom Metrics Scripts.

Simply dropping a script into the directory is the easiest way to publish a metric the service does not collect by default.

How the Directory is Wired#

The directory is created by the Dockerfile (/app/custom-metrics)
In compose.yaml it is mounted as a named volume custom-metrics:, so scripts survive container restarts
telegraf.conf ships this block (do not edit unless you know what you are doing):

[[inputs.exec]]
  commands = ["/bin/sh -c 'for f in /app/custom-metrics/*.sh /app/custom-metrics/*.py; do [ -f \"$f\" ] && [ -x \"$f\" ] && \"$f\"; done 2>/dev/null; true'"]
  timeout = "5s"
  data_format = "influx"
  interval = "10s"

Script Requirements#

Each script must:

Be executable (chmod +x)
Print InfluxDB Line Protocol on stdout, one metric per line
Finish within 5 seconds (Telegraf kills longer runs)
Produce clean output (no debug prints, banners, or stderr)
Handle errors gracefully (non-zero exit codes do not crash Telegraf)

Example: Fan RPM Metric#

See Custom Metrics Scripts for a complete end-to-end example.

Optional Components#

The Metrics Manager image includes optional components that are bundled but not active by default:

qmmd (Prometheus GPU Exporter)#

What it is: A lightweight Prometheus exporter for Intel Arc GPUs. It reads GPU metrics from sysfs and exposes them in Prometheus format.

Current Status: Bundled in the image but NOT started by default.

Why not enabled: The default Metrics Manager already collects GPU metrics via qmassa_reader.py and Telegraf’s inputs.execd. Using both would be redundant.

When to enable: If you want a dedicated GPU metrics exporter that outputs to a separate Prometheus port (typically :9100 or similar) without going through Telegraf.

How to enable: Edit supervisord.conf or extend it in your downstream image:

[program:qmmd]
command=/usr/local/bin/qmmd
autostart=true
autorestart=true
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stderr
stderr_logfile_maxbytes=0
priority=40

License: Prometheus exporter for Intel® GPUs, published on crates.io under the MIT license. See https://crates.io/crates/qmmd.

Optional Services & Extending supervisord#

The container’s bundled supervisord.conf contains an [include] section:

[include]
files=/etc/supervisor/conf.d/*.conf

Any *.conf file dropped into /etc/supervisor/conf.d/ is picked up automatically at supervisord start, so you do not need to fork or edit supervisord.conf to add your own programs.

Pattern: Add a Program in Your Downstream Image#

FROM intel/metrics-manager:2026.1.0

# Drop additional supervisord program units into the include directory.
COPY my-extra-service.conf /etc/supervisor/conf.d/my-extra-service.conf

my-extra-service.conf:

[program:my-extra-service]
command=/usr/local/bin/my-extra-service --flag
autostart=true
autorestart=true
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
stderr_logfile=/dev/stderr
stderr_logfile_maxbytes=0
priority=40   ; >30 = starts after metrics-manager (priority 20)

Verify the Extra Unit is Running#

docker exec metrics-manager supervisorctl -c /etc/supervisor/supervisord.conf status
# my-extra-service                 RUNNING   pid 47, uptime 0:01:23

Custom Telegraf Configuration#

Mount Custom Config#

# compose.yaml
services:
  metrics-manager:
    volumes:
      - ./my-telegraf.conf:/etc/telegraf/telegraf.conf:ro

Or via environment:

TELEGRAF_CONFIG=./my-telegraf.conf docker compose up

Additional Config Directory#

Drop additional .conf files in telegraf.d/:

mkdir telegraf.d
echo '[[inputs.exec]]
  commands = ["my-custom-script.sh"]
  interval = "10s"
  data_format = "json"
' > telegraf.d/custom-input.conf

Example: Disable GPU or NPU Metrics#

The default telegraf.conf registers both GPU (qmassa) and NPU readers as [[inputs.execd]] blocks. To disable:

# my-telegraf.conf (omit GPU/NPU inputs)
[agent]
  interval = "1s"

[[outputs.prometheus_client]]
  listen = ":9273"

[[inputs.cpu]]
[[inputs.mem]]
# GPU and NPU inputs omitted

Structured Logging#

Logs are output in JSON format by default for easy parsing by log aggregators:

{
  "timestamp": "2026-03-04T10:15:30.123456Z",
  "level": "INFO",
  "logger": "app.routes",
  "message": "Accepted metrics via batch",
  "correlation_id": "abc-123-def",
  "extra": {"count": 10}
}

Switch to human-readable format for development:

LOG_FORMAT=text docker compose up

Correlation IDs#

Every request is assigned a correlation ID for distributed tracing. Pass your own via header or receive an auto-generated UUID:

curl -H "X-Correlation-ID: my-trace-123" http://localhost:9090/api/v1/metrics
# Response header: X-Correlation-ID: my-trace-123

Correlation IDs appear in all log entries for request tracing.

Supporting Resources#

License#

SPDX-License-Identifier: Apache-2.0