Edge Node Platform Observability Agent#

Background#

This document provides high-level design and implementation guidelines. Refer to Platform Observability Agent in Edge Node Agents GitHub repository for implementation details.

Target Audience#

The target audience for this document is:

Developers interested in contributing to the implementation of the Platform Observability Agent.
Administrators and System Architects interested in the architecture, design and functionality of the Platform Observability Agent.

Overview#

Platform Observability Agent is part of the Open Edge Platform’s Edge Node Zero Touch Provisioning. It is installed, configured and automatically executed at Provisioning time.

The Platform Observability Agent (POA) is a set of four observability agents deployed as individual systemd services alongside the other Edge Node Agents on the Edge Node. These services are:

platform-observability-logging, which is a FluentBit* service for scraping logs from all Edge Node Agents except health check logs.
platform-observability-health-check, which is a FluentBit service for scraping health check logs from Edge Node Agents.
platform-observability-metrics, which is a Telegraf* service for scraping metrics from the Edge Node Agents as well as from the Edge Node HW.
platform-observability-collector, which is a OpenTelemetry* Collector service for gathering and forwarding all logs and hardware metrics to the Edge Orchestrator.

Architecture Diagram#

The Platform Observability Agent follows the architecture and design principles set out in High-Level Architecture

High-Level Architecture of the Platform Observability Agent — Figure 1: High-Level Architecture of Platform Observability Agent#

Key Components#

The Platform Observability Agent is a system daemon packaged as a .deb or .rpm package (depending on target Operating System).
The Platform Observability Agent requires a designated JWT token
FluentBit service with config at /etc/fluent-bit/fluent-bit.conf
Health check service with config at /etc/health-check/health-check.conf
Telegraf service with config at /etc/telegraf/telegraf.d/telegraf.conf
OpenTelemetry service with config at /etc/otelcol/otelcol.yaml

Data Flow#

The data flow of the Platform Observability Agent can be broken down into multiple concepts called out in Workflow Stages section.

Workflow Stages#

Log Collection configuration:
- FluentBit configuration for platform-observability-logging
  - Inputs
    
    systemd input plugin configured to capture Kubernetes Engine service logs
    
    systemd input plugin configured to capture hardware-discovery-agent service logs
    
    systemd input plugin configured to capture cluster-agent service logs
    
    systemd input plugin configured to capture node-agent service logs
    
    systemd input plugin configured to capture platform-telemetry-agent service logs
    
    systemd input plugin configured to capture platform-update-agent service logs
    
    Tail input plugin configured to capture INBC logs from /var/lib/dispatcher/upload/*
    
    systemd input plugin configured to capture RKE-server service logs
    
    systemd input plugin configured to capture RKE-system-agent service logs
    
    systemd input plugin configured to capture platform-observability-metrics service logs
    
    systemd input plugin configured to capture platform-observability-collector service logs
  - Outputs
    
    Forward output plugin configured to send logs to the log socket file provided by the OpenTelemetry Collector service.
  - Options
    
    Buffering using host file system enabled
- Fluent Bit configuration for platform-observability-health-check
  - Inputs
    
    Exec input plugin configured to capture Kubernetes Engine service status
    
    Exec input plugin configured to capture hardware-discovery-agent service status
    
    Exec input plugin configured to capture cluster-agent service status
    
    Exec input plugin configured to capture node-agent service status
    
    Exec input plugin configured to capture platform-telemetry-agent service status
    
    Exec input plugin configured to capture platform-update-agent service status
    
    Exec input plugin configured to capture RKE-server service status
    
    Exec input plugin configured to capture RKE-system-agent service status
  - Outputs
    
    Forward output plugin configured to send logs to the log socket file provided by the OpenTelemetry Collector service.
    
    Options
    
    Buffering using host file system enabled
- OpenTelemetry Collector configuration
  - Receivers
    
    Fluentforward input plugin configured to receive logs from the platform-observability-logging systemd service.
    
    Fluentforward input plugin configured to receive system logs from the cluster fluent bit service.
    
    Fluentforward input plugin configured to receive application logs from the cluster fluent bit service.
    
    Fluentforward input plugin configured to receive container logs from the cluster fluent bit service.
    
    Processors - Memory limiter processor plugin configures the maximum memory usage for the collector service.
    
    Batch processor plugin configures the settings for batching received logs in the collector before sending.
    
    Attributes processor plugin applies the edge node UUID as a tag onto the logs before the collector sends them to the Edge Orchestrator.
  - Exporters
    
    Otlphttp exporter plugin configured to send platform-observability-logging service logs to the Edge Orchestrator log endpoint.
    
    Otlphttp exporter plugin configured to send system logs from the cluster fluentbit service to the Edge Orchestrator log endpoint.
    
    Otlphttp exporter plugin configured to send application logs from the cluster fluentbit service to the Edge Orchestrator log endpoint.
    
    Otlphttp exporter plugin configured to send container logs from the cluster fluentbit service to the Edge Orchestrator log endpoint.
  - Extensions
    
    Bearer token authentication extension plugin applies the JWT token as a HTTP header to the collector output to Edge Orchestrator.
```
        flowchart TD
   I1[KE service] -->|logs| Collector
   I2[Hardware Discovery Agent] -->|logs| Collector
   I3[Cluster Agent] -->|logs| Collector
   I4[Node Agent] -->|logs| Collector
   I5[Vault Agent] -->|logs| Collector
   I6[Platform Update Agent] -->|logs| Collector
   I7[INBC] -->|logs| Collector
   I8[RKE System Agent] -->|logs| Collector
   I9[RKE Server] -->|logs| Collector
   I10[Telegraf] -->|logs| Collector
   I11[Otel Collector] -->|logs| Collector
   I12[Telemetry Agent] -->|logs| Collector
   I13[AppArmour] -->|logs| Collector
   I14[Process] -->|logs| Collector
   I15[EN Users] -->|logs| Collector
   I16[Firewall] -->|logs| Collector
   I17[Host] -->|logs| Collector
   I18[OS] -->|logs| Collector
   Collector --> Routing
   Routing --> Orchestrator
    
```

Figure 2: Log Collection configuration

Metrics Collection configuration:
- Telegraf configuration
  - Inputs
    
    CPU input plugin enables gathering of CPU related metrics from the HW.
    
    Memory input plugin enables gathering of memory related metrics from the HW.
    
    Disk input plugin enables gathering of disk related metrics from the HW.
    
    Disk IO input plugin enables gathering of diskio related metrics from the HW.
    
    Net input plugin enables gathering of network related metrics from the HW.
    
    Temp input plugin enables gathering of temperature related metrics from the HW.
    
    IPMI sensor input plugin enables gathering of IPMI related metrics from the HW using the ipmitool. Disabled by default.
    
    SMART input plugin enables gathering of storage device related metrics from the HW using smartctl. Disabled by default.
    
    Intel powerstat input plugin enables gathering of power related metrics from Intel based platforms. Disabled by default.
    
    RAS input plugin enables gathering of error metrics from the RASDaemon in the HW. Disabled by default.
  - Outputs
    
    OpenTelemetry output plugin configured to send metrics to the metrics socket file provided by the OpenTelemetry Collector service.
- OpenTelemetery Collector configuration
  - Receivers
    
    Otlp input plugin configured to receive HW metrics from Telegraf as well as metrics from the Edge Node Agents.
  - Processors
    
    Memory limiter processor plugin configures the maximum memory usage for the collector service.
    
    Batch processor plugin configures the settings for batching received metrics in the collector before sending.
    
    Attributes processor plugin applies the edge node UUID as a tag onto the metrics before the collector sends them to the Edge Orchestrator.
  - Exporters
    
    Otlphttp exporter plugin configured to send metrics to the Edge Orchestrator metrics endpoint.
  - Extensions
    
    Bearer token authentication extension plugin applies the JWT token as a HTTP header to the collector output to Orchestrator.
```
        flowchart TD
   I1[Telegraf] -->|metrics| Collector
   I2[Node Agent] -->|metrics| Collector
   I3[Cluster Agent] -->|metrics| Collector
   I4[Hardware Agent] -->|metrics| Collector
   I5[Platform Update Agent] -->|metrics| Collector
   Collector --> Routing
   Routing --> Orchestrator
    
```

Figure 3: Metrics Collection configuration

Extensibility#

The Platform Observability Agent functionality can be extended by making source code changes.

Deployment#

The Platform Observability Agent is deployed as a set of system daemons via installation of a .deb package during the provisioning or .rpm package as part of the Edge Microvisor Toolkit.

The POA installs four services, platform-observability-logging, platform-observability-health-check, platform-observability-metrics and platform-observability-collector, when deployed on to the Edge Node.

Each service file is stored in the /lib/systemd/system/ folder as <service_name>.service.

The config file for the platform-observability-logging service is stored in /etc/fluent-bit/fluent-bit.conf.

The config file for the platform-observability-health-check service is stored in /etc/health-check/health-check.conf.

The config file for the platform-observability-metrics service is stored in /etc/telegraf/telegraf.conf.

The config file for the platform-observability-collector service is stored in /etc/otelcol/otelcol.yaml. Logs for each service can be viewed using the journalctl tool.

Technology Stack#

Below sections provide an overview of various aspects of the Platform Observability Agent’s technology stack.

Implementation#

The Platform Observability Agent is implemented as a set of observability services configured for collection of desired logs and metrics.

System Diagram#

Platform Observability Agent depends on Edge Orchestrator endpoints:

Edge Orchestrator central log collector service endpoint.
Edge Orchestrator central metrics collector service endpoint.

Platform Observability Agent external telemetry collectors:

Official fluent-bit Debian package.
Official telegraf Debian package.
Official Otel Collector Debian package.

Figure 4: Platform Observability Agent system diagram#

Integrations#

Platform Observability Agent does not expose an API, it exposes metrics to the endpoints of the Edge Orchestrator.

Platform Observability Agent integrates the 3rd party metric collectors - FluentBit, Telegraf, OpenTelemetry collector.

Security#

Security Policies#

Platform Observability Agent adheres to Edge Node Agents High-Level Architecture security design principle.

Auditing#

Platform Observability Agent adheres to Edge Node Agents High-Level Architecture observability design principle.

Upgrades#

Platform Observability Agent adheres to Edge Node Agents High-Level Architecture upgrade design principle.