# Multimodal Predictive Maintenance Blueprint
> **Blueprint Series** — Edge AI Predictive Maintenance for Critical Infrastructure\
> Multimodal Sensor Fusion + Multi-Agent Reasoning + Edge Vision with Intel OpenVINO

Multimodal Predictive Maintenance is a multimodal extension of the Predictive Maintenance Pipeline, where visual image data and chemical sensor readings are fused at inference time to produce a single, more reliable classification result. The gas detection use case is used as the reference implementation throughout this document.

The three-unit architecture, agent reasoning layer, prompt system, configuration hierarchy, and deployment principles documented in predictive_maintenance_pipeline.md remain unchanged. This document focuses exclusively on what is new or different in the multimodal variant: the sensor modality, the fusion mechanism, and the changes to the inference and data layers that support them.

## Multimodality

A single sensor modality can be misleading. A camera sees smoke but cannot distinguish perfume from combustible gas. A chemical sensor detects elevated MQ2 readings but cannot localise the source. Fusing independent signals reduces false positives and improves classification confidence under ambiguous conditions — exactly the failure modes that matter most in safety-critical industrial environments.

The multimodal pipeline combines:

- **Image classifier** — captures the visual spectral signature of the gas cloud
- **Sensor MLP** — distils seven MQ-series electrochemical sensor readings into a class probability vector
- **Late fusion** — produces a single weighted-average classification per sample

Because the two modalities are physically independent (a camera fault does not affect sensors, and sensor saturation does not affect image quality), the fused prediction degrades gracefully under partial failure.

## The Dataset and Gas Classes

The gas detection dataset contains two aligned data sources that are the foundation of multimodal inference.
### Image Data

Images are organised into four class directories and a flat validation split:

```
datasets/gas_detection/images/
├── Mixture/     # Multiple gases present simultaneously
├── NoGas/       # Baseline / clean air readings
├── Perfume/     # Aromatic compound — non-toxic reference class
├── Smoke/       # Combustion products — elevated hazard
└── val/         # Mixed validation set (all four classes)
```

Image filenames encode their identity, e.g. `586_Perfume.png`. The stem (`586_Perfume`) is the key used to look up the corresponding sensor row in the CSV.

| Class   | Description                                                  |
|---------|--------------------------------------------------------------|
| Mixture | Co-presence of multiple gas types — complex spectral pattern |
| NoGas   | Clean air baseline — used for false-positive calibration     |
| Perfume | Aromatic compound reference — non-toxic, narrow spectrum     |
| Smoke   | Combustion-product signature — safety-critical class         |

### Sensor Data

Chemical sensor measurements are stored in a single CSV:

```
datasets/gas_detection/sensor_data/Gas_Sensors_Measurements.csv
```

Each row corresponds to one image sample and records the raw ADC readings from seven MQ-series electrochemical sensors:

| Column                     | Description                                |
|----------------------------|--------------------------------------------|
| `Serial Number`            | Row index                                  |
| `MQ2`                      | Combustible gas / LPG / propane / hydrogen |
| `MQ3`                      | Alcohol / ethanol / benzene                |
| `MQ5`                      | LPG / natural gas / coal gas               |
| `MQ6`                      | LPG / butane / propane                     |
| `MQ7`                      | Carbon monoxide                            |
| `MQ8`                      | Hydrogen                                   |
| `MQ135`                    | Air quality / ammonia / sulfide / benzene  |
| `Gas`                      | Ground-truth class label                   |
| `Corresponding Image Name` | Image stem used to join with image files   |

The join key between sensor rows and image files is the `Corresponding Image Name` column, which matches the stem of the image filename (without extension). This alignment enables per-sample fusion of image and sensor predictions at inference time.
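
The join described above can be sketched with the standard library alone. This is a minimal illustration, not the pipeline's actual loader: the helper names `load_sensor_lookup` and `pair_images_with_sensors` are hypothetical, and the CSV rows are made-up values that only follow the column layout from the table.

```python
import csv
import io
from pathlib import Path

# Column names taken from the sensor table; order matches the MLP input vector.
SENSOR_COLS = ["MQ2", "MQ3", "MQ5", "MQ6", "MQ7", "MQ8", "MQ135"]


def load_sensor_lookup(csv_file):
    """Map image stem -> list of seven MQ readings (hypothetical helper)."""
    lookup = {}
    for row in csv.DictReader(csv_file):
        stem = row["Corresponding Image Name"].strip()
        lookup[stem] = [float(row[col]) for col in SENSOR_COLS]
    return lookup


def pair_images_with_sensors(image_paths, lookup):
    """Split images into those with a sensor row and those without."""
    matched, unmatched = {}, []
    for path in image_paths:
        stem = Path(path).stem  # "586_Perfume.png" -> "586_Perfume"
        if stem in lookup:
            matched[stem] = lookup[stem]
        else:
            unmatched.append(stem)
    return matched, unmatched


# In-memory stand-in for Gas_Sensors_Measurements.csv; the readings are invented.
SAMPLE_CSV = """Serial Number,MQ2,MQ3,MQ5,MQ6,MQ7,MQ8,MQ135,Gas,Corresponding Image Name
0,555,515,377,338,666,451,416,NoGas,0_NoGas
586,618,420,410,395,580,600,350,Perfume,586_Perfume
"""

lookup = load_sensor_lookup(io.StringIO(SAMPLE_CSV))
matched, unmatched = pair_images_with_sensors(
    ["val/586_Perfume.png", "val/999_Smoke.png"], lookup
)
print(matched)    # images that have an aligned sensor row
print(unmatched)  # images that would need a fallback at fusion time
```

Keeping the unmatched stems in a separate list makes missing sensor rows visible, so they can be handled explicitly rather than silently dropped.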
## The AI Models

### Image Classification Model — YOLOv8 Classifier on OpenVINO

A YOLOv8 classification model (not a detection model) is trained on the gas detection image dataset. Unlike the detection variant described in [predictive_maintenance_pipeline.md](predictive_maintenance_pipeline.md#the-ai-models), this model assigns a single class label per image rather than producing bounding boxes.

- **Task:** Image classification (4-class softmax output)
- **Input size:** 640×640
- **Output:** Class probability vector of length 4
- **Model path:** `models/ov_models/gas_detection/image/best.xml`
- **Inference device:** GPU (direct OpenVINO, not via DL Streamer)
- **Pre-processing:** Resize → float32 normalization [0,1] → HWC→NCHW transpose

> **Note on inference backend:** Because DL Streamer's `gvainference` element has
> limitations with FP32-input classification models, the image classification path
> uses direct OpenVINO Runtime inference (`run_image_classification` in
> `run_inference_oep.py`) rather than the DL Streamer Docker pipeline. The detection
> pipeline described in the base blueprint continues to use DL Streamer.

### Sensor MLP Model — Small Pretrained Network on OpenVINO

A compact Multi-Layer Perceptron is pre-trained on the tabular sensor data. It is the primary contribution of the multimodal extension — a lightweight, purpose-built network that processes seven sensor readings and emits a class probability vector of the same shape as the image classifier output, enabling direct weighted fusion.
- **Task:** 4-class classification from 7-dimensional sensor input
- **Input:** Z-score-normalised MQ-sensor vector `[MQ2, MQ3, MQ5, MQ6, MQ7, MQ8, MQ135]`
- **Output:** Class probability vector of length 4 (softmax; logits converted at runtime)
- **Model path:** `models/ov_models/gas_detection/sensor_mlp/sensor_mlp.xml`
- **Inference device:** CPU (the model is small; CPU avoids GPU memory pressure)
- **Pre-processing:** Z-score normalisation using dataset-wide mean and standard deviation computed inline at inference time from the full CSV

The MLP is deliberately small. Its role is not to replace the image model but to contribute an independent, complementary signal. On samples where the image is ambiguous (e.g., diffuse smoke that looks similar to clean air), the sensor readings often provide the discriminating evidence. Conversely, on samples where sensors are near saturation or noisy, the image provides the stabilising signal.

### Reasoning Models — LLMs on OpenVINO

Identical to the base pipeline. See [predictive_maintenance_pipeline.md — Reasoning Models](predictive_maintenance_pipeline.md#reasoning-models----llms-on-openvino-genai).

## Architecture: The Three-Unit Stack

The three-unit structure is unchanged from the base blueprint. The multimodal extensions affect Units 1 and 2. Unit 3 (agent reasoning) operates identically — it reads structured records from SQLite regardless of how many modalities produced them.
```
┌──────────────────────────────────────────────────────────────┐
│                            Web UI                            │
│    Pipeline Execution · Interactive Chat · Agent Outputs     │
├──────────────────────────────────────────────────────────────┤
│                   Unit 3: Agent Reasoning                    │
│                                                              │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐   ┌──────────┐    │
│   │  Policy  │  │ Analysis │  │ Evidence │   │ Ticketing│    │
│   │  Agent   │  │  Agent   │  │  Agent   │   │  Agent   │    │
│   └────┬─────┘  └────┬─────┘  └────┬─────┘   └────┬─────┘    │
│        └─────────────┴─────────────┴──────────────┘          │
│                              ▲                               │
│                          Meta-Agent                          │
│                        (Coordinator)                         │
├──────────────────────────────────────────────────────────────┤
│                 Unit 2: Data / Storage Layer                 │
│                                                              │
│                    SQLite — detections.db                    │
│            image_id · source · label · confidence            │
│             image_confidence · sensor_confidence             │
├──────────────────────────────────────────────────────────────┤
│                Unit 1: Inference / Ingestion                 │
│                                                              │
│  ┌────────────────────┐    ┌──────────────────────────────┐  │
│  │   Image Modality   │    │       Sensor Modality        │  │
│  │                    │    │                              │  │
│  │ YOLOv8 Classifier  │    │  MQ2 · MQ3 · MQ5 · MQ6       │  │
│  │ (OpenVINO, GPU)    │    │  MQ7 · MQ8 · MQ135           │  │
│  │ → P(class|image)   │    │  Z-score norm → MLP (CPU)    │  │
│  │                    │    │  → P(class|sensors)          │  │
│  └─────────┬──────────┘    └──────────────┬───────────────┘  │
│            └──────────────┬───────────────┘                  │
│                           ▼                                  │
│                Late Fusion: weighted average                 │
│          0.6 × P(image) + 0.4 × P(sensor) → argmax           │
└──────────────────────────────────────────────────────────────┘
                                ▲
                       Intel CPU/iGPU/NPU
```

### Unit 1: Inference and Ingestion — Multimodal

Unit 1 now runs three sequential stages for each batch of images: two inference stages followed by a fusion stage.

#### Stage A — Image Classification

The image classifier processes each image in the validation set independently:

1. Load image → resize to 640×640 → normalise to float32 [0, 1]
2. Transpose HWC → NCHW and add batch dimension
3. Run OpenVINO compiled model on GPU
4. Collect softmax output vector `P_image[i]` of length 4 per image `i`
5. If the model outputs raw logits (detected by negative values or non-unit sum), apply softmax: `exp(x - max(x)) / sum(exp(x - max(x)))`

The result is a dictionary `image_probs: {image_name → [p0, p1, p2, p3]}`.

#### Stage B — Sensor MLP Inference

The sensor MLP processes the aligned CSV rows for the same images:

1. Load `Gas_Sensors_Measurements.csv` into a lookup keyed by image stem
2. Compute Z-score normalisation parameters from the **full dataset** (not just the current batch): `μ = mean(all_rows)`, `σ = std(all_rows)`, replacing `σ` with `1` wherever `σ = 0` to prevent division by zero
3. For each image name, retrieve its sensor row, normalise it with `x_norm = (x - μ) / σ`, and run it through the compiled OpenVINO MLP on CPU
4. Apply softmax to convert logits to probabilities if needed
5. If no sensor row exists for an image (missing data), fall back to a uniform distribution `[0.25, 0.25, 0.25, 0.25]`

The result is a dictionary `sensor_probs: {image_name → [p0, p1, p2, p3]}`.

#### Stage C — Late Fusion

Late fusion combines the two probability vectors using a configurable weighted average:

```
P_fused[i] = w_image × P_image[i] + w_sensor × P_sensor[i]
P_fused[i] = P_fused[i] / sum(P_fused[i])   # re-normalise
predicted_class = argmax(P_fused[i])
```

Default weights for the gas detection use case: `w_image = 0.6`, `w_sensor = 0.4`. These are set in `config/gas_detection/config.yaml` under `fusion_weights` and can be tuned without retraining either model.

The fused result record for each image carries:

- `label` — final predicted class name
- `confidence` — `max(P_fused)`, the fused probability of the predicted class
- `image_confidence` — `P_image[predicted_class_idx]`, image branch contribution
- `sensor_confidence` — `P_sensor[predicted_class_idx]`, sensor branch contribution

This three-field confidence breakdown is written through to the SQLite database, enabling downstream agents to reason about the contribution of each modality to any given classification.
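
The arithmetic of Stages B and C can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `run_inference_oep.py`: the function names are hypothetical, the class index order is assumed to follow the alphabetical directory listing, the standard deviation is assumed to be the population form, and the probability vectors are invented for the example.

```python
import math

# Assumption: class indices follow the alphabetical directory order.
CLASSES = ["Mixture", "NoGas", "Perfume", "Smoke"]


def to_probs(x):
    """Return x if it already looks like a distribution, else apply softmax."""
    x = [float(v) for v in x]
    if all(v >= 0 for v in x) and abs(sum(x) - 1.0) < 1e-3:
        return x
    m = max(x)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in x]
    total = sum(e)
    return [v / total for v in e]


def zscore_params(all_rows):
    """Per-column mean and population std over the full dataset; std 0 -> 1."""
    n = len(all_rows)
    cols = list(zip(*all_rows))
    mu = [sum(c) / n for c in cols]
    sigma = [math.sqrt(sum((v - m) ** 2 for v in c) / n) for c, m in zip(cols, mu)]
    return mu, [s if s > 0 else 1.0 for s in sigma]


def normalise(row, mu, sigma):
    """x_norm = (x - mu) / sigma, element-wise."""
    return [(v - m) / s for v, m, s in zip(row, mu, sigma)]


def fuse(p_image, p_sensor, w_image=0.6, w_sensor=0.4):
    """Weighted late fusion; a missing modality (None) becomes uniform."""
    uniform = [1.0 / len(CLASSES)] * len(CLASSES)
    p_img = uniform if p_image is None else to_probs(p_image)
    p_sen = uniform if p_sensor is None else to_probs(p_sensor)
    fused = [w_image * a + w_sensor * b for a, b in zip(p_img, p_sen)]
    total = sum(fused)
    fused = [v / total for v in fused]  # re-normalise
    idx = max(range(len(fused)), key=fused.__getitem__)
    return {
        "label": CLASSES[idx],
        "confidence": fused[idx],
        "image_confidence": p_img[idx],
        "sensor_confidence": p_sen[idx],
    }


# Two-column toy dataset: the second column is constant, so its std falls back to 1.
mu, sigma = zscore_params([[1.0, 5.0], [3.0, 5.0]])
x_norm = normalise([3.0, 5.0], mu, sigma)

# Image branch is torn between NoGas and Smoke; sensors favour Smoke.
p_image = [0.05, 0.48, 0.05, 0.42]
result = fuse(p_image, [0.05, 0.10, 0.05, 0.80])
fallback = fuse(p_image, None)  # no sensor row: the image decides alone
print(result["label"], round(result["confidence"], 3))
print(fallback["label"], round(fallback["confidence"], 3))
```

In this invented example the image branch alone would pick `NoGas`, while the fused result is `Smoke`, which shows how the sensor branch can tip an ambiguous image. Note that the logit check in `to_probs` is a heuristic: a logit vector that happens to be non-negative and sum to 1 would be passed through unchanged.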

**Handling missing modalities at runtime:**

- If image inference fails for an image, `P_image` defaults to a uniform distribution before fusion — the sensor signal still contributes.
- If no sensor row is found for an image, `P_sensor` defaults to uniform — the image signal still contributes.
- In this way, the fused pipeline degrades gracefully rather than failing hard when one modality is unavailable.

#### Visualization

After fusion, `generate_classification_viz` annotates each source image with three prediction lines:

```
Image: