Model Selection Guide: Wind Turbine Anomaly Detection#

Overview#

Select and integrate your own ML models with our time-series analytics infrastructure.

What You Can Do:

  • Use any model (Random Forest, XGBoost, Neural Networks, etc.)

  • Any format (pkl, ONNX, PyTorch, TensorFlow, joblib, custom)

  • Modify UDF to load/run your model

  • Leverage GPU acceleration


Application Context#

Problem: Detect anomalies in wind turbine operations using real-time SCADA data

  • Input: Wind Speed (m/s), Grid Active Power (kW)

  • Output: Expected Grid Active Power (kW)

  • Deployment: Edge infrastructure (CPU or GPU accelerated)

  • Framework: Kapacitor UDF (User Defined Function)

  • Mode: Single-point streaming only (live data, point-by-point)

  • No batch processing: Each data point processed individually as it arrives


Reference Implementation#

Current Model: Random Forest Regressor (350 trees, max_depth=25)

  • File: windturbine_anomaly_detector.pkl

  • Performance: MAE < 50 kW, R² > 0.95

  • Why: Good balance of accuracy, speed, and interpretability

You can replace this with any model - see integration section below.


Quick Start Code#

# XGBoost with GPU
from xgboost import XGBRegressor
model = XGBRegressor(n_estimators=350, max_depth=25, tree_method='gpu_hist')

# LightGBM with GPU
from lightgbm import LGBMRegressor
model = LGBMRegressor(n_estimators=350, max_depth=25, device='gpu')

# Simple Polynomial
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly = PolynomialFeatures(degree=3)
model = LinearRegression()

# PyTorch MLP
import torch.nn as nn
class WindTurbineNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1)
        )
    def forward(self, x): return self.net(x)

Model Integration#

Supported Serialization Formats#

# 1. Pickle/Joblib (sklearn models)
import pickle, joblib
pickle.dump(model, open('model.pkl', 'wb'))
model = pickle.load(open('model.pkl', 'rb'))

# 2. ONNX (cross-platform)
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
onnx_model = convert_sklearn(model, initial_types=[('input', FloatTensorType([None, 1]))])

# 3. PyTorch
torch.save(model.state_dict(), 'model.pt')
model.load_state_dict(torch.load('model.pt'))

# 4. TensorFlow
model.save('model')
model = tf.keras.models.load_model('model')

# 5. OpenVINO (Intel hardware optimization)
# Use mo tool to convert ONNX/PyTorch/TF to IR format

Update UDF for Your Model#

Edit windturbine_anomaly_detector.py:

# 1. Update loading
def load_model(filename, format=None):
    """
    Load a model from disk.

    Parameters
    ----------
    filename : str
        Path to the model file.
    format : str, optional
        Model format: "pkl", "onnx", "pt"/"torch". If None, inferred from file extension.
    """
    import os

    # Infer format from file extension if not provided
    if format is None:
        _, ext = os.path.splitext(filename)
        format = ext.lstrip(".").lower()

    # Pickle (default)
    if format in ("pkl", "pickle", ""):
        import pickle
        with open(filename, "rb") as f:
            return pickle.load(f)

    # ONNX
    # elif format == "onnx":
    #     import onnxruntime as rt
    #     return rt.InferenceSession(filename)

    # PyTorch
    # elif format in ("pt", "torch"):
    #     import torch
    #     model = YourModelClass()
    #     model.load_state_dict(torch.load(filename))
    #     return model.eval()

    # Unknown / unsupported format
    raise ValueError(f"Unsupported model format: {format!r}")
# 2. Update inference (single point only)
y_pred = self.model.predict(np.reshape(x, (-1, 1)))  # sklearn - single point
# y_pred = self.model.run(None, {'input': x})[0]  # ONNX - single point
# y_pred = self.model(torch.tensor([[x]])).item()  # PyTorch - single point

# Note: Batch inference NOT supported - process one point at a time

Update config.json#

{
    "udfs": {
        "name": "windturbine_anomaly_detector",
        "models": "your_model.pkl",  // or .onnx, .pt, etc.
        "device": "gpu"  // or "cpu"
    }
}

Intel Optimizations#

# For sklearn models - significant speedup
from sklearnex import patch_sklearn
patch_sklearn()

# For deep learning - use OpenVINO
# Converts models to optimized IR format for Intel hardware

Model Performance Requirements#

Performance Testing Protocol#

  1. Split Data: 70% train, 30% test

  2. Cross-Validation: 5-fold CV on training set

  3. Test Set: Never used during training/tuning

  4. Hardware Testing: Profile on target edge device (both CPU and GPU)

  5. Load Testing: Simulate 10+ concurrent turbine streams (each processing single points)

  6. GPU Efficiency: Measure GPU memory usage and utilization

GPU-Specific Tests:

  • Monitor VRAM consumption during streaming inference

  • Test CPU fallback behavior if GPU unavailable

  • Benchmark GPU vs CPU for single-point inference

  • Verify multi-stream concurrent processing (10+ turbines)


Data Requirements#

Training Data: 10k-50k samples minimum, 6+ months coverage

Preprocessing (remove these points):

  • Wind speed < 3 m/s or > 14 m/s (cut-in/cut-out)

  • Power < 50 kW when 3 < wind_speed < 14 (curtailment)

  • NaN/missing values

  • Known anomalies (use separate validation set)

def preprocess(df):
    df = df.dropna()
    return df[
        (df['wind_speed'] >= 3) & (df['wind_speed'] <= 14) &
        (df['grid_activepower'] >= 50)
    ]

Additional Features (optional): wind direction, temperature, blade pitch, rotor speed


Using Your Own Dataset#

Dataset Requirements#

Your dataset should contain:

  • Required columns: wind_speed, grid_activepower (or equivalent power output column)

  • Optional columns: timestamp, wind_direction, temperature, blade pitch, rotor speed

  • Format: CSV, Parquet, or any pandas-readable format

  • Size: 10k-50k samples minimum for training

Note: If you use a dataset different from the reference, you are responsible for training a model with the required feature set and aligning the preprocessing steps accordingly.

Example Dataset Formats#

Current reference dataset (T1.csv):

timestamp,grid_activepower,wind_speed,Theoretical_Power_Curve,Wind Direction (°)
01 01 2018 00:00,380.05,5.31,416.33,259.99
01 01 2018 00:10,453.77,5.67,519.92,268.64

Your dataset - adapt column names:

time,power_output,wind_speed_ms,temperature
2024-01-01 00:00:00,385.2,5.3,15.2
2024-01-01 00:10:00,448.1,5.6,15.4

Changes Required for Your Dataset#

1. Update column names in training notebook:

Edit training/windturbine_anomaly_detection.ipynb:

# Original
X = df[['wind_speed']]
y = df[['Theoretical_Power_Curve']]  # or 'grid_activepower'

# Your dataset - update to match your column names
X = df[['wind_speed_ms']]  # or whatever your wind speed column is called
y = df[['power_output']]    # or your power column name

2. Update preprocessing function:

def preprocess(df):
    # Map your column names to expected names
    df = df.rename(columns={
        'power_output': 'grid_activepower',
        'wind_speed_ms': 'wind_speed'
    })

    df = df.dropna()
    return df[
        (df['wind_speed'] >= 3) & (df['wind_speed'] <= 14) &
        (df['grid_activepower'] >= 50)
    ]

3. Update data loading:

# Load your dataset
df = pd.read_csv('path/to/your_dataset.csv')

# If using different time format
df['timestamp'] = pd.to_datetime(df['time'])

# If you have multiple turbines, filter for specific one
df = df[df['turbine_id'] == 'T1']

4. Adjust operational parameters (if your turbine specs differ):

# In training notebook and UDF (windturbine_anomaly_detector.py)
cut_in_speed = 3      # Your turbine's cut-in speed (m/s)
cut_out_speed = 14    # Your turbine's cut-out speed (m/s)
min_power_th = 50     # Minimum power threshold (kW)
error_threshold = 0.15  # Adjust based on your data

5. Update simulation data (optional):

Place your dataset in simulation-data/:

cp your_dataset.csv simulation-data/wind-turbine-anomaly-detection.csv

Update Telegraf config to read from your file:

# telegraf-config/Telegraf.conf
[[inputs.file]]
  files = ["simulation-data/your_dataset.csv"]
  data_format = "csv"
  csv_column_names = ["wind_speed", "grid_activepower"]  # Your columns

Dataset Validation Checklist#

Before training:

  • [ ] Verify column names match expected format (or update code)

  • [ ] Check data types (numeric for wind_speed and power)

  • [ ] Validate wind speed range (typically 0-25 m/s)

  • [ ] Check for missing values (< 5% recommended)

  • [ ] Verify timestamp format if using temporal features

  • [ ] Confirm power values are in kW (or convert to kW)

  • [ ] Remove or separate known anomalies for validation

  • [ ] Ensure sufficient samples in operational range (3-14 m/s)

Multi-Turbine Datasets#

If your dataset contains multiple turbines:

# Option 1: Train separate models per turbine
for turbine_id in df['turbine_id'].unique():
    turbine_df = df[df['turbine_id'] == turbine_id]
    model = train_model(turbine_df)
    save_model(model, f'windturbine_{turbine_id}_v1.0.pkl')

# Option 2: Train single model with turbine as feature
X = df[['wind_speed', 'turbine_id_encoded']]
y = df['grid_activepower']
model = train_model(X, y)

# Option 3: Use shared model (assumes similar turbines)
df_combined = df  # Use all turbines together
model = train_model(df_combined)

Testing Checklist#

Before deploying:

Accuracy:

  • [ ] MAE, RMSE, R² meet targets (see criteria table)

  • [ ] Train/test gap < 5% (check overfitting)

  • [ ] Test on labeled anomalies (FP/FN rates)

Performance:

  • [ ] Inference < 10ms (avg of 1000 predictions)

  • [ ] Concurrent streams work (test 10+ turbines)

  • [ ] Model loads < 5 seconds

Integration:

  • [ ] Model loads correctly in UDF

  • [ ] Prediction interface works

  • [ ] Config.json updated

  • [ ] Works on target hardware

Robustness:

  • [ ] Handles NaN/missing values

  • [ ] Out-of-range inputs don’t crash

  • [ ] Memory usage acceptable

Evaluation Script:

from sklearn.metrics import mean_absolute_error, r2_score
import time, numpy as np

def evaluate(model, X_test, y_test):
    # Accuracy
    y_pred = model.predict(X_test)
    print(f"MAE: {mean_absolute_error(y_test, y_pred):.2f} kW")
    print(f"R²: {r2_score(y_test, y_pred):.4f}")

    # Inference speed
    num_iters = 1000
    start = time.perf_counter()
    for _ in range(num_iters):
        model.predict(X_test[:1])
    elapsed = time.perf_counter() - start
    avg_latency_ms = (elapsed / num_iters) * 1000
    print(f"Avg latency per inference: {avg_latency_ms:.2f} ms")

Retraining Guidelines#

When to Retrain:

  • MAE increases >20% from baseline

  • False positive rate >50% increase

  • New turbine model/configuration

  • Quarterly (recommended schedule)

  • After 3-6 months new data

Process:

  1. Collect 6-12 months operational data

  2. Remove maintenance periods, validate quality

  3. Train with same pipeline, compare vs current

  4. A/B test on subset (1-2 weeks)

  5. Gradual rollout: 10% → 50% → 100%

Version Control: windturbine_anomaly_detector_vX.Y.<format>

  • X = algorithm change

  • Y = retrain same algorithm


Appendix: Hyperparameter Tuning Guidance#

Random Forest Tuning#

Key Hyperparameters:

  1. n_estimators (number of trees):

    • Default: 350

    • Range: 100-500

    • Higher = better accuracy but larger model size

    • Diminishing returns after ~300-400

  2. max_depth (tree depth):

    • Default: 25

    • Range: 10-30

    • Higher = risk of overfitting

    • Lower = underfitting

  3. min_samples_split:

    • Default: 2

    • Range: 2-10

    • Higher = prevents overfitting but may underfit

  4. min_samples_leaf:

    • Default: 1

    • Range: 1-5

    • Higher = smoother predictions

  5. max_features:

    • Default: ‘sqrt’

    • Options: ‘sqrt’, ‘log2’, None

    • Controls feature sampling per split

Tuning Template:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [200, 350, 500],
    'max_depth': [20, 25, 30],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt', 'log2']
}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='neg_mean_absolute_error',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_

Appendix: Troubleshooting#

Issue

Solution

High false positives

Increase error_threshold (0.15 default), adjust n_steps window

Missing anomalies

Decrease threshold, improve model accuracy

Slow inference (>50ms)

Reduce trees, enable Intel optimizations, simpler model

Overfitting

Reduce max_depth, increase min_samples_split, more data

Poor generalization

Multi-turbine training data, feature engineering

Integration Checklist#

  • [ ] Model saved (.pkl, .onnx, .pt, etc.)

  • [ ] config.json updated with model filename

  • [ ] Model in time-series-analytics-config/models/

  • [ ] UDF loading logic updated

  • [ ] Prediction interface compatible

  • [ ] Tested on target hardware

  • [ ] Performance validated (latency, accuracy)

  • [ ] Model version documented


Appendix: References and Resources#

Documentation#

Data Sources#

Tools and Libraries#

  • Intel® Extension for Scikit-learn: Performance optimization

  • Scikit-learn: Model training and evaluation

  • Kapacitor: Time series processing and UDF framework

  • Grafana: Visualization dashboard