Model Selection Guide: Wind Turbine Anomaly Detection#
Overview#
Select and integrate your own ML models with our time-series analytics infrastructure.
What You Can Do:
Use any model (Random Forest, XGBoost, Neural Networks, etc.)
Any format (pkl, ONNX, PyTorch, TensorFlow, joblib, custom)
Modify UDF to load/run your model
Leverage GPU acceleration
Application Context#
Problem: Detect anomalies in wind turbine operations using real-time SCADA data
Input: Wind Speed (m/s), Grid Active Power (kW)
Output: Expected Grid Active Power (kW)
Deployment: Edge infrastructure (CPU or GPU accelerated)
Framework: Kapacitor UDF (User Defined Function)
Mode: Single-point streaming only (live data, point-by-point)
No batch processing: Each data point processed individually as it arrives
Reference Implementation#
Current Model: Random Forest Regressor (350 trees, max_depth=25)
File:
windturbine_anomaly_detector.pklPerformance: MAE < 50 kW, R² > 0.95
Why: Good balance of accuracy, speed, and interpretability
You can replace this with any model - see integration section below.
Quick Start Code#
# XGBoost with GPU
from xgboost import XGBRegressor
model = XGBRegressor(n_estimators=350, max_depth=25, tree_method='gpu_hist')
# LightGBM with GPU
from lightgbm import LGBMRegressor
model = LGBMRegressor(n_estimators=350, max_depth=25, device='gpu')
# Simple Polynomial
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly = PolynomialFeatures(degree=3)
model = LinearRegression()
# PyTorch MLP
import torch.nn as nn
class WindTurbineNet(nn.Module):
def __init__(self):
super().__init__()
self.net = nn.Sequential(
nn.Linear(1, 64), nn.ReLU(), nn.Dropout(0.2),
nn.Linear(64, 32), nn.ReLU(),
nn.Linear(32, 1)
)
def forward(self, x): return self.net(x)
Model Integration#
Supported Serialization Formats#
# 1. Pickle/Joblib (sklearn models)
import pickle, joblib
pickle.dump(model, open('model.pkl', 'wb'))
model = pickle.load(open('model.pkl', 'rb'))
# 2. ONNX (cross-platform)
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
onnx_model = convert_sklearn(model, initial_types=[('input', FloatTensorType([None, 1]))])
# 3. PyTorch
torch.save(model.state_dict(), 'model.pt')
model.load_state_dict(torch.load('model.pt'))
# 4. TensorFlow
model.save('model')
model = tf.keras.models.load_model('model')
# 5. OpenVINO (Intel hardware optimization)
# Use mo tool to convert ONNX/PyTorch/TF to IR format
Update UDF for Your Model#
Edit windturbine_anomaly_detector.py:
# 1. Update loading
def load_model(filename, format=None):
"""
Load a model from disk.
Parameters
----------
filename : str
Path to the model file.
format : str, optional
Model format: "pkl", "onnx", "pt"/"torch". If None, inferred from file extension.
"""
import os
# Infer format from file extension if not provided
if format is None:
_, ext = os.path.splitext(filename)
format = ext.lstrip(".").lower()
# Pickle (default)
if format in ("pkl", "pickle", ""):
import pickle
with open(filename, "rb") as f:
return pickle.load(f)
# ONNX
# elif format == "onnx":
# import onnxruntime as rt
# return rt.InferenceSession(filename)
# PyTorch
# elif format in ("pt", "torch"):
# import torch
# model = YourModelClass()
# model.load_state_dict(torch.load(filename))
# return model.eval()
# Unknown / unsupported format
raise ValueError(f"Unsupported model format: {format!r}")
# 2. Update inference (single point only)
y_pred = self.model.predict(np.reshape(x, (-1, 1))) # sklearn - single point
# y_pred = self.model.run(None, {'input': x})[0] # ONNX - single point
# y_pred = self.model(torch.tensor([[x]])).item() # PyTorch - single point
# Note: Batch inference NOT supported - process one point at a time
Update config.json#
{
"udfs": {
"name": "windturbine_anomaly_detector",
"models": "your_model.pkl", // or .onnx, .pt, etc.
"device": "gpu" // or "cpu"
}
}
Intel Optimizations#
# For sklearn models - significant speedup
from sklearnex import patch_sklearn
patch_sklearn()
# For deep learning - use OpenVINO
# Converts models to optimized IR format for Intel hardware
Model Performance Requirements#
Performance Testing Protocol#
Split Data: 70% train, 30% test
Cross-Validation: 5-fold CV on training set
Test Set: Never used during training/tuning
Hardware Testing: Profile on target edge device (both CPU and GPU)
Load Testing: Simulate 10+ concurrent turbine streams (each processing single points)
GPU Efficiency: Measure GPU memory usage and utilization
GPU-Specific Tests:
Monitor VRAM consumption during streaming inference
Test CPU fallback behavior if GPU unavailable
Benchmark GPU vs CPU for single-point inference
Verify multi-stream concurrent processing (10+ turbines)
Data Requirements#
Training Data: 10k-50k samples minimum, 6+ months coverage
Preprocessing (remove these points):
Wind speed < 3 m/s or > 14 m/s (cut-in/cut-out)
Power < 50 kW when 3 < wind_speed < 14 (curtailment)
NaN/missing values
Known anomalies (use separate validation set)
def preprocess(df):
df = df.dropna()
return df[
(df['wind_speed'] >= 3) & (df['wind_speed'] <= 14) &
(df['grid_activepower'] >= 50)
]
Additional Features (optional): wind direction, temperature, blade pitch, rotor speed
Using Your Own Dataset#
Dataset Requirements#
Your dataset should contain:
Required columns:
wind_speed,grid_activepower(or equivalent power output column)Optional columns:
timestamp,wind_direction,temperature, blade pitch, rotor speedFormat: CSV, Parquet, or any pandas-readable format
Size: 10k-50k samples minimum for training
Note: If you use a dataset different from the reference, you are responsible for training a model with the required feature set and aligning the preprocessing steps accordingly.
Example Dataset Formats#
Current reference dataset (T1.csv):
timestamp,grid_activepower,wind_speed,Theoretical_Power_Curve,Wind Direction (°)
01 01 2018 00:00,380.05,5.31,416.33,259.99
01 01 2018 00:10,453.77,5.67,519.92,268.64
Your dataset - adapt column names:
time,power_output,wind_speed_ms,temperature
2024-01-01 00:00:00,385.2,5.3,15.2
2024-01-01 00:10:00,448.1,5.6,15.4
Changes Required for Your Dataset#
1. Update column names in training notebook:
Edit training/windturbine_anomaly_detection.ipynb:
# Original
X = df[['wind_speed']]
y = df[['Theoretical_Power_Curve']] # or 'grid_activepower'
# Your dataset - update to match your column names
X = df[['wind_speed_ms']] # or whatever your wind speed column is called
y = df[['power_output']] # or your power column name
2. Update preprocessing function:
def preprocess(df):
# Map your column names to expected names
df = df.rename(columns={
'power_output': 'grid_activepower',
'wind_speed_ms': 'wind_speed'
})
df = df.dropna()
return df[
(df['wind_speed'] >= 3) & (df['wind_speed'] <= 14) &
(df['grid_activepower'] >= 50)
]
3. Update data loading:
# Load your dataset
df = pd.read_csv('path/to/your_dataset.csv')
# If using different time format
df['timestamp'] = pd.to_datetime(df['time'])
# If you have multiple turbines, filter for specific one
df = df[df['turbine_id'] == 'T1']
4. Adjust operational parameters (if your turbine specs differ):
# In training notebook and UDF (windturbine_anomaly_detector.py)
cut_in_speed = 3 # Your turbine's cut-in speed (m/s)
cut_out_speed = 14 # Your turbine's cut-out speed (m/s)
min_power_th = 50 # Minimum power threshold (kW)
error_threshold = 0.15 # Adjust based on your data
5. Update simulation data (optional):
Place your dataset in simulation-data/:
cp your_dataset.csv simulation-data/wind-turbine-anomaly-detection.csv
Update Telegraf config to read from your file:
# telegraf-config/Telegraf.conf
[[inputs.file]]
files = ["simulation-data/your_dataset.csv"]
data_format = "csv"
csv_column_names = ["wind_speed", "grid_activepower"] # Your columns
Dataset Validation Checklist#
Before training:
[ ] Verify column names match expected format (or update code)
[ ] Check data types (numeric for wind_speed and power)
[ ] Validate wind speed range (typically 0-25 m/s)
[ ] Check for missing values (< 5% recommended)
[ ] Verify timestamp format if using temporal features
[ ] Confirm power values are in kW (or convert to kW)
[ ] Remove or separate known anomalies for validation
[ ] Ensure sufficient samples in operational range (3-14 m/s)
Multi-Turbine Datasets#
If your dataset contains multiple turbines:
# Option 1: Train separate models per turbine
for turbine_id in df['turbine_id'].unique():
turbine_df = df[df['turbine_id'] == turbine_id]
model = train_model(turbine_df)
save_model(model, f'windturbine_{turbine_id}_v1.0.pkl')
# Option 2: Train single model with turbine as feature
X = df[['wind_speed', 'turbine_id_encoded']]
y = df['grid_activepower']
model = train_model(X, y)
# Option 3: Use shared model (assumes similar turbines)
df_combined = df # Use all turbines together
model = train_model(df_combined)
Testing Checklist#
Before deploying:
Accuracy:
[ ] MAE, RMSE, R² meet targets (see criteria table)
[ ] Train/test gap < 5% (check overfitting)
[ ] Test on labeled anomalies (FP/FN rates)
Performance:
[ ] Inference < 10ms (avg of 1000 predictions)
[ ] Concurrent streams work (test 10+ turbines)
[ ] Model loads < 5 seconds
Integration:
[ ] Model loads correctly in UDF
[ ] Prediction interface works
[ ] Config.json updated
[ ] Works on target hardware
Robustness:
[ ] Handles NaN/missing values
[ ] Out-of-range inputs don’t crash
[ ] Memory usage acceptable
Evaluation Script:
from sklearn.metrics import mean_absolute_error, r2_score
import time, numpy as np
def evaluate(model, X_test, y_test):
# Accuracy
y_pred = model.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, y_pred):.2f} kW")
print(f"R²: {r2_score(y_test, y_pred):.4f}")
# Inference speed
num_iters = 1000
start = time.perf_counter()
for _ in range(num_iters):
model.predict(X_test[:1])
elapsed = time.perf_counter() - start
avg_latency_ms = (elapsed / num_iters) * 1000
print(f"Avg latency per inference: {avg_latency_ms:.2f} ms")
Retraining Guidelines#
When to Retrain:
MAE increases >20% from baseline
False positive rate >50% increase
New turbine model/configuration
Quarterly (recommended schedule)
After 3-6 months new data
Process:
Collect 6-12 months operational data
Remove maintenance periods, validate quality
Train with same pipeline, compare vs current
A/B test on subset (1-2 weeks)
Gradual rollout: 10% → 50% → 100%
Version Control: windturbine_anomaly_detector_vX.Y.<format>
X = algorithm change
Y = retrain same algorithm
Appendix: Hyperparameter Tuning Guidance#
Random Forest Tuning#
Key Hyperparameters:
n_estimators(number of trees):Default: 350
Range: 100-500
Higher = better accuracy but larger model size
Diminishing returns after ~300-400
max_depth(tree depth):Default: 25
Range: 10-30
Higher = risk of overfitting
Lower = underfitting
min_samples_split:Default: 2
Range: 2-10
Higher = prevents overfitting but may underfit
min_samples_leaf:Default: 1
Range: 1-5
Higher = smoother predictions
max_features:Default: ‘sqrt’
Options: ‘sqrt’, ‘log2’, None
Controls feature sampling per split
Tuning Template:
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': [200, 350, 500],
'max_depth': [20, 25, 30],
'min_samples_split': [2, 5],
'min_samples_leaf': [1, 2],
'max_features': ['sqrt', 'log2']
}
grid_search = GridSearchCV(
RandomForestRegressor(random_state=42),
param_grid,
cv=5,
scoring='neg_mean_absolute_error',
n_jobs=-1
)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
Appendix: Troubleshooting#
Issue |
Solution |
|---|---|
High false positives |
Increase |
Missing anomalies |
Decrease threshold, improve model accuracy |
Slow inference (>50ms) |
Reduce trees, enable Intel optimizations, simpler model |
Overfitting |
Reduce |
Poor generalization |
Multi-turbine training data, feature engineering |
Integration Checklist#
[ ] Model saved (.pkl, .onnx, .pt, etc.)
[ ]
config.jsonupdated with model filename[ ] Model in
time-series-analytics-config/models/[ ] UDF loading logic updated
[ ] Prediction interface compatible
[ ] Tested on target hardware
[ ] Performance validated (latency, accuracy)
[ ] Model version documented
Appendix: References and Resources#
Documentation#
Data Sources#
Primary Dataset: Kaggle Wind Turbine SCADA Dataset
Training File:
training/T1.csvSimulation File:
simulation-data/wind-turbine-anomaly-detection.csv
Tools and Libraries#
Intel® Extension for Scikit-learn: Performance optimization
Scikit-learn: Model training and evaluation
Kapacitor: Time series processing and UDF framework
Grafana: Visualization dashboard
Recommended Reading#
Wind Turbine Power Curve modeling techniques (Eg: https://www.mdpi.com/1996-1073/16/1/180)
Edge AI deployment best practices
Intel optimization guides for ML inference: https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html
Time series anomaly detection methods