Critical Importance of Core Pinning on Intel Edge Platforms#
Today’s Intel edge processors are designed around a fundamental principle: power is a shared, finite resource. The processor’s total power budget (package power) is dynamically distributed among CPU cores (P-cores, E-cores, and LPE-cores), uncore components including the GPU, the NPU (Neural Processing Unit), and memory controllers and I/O.
With proper core pinning, you control precisely which cores are active and prevent the operating system’s default scheduler from spreading your application across all available cores. Without pinning, even a single-threaded application can wake multiple cores unnecessarily.
Core pinning mitigates problems such as:
Increased power consumption from activating unnecessary cores.
Reduced turbo frequencies as the processor throttles to stay within thermal limits.
GPU and NPU power starvation when CPU cores consume the bulk of the power budget.
Cache pollution and memory bandwidth contention from thread migration.
On modern hybrid processors, such as Intel’s Arrow Lake, Lunar Lake, and Panther Lake platforms, the effect of core pinning is even more pronounced for several reasons:
Heterogeneous core types: P-cores (Performance), E-cores (Efficient), and LPE-cores (Low Power Efficient) have drastically different power and performance characteristics.
Integrated accelerators: GPU and NPU share the same package power budget with CPU cores.
AI workloads: Vision inference, video analytics, and ML pipelines often combine CPU, GPU, and NPU—power competition becomes critical.
Tools for Detection and Monitoring#
1. Detecting Core Types: obtain_cores.sh#
Intel provides a comprehensive script to detect and enumerate core types on hybrid Intel platforms. The script is available in the edge-workloads-and-benchmarks repository.
Location:
utils/obtain_cores.sh
Usage:
cd utils/
./obtain_cores.sh
Example output:
pcore:0,1,2,3,4,5,6,7
ecore:8,9,10,11,12,13,14,15,16,17,18,19,20,21
lpecore:22,23,24,25
This script uses multiple detection methods with fallbacks:
Multi-socket Xeon detection: assigns all cores as P-cores on server platforms.
CPUID-based detection: uses the cpuid instruction to identify Intel Core vs Intel Atom cores.
sysfs validation: reads /sys/devices/cpu_core/, /sys/devices/cpu_atom/, /sys/devices/cpu_lowpower/.
L1d cache drop detection: identifies E-cores by detecting cache size transitions.
SMT pair detection: classifies remaining cores based on hyperthreading topology.
The script outputs comma-separated core IDs for each type, which can be directly used with
the taskset command to pin workloads.
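The sysfs validation step listed above can be approximated with a few lines of shell. This is a minimal sketch, not the logic of obtain_cores.sh itself: the function name list_hybrid_cores and its optional root-directory argument are hypothetical, and it assumes that each of the three directories exposes a cpus file containing a CPU list. It prints whatever the kernel reports (often a range such as 0-7) rather than the expanded comma-separated list the script produces.

```shell
#!/usr/bin/env bash
# Sketch only: report the CPU list found under each hybrid core directory.
# list_hybrid_cores [ROOT]
#   ROOT is a hypothetical override (default /sys/devices) so the function
#   can be tried against a copy of the sysfs tree.
list_hybrid_cores() {
    local root="${1:-/sys/devices}"
    local entry label dir
    for entry in pcore:cpu_core ecore:cpu_atom lpecore:cpu_lowpower; do
        label="${entry%%:*}"   # text before the colon, e.g. "pcore"
        dir="${entry##*:}"     # text after the colon, e.g. "cpu_core"
        # Assumption: the directory holds a readable "cpus" file.
        # Core types that are absent on this platform are skipped.
        if [ -r "$root/$dir/cpus" ]; then
            printf '%s:%s\n' "$label" "$(cat "$root/$dir/cpus")"
        fi
    done
}
```

Calling list_hybrid_cores with no argument reads the live system. On a platform without hybrid cores it prints nothing, which is why the script described above falls back to other detection methods.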
Pinning examples:
# Pin to P-cores only (for latency-sensitive workloads)
taskset -c 0,1,2,3,4,5,6,7 ./your_application
# Pin to E-cores only (for throughput workloads)
taskset -c 8,9,10,11,12,13,14,15,16,17,18,19,20,21 ./your_application
# Pin to LPE-cores (for background tasks)
taskset -c 22,23,24,25 ./background_service
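Rather than copying core IDs by hand, the script output can be fed to taskset directly. The helper below is a small sketch that assumes the type:list output format shown in the example above; the name cores_of is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: pick one core list out of "type:list" lines read from stdin.
# cores_of TYPE   (TYPE is pcore, ecore or lpecore in the example output)
cores_of() {
    local want="$1" type list
    while IFS=: read -r type list; do
        if [ "$type" = "$want" ]; then
            printf '%s\n' "$list"
            return 0
        fi
    done
    return 1   # the requested core type was not reported
}
```

A typical call would be taskset -c "$(./obtain_cores.sh | cores_of ecore)" ./your_application, which keeps the command portable across platforms with different core counts.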
For full documentation and usage examples, see the utils README.
2. Power Monitoring Tools#
To verify that core pinning is actually improving your power efficiency and performance, you need to monitor power consumption across all compute resources.
get_package_power.sh — Package Power Monitoring via RAPL and hwmon#
Intel provides a dedicated script to sample platform package power consumption using RAPL (Running Average Power Limit) sysfs interfaces and hardware monitoring sensors. The script is available in the edge-workloads-and-benchmarks repository.
Location: utils/get_package_power.sh
Usage:
sudo ./get_package_power.sh -i <duration (seconds)> -s <sampling interval (seconds)> -d <delay (seconds)>
Options:
-s <seconds>: Sampling interval in seconds (default: 1)
-i <seconds>: Total duration in seconds (default: 60)
-d <seconds>: Start delay in seconds (default: 0)
Example:
# Measure package power for 60 seconds at 1-second intervals
sudo ./get_package_power.sh -i 60 -s 1
# Measure for 30 seconds with a 5-second delay before starting
sudo ./get_package_power.sh -i 30 -s 1 -d 5
Example output:
[ Info ] Monitoring for 60s after a 0s delay
[rapl] card0 (xe @ 0000:00:02.0): 15.23 W
[rapl] card0 (xe @ 0000:00:02.0): 14.87 W
[rapl] card0 (xe @ 0000:00:02.0): 15.45 W
...
[ Info ] Monitoring complete
Output format:
[source] card# (driver @ pci): power W
source: Either rapl (RAPL energy counters) or hwmon (hardware monitoring sensors)
card#: DRM card identifier (e.g., card0)
driver: Graphics driver (i915 or xe)
pci: PCI device address
power: Average power over the sampling interval, in watts
How it works:
Discovers Intel graphics devices via /sys/class/drm/card*/ (i915 or xe drivers)
Detects power sensors using either:
Hardware monitoring sensors (hwmon) with package/card energy or power labels
RAPL energy counters (/sys/class/powercap/intel-rapl:*/energy_uj)
Samples power by reading energy counters at the start and end of each interval, computing power as:
power (W) = (end_energy - start_energy) / interval_duration
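The formula above can be tried by hand on two counter readings. The sketch below assumes readings taken from an energy_uj file, which reports microjoules, so the difference is divided by 1,000,000 to get joules. The wraparound handling using max_energy_range_uj is an addition for illustration; whether get_package_power.sh handles counter wraparound this way is not stated in this guide. The function name power_watts is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: average power between two energy counter readings.
# power_watts START_UJ END_UJ INTERVAL_S [MAX_RANGE_UJ]
#   START_UJ, END_UJ : readings in microjoules
#   INTERVAL_S       : seconds between the two readings
#   MAX_RANGE_UJ     : optional counter range, used if the counter wrapped
power_watts() {
    awk -v s="$1" -v e="$2" -v t="$3" -v max="${4:-0}" 'BEGIN {
        d = e - s
        if (d < 0 && max > 0) d += max   # counter wrapped past its maximum
        printf "%.2f\n", d / 1e6 / t     # microjoules -> joules -> watts
    }'
}
```

For example, reading /sys/class/powercap/intel-rapl:0/energy_uj twice, one second apart, and passing both values with an interval of 1 gives the average package power over that second. Reading these files usually requires root.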
Use cases:
Compare before/after core pinning: Run the script during your workload with and without core pinning to quantify power savings
Monitor GPU power availability: Check if freeing up CPU cores allows the GPU to consume more power (higher frequency)
Long-term profiling: Use with longer durations to understand power patterns over time
Example workflow:
# Baseline: workload without core pinning
sudo ./get_package_power.sh -i 60 -s 1 &
./my_workload
# Optimized: workload with E-core pinning
sudo ./get_package_power.sh -i 60 -s 1 &
taskset -c 8-21 ./my_workload
For full documentation, see the utils README.
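To turn the two runs in the workflow above into numbers that can be compared, the samples can be averaged. This sketch assumes the output format shown earlier, where each sample line ends in a value followed by W; the function name mean_watts is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: average every "<value> W" sample read from stdin.
# Lines that do not end in a number followed by "W" (for example the
# "[ Info ]" lines) are ignored.
mean_watts() {
    awk 'NF >= 2 && $NF == "W" && $(NF-1) ~ /^[0-9.]+$/ {
             sum += $(NF-1); n++
         }
         END {
             if (n == 0) exit 1        # no samples found
             printf "%.2f\n", sum / n
         }'
}
```

Redirecting each run to a file (for example baseline.log and pinned.log, names chosen here for illustration) and running mean_watts < baseline.log and mean_watts < pinned.log gives one figure per run.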
turbostat — Detailed Package, Core, and Graphics Power#
The turbostat utility, which ships with the Linux kernel tools, provides more granular power breakdowns:
sudo turbostat --interval 1
Key metrics to watch:
PkgWatt: Total package power (CPU + GPU + uncore)
CorWatt: Power consumed by CPU cores only
GFXWatt: Graphics (GPU) power consumption
RAMWatt: DRAM power
Example output snippet:
Core CPU Avg_MHz Busy% Bzy_MHz PkgWatt CorWatt GFXWatt
- - 2100 50.0 4200 25.0 15.0 5.0
0 0 4200 100.0 4200
1 1 4200 100.0 4200
By comparing power metrics before and after core pinning, you can quantify the impact on
package power and GPU power availability. Use get_package_power.sh for simple package-level
measurements, and turbostat when you need detailed per-core and component-level breakdowns.
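When only the power columns are of interest, they can be pulled out of the turbostat table by header name. This is a sketch that assumes the table layout shown in the example snippet, with a header row naming the columns and a summary row starting with a dash. The function name summary_watts is hypothetical, and columns missing on a given platform are printed as n/a.

```shell
#!/usr/bin/env bash
# Sketch only: print PkgWatt, CorWatt and GFXWatt from each summary row
# of turbostat-style output read from stdin.
summary_watts() {
    awk '
        # Look up a column by header name; "n/a" if the column is absent.
        function field(name) { return (name in col) ? $(col[name]) : "n/a" }

        # Header row: remember the position of every column name.
        /PkgWatt/ {
            split("", col)
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }

        # Summary row: first field is "-" (per-core rows are skipped).
        $1 == "-" && ("PkgWatt" in col) {
            print field("PkgWatt"), field("CorWatt"), field("GFXWatt")
        }'
}
```

Assuming turbostat writes its periodic table to standard output, a call such as sudo turbostat --quiet --interval 1 | summary_watts prints one line of three values per interval.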
npu-monitor-tool.py — NPU Power and Utilization#
For workloads using the Intel NPU (Neural Processing Unit), monitor NPU-specific metrics using the NPU monitoring tool from the edge-ai-libraries repository.
Location: tools/npu-monitor-tool/npu-monitor-tool.py
Usage:
sudo python3 npu-monitor-tool.py -i 1000
Example output:
+-----------------------------------------------------------------------------------------------+
| INTEL NPU Device: 0x7d1d | version: 1.0.0 |
| Firmware version: IVPU_MTL_20240112_v2024.01 |
+===============================================================================================+
| Power Usage | DPU Freq | NPU DDR Average Bandwidth | Tile Conf |
| 2.5 [W] | 1400 [Hz] | 123.45 [MB/s] | 4 |
+===============================================================================================+
| NPU Temperature | NPU Utilization | Memory Usage |
| 45 [°C] | 25% | 512.00 [MB] |
+-----------------------------------------------------------------------------------------------+
CSV export is available for long-term analysis:
sudo python3 npu-monitor-tool.py --csv -i 1000
This generates timestamped CSV files in npu_output/ with the following columns:
timestamp, power, frequency, bandwidth, tile_config, temperature, utilization, memory_usage
For complete documentation, see the npu-monitor-tool README.
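The exported CSV can be summarized with standard tools. The sketch below averages one named column; it assumes the file starts with a header row that uses the column names listed above, which should be checked against an actual export. The function name csv_mean is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: mean of one named column in a CSV file with a header row.
# csv_mean FILE COLUMN
# Exits non-zero if the column is missing or the file has no data rows.
csv_mean() {
    awk -F',' -v name="$2" '
        NR == 1 {
            # Find the requested column, ignoring spaces around the names.
            for (i = 1; i <= NF; i++) {
                h = $i
                gsub(/^[ \t]+|[ \t\r]+$/, "", h)
                if (h == name) c = i
            }
            if (!c) exit 2
            next
        }
        $c != "" { sum += $c; n++ }
        END {
            if (!n) exit 1
            printf "%.2f\n", sum / n
        }
    ' "$1"
}
```

For example, csv_mean npu_output/<file>.csv power reports the average NPU power over a run, where <file> stands for whichever timestamped file the tool produced.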
Core Pinning with DL Streamer Pipeline Server#
For AI video analytics workloads using the
DL Streamer Pipeline Server,
Intel provides built-in support for core pinning via the CORE_PINNING environment variable.
This eliminates the need to manually wrap the server with taskset and provides a declarative
way to specify core affinity in Docker Compose or Kubernetes deployments.
Using the CORE_PINNING Environment Variable#
The CORE_PINNING environment variable accepts two types of values:
Explicit core list or range (taskset-compatible syntax):
Comma-delimited list: 10,12,14
Range: 10-14
Range with step: 10-14/2 (cores 10, 12, 14)
Core type specification (automatic detection):
p-cores: Pin to Performance cores
e-cores: Pin to Efficient cores
lp-cores: Pin to Low Power Efficient cores
The server automatically detects the appropriate cores using the same detection logic as
obtain_cores.sh and applies taskset internally.
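To check which cores an explicit value covers before deploying, the three forms described above can be expanded into a plain list. This is a sketch of the notation as written in this section, not the server's own parser; the function name expand_cores is hypothetical. The result contains only comma-separated IDs, which taskset -c accepts.

```shell
#!/usr/bin/env bash
# Sketch only: expand "10,12,14", "10-14" and "10-14/2" style values
# into a comma-separated list of core IDs.
expand_cores() {
    local spec="$1" part range step first last i out=""
    local IFS=','                      # split the value on commas
    for part in $spec; do
        range="${part%%/*}"            # text before an optional "/step"
        step=1
        if [ "$part" != "$range" ]; then
            step="${part##*/}"
        fi
        if [ "$step" -lt 1 ]; then
            return 1                   # a step below 1 would never finish
        fi
        first="${range%%-*}"           # a single ID gives first == last
        last="${range##*-}"
        for ((i = first; i <= last; i += step)); do
            out+="${out:+,}$i"
        done
    done
    printf '%s\n' "$out"
}
```

For instance, expand_cores "10-14/2" prints 10,12,14, matching the example given for the range-with-step form.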
Docker Compose Example#
Pin to P-cores:
version: '3.8'
services:
  dlstreamer-pipeline-server:
    image: intel/dlstreamer-pipeline-server:2025.2.0-ubuntu22
    environment:
      CORE_PINNING: p-cores
    devices:
      - /dev/dri:/dev/dri
    ports:
      - "8080:8080"
    volumes:
      - ./pipelines:/home/pipeline-server/pipelines
Pin to E-cores:
version: '3.8'
services:
  dlstreamer-pipeline-server:
    image: intel/dlstreamer-pipeline-server:2025.2.0-ubuntu22
    environment:
      CORE_PINNING: e-cores
    devices:
      - /dev/dri:/dev/dri
    ports:
      - "8080:8080"
    volumes:
      - ./pipelines:/home/pipeline-server/pipelines
Pin to an explicit core range:
version: '3.8'
services:
  dlstreamer-pipeline-server:
    image: intel/dlstreamer-pipeline-server:2025.2.0-ubuntu22
    environment:
      CORE_PINNING: "8-15"  # E-cores 8 through 15
    devices:
      - /dev/dri:/dev/dri
    ports:
      - "8080:8080"
    volumes:
      - ./pipelines:/home/pipeline-server/pipelines
Kubernetes Example#
For Kubernetes deployments, set the environment variable in the pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: dlstreamer-pipeline-server
spec:
  containers:
    - name: dlstreamer
      image: intel/dlstreamer-pipeline-server:2025.2.0-ubuntu22
      env:
        - name: CORE_PINNING
          value: "p-cores"
      resources:
        limits:
          gpu.intel.com/i915: 1
CORE_PINNING vs Manual taskset#
Recommendation: Use CORE_PINNING for DL Streamer Pipeline Server deployments to simplify
configuration and enable portable deployments across different platforms.
| Approach | Pros | Cons |
|---|---|---|
| CORE_PINNING | Declarative, container-native, works in Docker Compose/K8s, automatic core detection | Specific to DL Streamer Pipeline Server |
| Manual taskset | Universal (works with any application), explicit control | Requires shell wrapper, harder to manage in orchestration, manual core discovery |
Combining Core Pinning with GPU/NPU Offload#
A common optimization pattern for AI pipelines:
Pin the Pipeline Server to E-cores — reduces CPU power consumption
Offload inference to GPU or NPU — leaves more power budget for accelerators
Monitor power distribution — verify GPU/NPU frequencies increase
Example Docker Compose with GPU + E-core pinning:
version: '3.8'
services:
  dlstreamer-pipeline-server:
    image: intel/dlstreamer-pipeline-server:2025.2.0-ubuntu22
    environment:
      CORE_PINNING: e-cores
      DEVICE: GPU  # Offload inference to GPU
    devices:
      - /dev/dri:/dev/dri
    ports:
      - "8080:8080"
    volumes:
      - ./pipelines:/home/pipeline-server/pipelines
Verify the optimization:
# Terminal 1: Monitor package power
sudo ./get_package_power.sh -i 120 -s 1
# Terminal 2: Start the pipeline server
docker-compose up
# Terminal 3: Run a pipeline
curl -X POST http://localhost:8080/pipelines/object_detection/1
Check that package power decreases while GPU power (visible in turbostat GFXWatt) increases
or remains stable.
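Before comparing power figures, it helps to confirm that the affinity was actually applied to the running process. The sketch below reads the Cpus_allowed_list field that Linux exposes in /proc/<pid>/status; the function name allowed_cpus is hypothetical, and finding the right process ID for the server is left to the reader.

```shell
#!/usr/bin/env bash
# Sketch only: print the CPUs the kernel will schedule a process on.
# allowed_cpus PID
allowed_cpus() {
    # /proc/<pid>/status contains a line such as "Cpus_allowed_list: 8-21"
    awk '/^Cpus_allowed_list:/ { print $2 }' "/proc/$1/status"
}
```

A process pinned to the E-cores from the earlier example would report 8-21. The same information is available from taskset -cp <pid>.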
For complete documentation on DL Streamer Pipeline Server core pinning, see the official guide.
Recommendations: Which Cores to Pin?#
The optimal core pinning strategy depends on your workload characteristics:
E-cores for throughput workloads.
P-cores for latency-constrained workloads.
LPE-cores for background tasks.
Use E-cores when:
Your workload is parallelizable and scales with core count
Throughput (tasks/second) matters more than individual task latency
You want to leave more power budget for GPU/NPU
Examples: video encoding, batch inference, data processing pipelines
Why E-cores?
More E-cores are available (typically 2-3× the number of P-cores)
Lower per-core power consumption allows more cores to run simultaneously
Leaves power headroom for GPU and NPU to maintain high frequencies
Better aggregate throughput per watt
Example:
# Video transcoding pipeline on E-cores
taskset -c 8-21 ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -i input.mp4 -vf 'scale_vaapi=1920:1080' -c:v h264_vaapi output.mp4
# Batch inference on NPU with E-cores handling preprocessing
taskset -c 8-21 python batch_inference.py --device NPU
# DL Streamer Pipeline Server for high-throughput video analytics
# (using CORE_PINNING environment variable)
CORE_PINNING=e-cores docker-compose up
Use P-cores when:
Low latency is critical (interactive applications, real-time control)
Single-threaded or lightly-threaded workloads
You need maximum per-thread performance
Examples: UI rendering, game engines, real-time analytics, control loops
Why P-cores?
Higher per-core clock speeds than E-cores
Larger caches (L2 and shared L3)
Better single-threaded performance for latency-critical paths
Ideal for “main thread” logic that orchestrates parallel work
Example:
# Real-time object detection with DL Streamer
taskset -c 0-7 gst-launch-1.0 filesrc location=video.mp4 ! \
qtdemux ! h264parse ! vah264dec ! gvadetect model=yolov5.xml device=GPU ! \
gvafpscounter ! fakesink
# Industrial control loop on P-cores
taskset -c 0-3 ./motion_control_app --realtime
# DL Streamer Pipeline Server for low-latency inference
# (using CORE_PINNING environment variable)
CORE_PINNING=p-cores docker-compose up
Use LPE-cores when:
Tasks are low priority or non-latency-sensitive
You want to minimize interference with foreground workloads
Power efficiency is paramount
Examples: telemetry collection, logging, health checks
Example:
# Background telemetry agent on LPE-cores
taskset -c 22-25 ./telemetry_agent --interval 5s
# DL Streamer Pipeline Server for monitoring/logging pipelines
# (using CORE_PINNING environment variable)
CORE_PINNING=lp-cores docker-compose up
Summary and Best Practices#
The tools provided in Intel’s open-edge-platform repositories (obtain_cores.sh for core
detection, get_package_power.sh for package power monitoring, and npu-monitor-tool.py for
NPU monitoring), combined with turbostat for detailed power tracking, give you what you
need to implement effective core pinning strategies. For containerized AI workloads, the
DL Streamer Pipeline Server’s CORE_PINNING environment variable provides a declarative,
orchestration-friendly way to apply core affinity. Here are some recommendations on how to
proceed:
Profile first, optimize second: Use get_package_power.sh, turbostat, and npu-monitor-tool.py to establish baselines before pinning.
Match workload to core type:
Latency-sensitive → P-cores
Throughput-oriented → E-cores
Background tasks → LPE-cores
Leave cores idle when possible: Don’t spread workloads across all cores. Idle cores consume minimal power and leave more budget for accelerators.
Combine CPU pinning with GPU/NPU offload: For AI pipelines, pin CPU preprocessing to E-cores and run inference on GPU/NPU.
Use CORE_PINNING for containerized workloads: When using DL Streamer Pipeline Server, prefer the CORE_PINNING environment variable over manual taskset wrappers.
Monitor power distribution: Verify that your pinning strategy increases GPU/NPU power availability:
# Before pinning
sudo ./get_package_power.sh -i 60 -s 1 > baseline.log &
./workload_no_pinning
# After pinning
sudo ./get_package_power.sh -i 60 -s 1 > optimized.log &
taskset -c 8-15 ./workload_pinned
# Compare results
diff baseline.log optimized.log
Use CSV export for long-term analysis: Collect metrics over hours or days to understand power trends and workload characteristics.