Critical Importance of Core Pinning on Intel Edge Platforms

Today’s Intel edge processors are designed around a fundamental principle: power is a shared, finite resource. The processor’s total power budget (package power) is dynamically distributed among the CPU cores (P-cores, E-cores, and LPE-cores) and the uncore components, including the GPU, the NPU (Neural Processing Unit), the memory controllers, and I/O.

Core pinning lets you control precisely which cores are active. It prevents the operating system’s default scheduler from spreading your application across all available cores, for example when a single-threaded application wakes up multiple cores unnecessarily.

Core pinning mitigates problems such as:

  • Increased power consumption from activating unnecessary cores.

  • Reduced turbo frequencies as the processor throttles to stay within thermal limits.

  • GPU and NPU power starvation when CPU cores consume the bulk of the power budget.

  • Cache pollution and memory bandwidth contention from thread migration.

On modern hybrid processors such as Intel’s Arrow Lake, Lunar Lake, and Panther Lake platforms, the effect of core pinning is even more pronounced, for several reasons:

  • Heterogeneous core types: P-cores (Performance), E-cores (Efficient), and LPE-cores (Low Power Efficient) have drastically different power and performance characteristics.

  • Integrated accelerators: GPU and NPU share the same package power budget with CPU cores.

  • AI workloads: Vision inference, video analytics, and ML pipelines often combine CPU, GPU, and NPU, so competition for power becomes critical.

Tools for Detection and Monitoring

1. Detecting Core Types: obtain_cores.sh

Intel provides a comprehensive script to detect and enumerate core types on hybrid Intel platforms. The script is available in the edge-workloads-and-benchmarks repository.

Location: utils/obtain_cores.sh

Usage:

cd utils/
./obtain_cores.sh

Example output:

pcore:0,1,2,3,4,5,6,7
ecore:8,9,10,11,12,13,14,15,16,17,18,19,20,21
lpecore:22,23,24,25

This script uses multiple detection methods with fallbacks:

  1. Multi-socket Xeon detection — assigns all cores as P-cores on server platforms.

  2. CPUID-based detection — uses the cpuid instruction to identify Intel Core vs Intel Atom cores.

  3. sysfs validation — reads /sys/devices/cpu_core/, /sys/devices/cpu_atom/, /sys/devices/cpu_lowpower/.

  4. L1d cache drop detection — identifies E-cores by detecting cache size transitions.

  5. SMT pair detection — classifies remaining cores based on hyperthreading topology.
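
The sysfs validation step in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the logic of obtain_cores.sh itself: it assumes each PMU directory exposes a `cpus` file containing a kernel-style CPU list (this is the case for cpu_core and cpu_atom on hybrid kernels, and is assumed here for cpu_lowpower). The names `PMU_DIRS`, `parse_cpulist`, and `read_core_types` are hypothetical.

```python
from pathlib import Path

# Directory names come from the list above. The "cpus" file under
# cpu_lowpower is an assumption made for this sketch.
PMU_DIRS = {"pcore": "cpu_core", "ecore": "cpu_atom", "lpecore": "cpu_lowpower"}


def parse_cpulist(text: str) -> list[int]:
    """Expand a kernel-style CPU list such as '0-3,8' into [0, 1, 2, 3, 8]."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.extend(range(int(first), int(last or first) + 1))
    return cpus


def read_core_types(sys_devices: Path = Path("/sys/devices")) -> dict[str, list[int]]:
    """Return the CPUs listed under each hybrid PMU directory that exists."""
    result: dict[str, list[int]] = {}
    for label, dirname in PMU_DIRS.items():
        cpus_file = sys_devices / dirname / "cpus"
        if cpus_file.is_file():
            result[label] = parse_cpulist(cpus_file.read_text())
    return result


# On a non-hybrid machine none of the directories exist and this prints {}.
print(read_core_types())
```

Directories that are missing are skipped rather than treated as errors, which mirrors the idea of falling back to another detection method when sysfs has nothing to report.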

The script outputs comma-separated core IDs for each type, which can be directly used with the taskset command to pin workloads.

Pinning examples:

# Pin to P-cores only (for latency-sensitive workloads)
taskset -c 0,1,2,3,4,5,6,7 ./your_application

# Pin to E-cores only (for throughput workloads)
taskset -c 8,9,10,11,12,13,14,15,16,17,18,19,20,21 ./your_application

# Pin to LPE-cores (for background tasks)
taskset -c 22,23,24,25 ./background_service

For full documentation and usage examples, see the utils README.
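
When pinning is scripted rather than typed by hand, the script output can be parsed and turned into a taskset invocation. The sketch below works on the sample output shown earlier; `parse_core_map` and `taskset_command` are hypothetical helper names, and the `label:id,id,...` line format is assumed to match that sample.

```python
import shlex

# Sample output copied from the example above.
SAMPLE_OUTPUT = """\
pcore:0,1,2,3,4,5,6,7
ecore:8,9,10,11,12,13,14,15,16,17,18,19,20,21
lpecore:22,23,24,25
"""


def parse_core_map(output: str) -> dict[str, list[int]]:
    """Parse 'label:id,id,...' lines into a {label: [ids]} mapping."""
    core_map: dict[str, list[int]] = {}
    for line in output.splitlines():
        label, sep, ids = line.partition(":")
        if not sep:
            continue  # skip anything that is not a 'label:ids' line
        core_map[label.strip()] = [int(i) for i in ids.split(",") if i.strip()]
    return core_map


def taskset_command(cores: list[int], command: list[str]) -> list[str]:
    """Build an argv list that runs `command` pinned to `cores`."""
    return ["taskset", "-c", ",".join(map(str, cores)), *command]


core_map = parse_core_map(SAMPLE_OUTPUT)
argv = taskset_command(core_map["lpecore"], ["./background_service"])
print(shlex.join(argv))  # taskset -c 22,23,24,25 ./background_service
```

In a real deployment the text would come from running the script, for example with `subprocess.run(["./obtain_cores.sh"], capture_output=True, text=True).stdout`, and the resulting argv list could be passed straight to `subprocess.run`.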

2. Power Monitoring Tools

To verify that core pinning is actually improving your power efficiency and performance, you need to monitor power consumption across all compute resources.

get_package_power.sh — Package Power Monitoring via RAPL and hwmon

Intel provides a dedicated script to sample platform package power consumption using RAPL (Running Average Power Limit) sysfs interfaces and hardware monitoring sensors. The script is available in the edge-workloads-and-benchmarks repository.

Location: utils/get_package_power.sh

Usage:

sudo ./get_package_power.sh -i <duration (seconds)> -s <sampling interval (seconds)> -d <delay (seconds)>

Options:

  • -s <seconds> — Sampling interval in seconds (default: 1)

  • -i <seconds> — Total duration in seconds (default: 60)

  • -d <seconds> — Start delay in seconds (default: 0)

Example:

# Measure package power for 60 seconds at 1-second intervals
sudo ./get_package_power.sh -i 60 -s 1

# Measure for 30 seconds with a 5-second delay before starting
sudo ./get_package_power.sh -i 30 -s 1 -d 5

Example output:

[ Info ] Monitoring for 60s after a 0s delay

[rapl] card0 (xe @ 0000:00:02.0): 15.23 W
[rapl] card0 (xe @ 0000:00:02.0): 14.87 W
[rapl] card0 (xe @ 0000:00:02.0): 15.45 W
...

[ Info ] Monitoring complete

Output format:

[source] card# (driver @ pci): power W
  • source: Either rapl (RAPL energy counters) or hwmon (hardware monitoring sensors)

  • card#: DRM card identifier (e.g., card0)

  • driver: Graphics driver (i915 or xe)

  • pci: PCI device address

  • power: Average power over the sampling interval, in watts

How it works:

  1. Discovers Intel graphics devices via /sys/class/drm/card*/ (i915 or xe drivers)

  2. Detects power sensors using either:

    • Hardware monitoring sensors (hwmon) with package/card energy or power labels

    • RAPL energy counters (/sys/class/powercap/intel-rapl:*/energy_uj)

  3. Samples power by reading energy counters at the start and end of each interval, computing power as:

    power (W) = (end_energy_uj - start_energy_uj) / 1,000,000 / interval_duration_s
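
The energy-delta calculation can be illustrated as follows. This is a minimal sketch, not the script's actual code: RAPL's energy_uj counter reports microjoules, so the delta is divided by 1,000,000 to obtain joules, and the counter can wrap around, which the sketch handles using the zone's max_energy_range_uj file. The function names `average_power_watts` and `sample_rapl` are hypothetical.

```python
import time
from pathlib import Path
from typing import Optional


def average_power_watts(start_uj: int, end_uj: int, interval_s: float,
                        max_range_uj: Optional[int] = None) -> float:
    """Average power over an interval from two energy readings in microjoules."""
    delta_uj = end_uj - start_uj
    if delta_uj < 0 and max_range_uj is not None:
        # The counter wrapped around between the two reads.
        delta_uj += max_range_uj
    return delta_uj / 1_000_000 / interval_s


def sample_rapl(zone: Path, interval_s: float = 1.0) -> float:
    """Read one RAPL zone twice and return the average power in watts."""
    max_range = int((zone / "max_energy_range_uj").read_text())
    start = int((zone / "energy_uj").read_text())
    t0 = time.monotonic()
    time.sleep(interval_s)
    end = int((zone / "energy_uj").read_text())
    return average_power_watts(start, end, time.monotonic() - t0, max_range)


# 15.23 J consumed in 1 s corresponds to 15.23 W.
print(average_power_watts(1_000_000, 16_230_000, 1.0))  # 15.23
```

On a live system, `sample_rapl(Path("/sys/class/powercap/intel-rapl:0"))` would sample the first package zone; reading energy_uj typically requires root privileges.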
    

Use cases:

  • Compare before/after core pinning: Run the script during your workload with and without core pinning to quantify power savings

  • Monitor GPU power availability: Check if freeing up CPU cores allows the GPU to consume more power (higher frequency)

  • Long-term profiling: Use with longer durations to understand power patterns over time

Example workflow:

# Baseline: workload without core pinning
sudo ./get_package_power.sh -i 60 -s 1 &
./my_workload

# Optimized: workload with E-core pinning
sudo ./get_package_power.sh -i 60 -s 1 &
taskset -c 8-21 ./my_workload

For full documentation, see the utils README.

turbostat — Detailed Package, Core, and Graphics Power

The turbostat utility, which ships with the Linux kernel tools, provides more granular power breakdowns:

sudo turbostat --interval 1

Key metrics to watch:

  • PkgWatt: Total package power (CPU + GPU + uncore)

  • CorWatt: Power consumed by CPU cores only

  • GFXWatt: Graphics (GPU) power consumption

  • RAMWatt: DRAM power

Example output snippet:

Core CPU  Avg_MHz Busy%  Bzy_MHz  PkgWatt  CorWatt  GFXWatt
-    -    2100    50.0   4200     25.0     15.0     5.0
0    0    4200    100.0  4200
1    1    4200    100.0  4200

By comparing power metrics before and after core pinning, you can quantify the impact on package power and GPU power availability. Use get_package_power.sh for simple package-level measurements, and turbostat when you need detailed per-core and component-level breakdowns.
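
To compare runs numerically, the summary row of a turbostat snapshot can be reduced to the share of package power drawn by the cores and by the GPU. The sketch below parses the example snippet shown earlier. It assumes the summary row is the one with "-" in its first two columns, as in that snippet; real turbostat output may contain additional leading columns depending on the platform and options. The names `summary_row` and `power_shares` are hypothetical.

```python
# Example snippet copied from the turbostat output shown above.
SAMPLE = """\
Core CPU  Avg_MHz Busy%  Bzy_MHz  PkgWatt  CorWatt  GFXWatt
-    -    2100    50.0   4200     25.0     15.0     5.0
0    0    4200    100.0  4200
1    1    4200    100.0  4200
"""


def summary_row(text: str) -> dict[str, float]:
    """Return the numeric columns of the summary row (Core and CPU are '-')."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header = lines[0]
    for fields in lines[1:]:
        if fields[:2] == ["-", "-"]:
            return {name: float(value)
                    for name, value in zip(header[2:], fields[2:])}
    raise ValueError("no summary row found")


def power_shares(row: dict[str, float]) -> dict[str, float]:
    """Express core and graphics power as fractions of package power."""
    pkg = row["PkgWatt"]
    return {name: row[name] / pkg
            for name in ("CorWatt", "GFXWatt") if name in row}


row = summary_row(SAMPLE)
shares = power_shares(row)
print(shares)  # {'CorWatt': 0.6, 'GFXWatt': 0.2}
```

A falling CorWatt share together with a stable or rising GFXWatt share after pinning suggests that the freed budget is reaching the GPU.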

npu-monitor-tool.py — NPU Power and Utilization

For workloads using the Intel NPU (Neural Processing Unit), monitor NPU-specific metrics using the NPU monitoring tool from the edge-ai-libraries repository.

Location: tools/npu-monitor-tool/npu-monitor-tool.py

Usage:

sudo python3 npu-monitor-tool.py -i 1000

Example output:

+-----------------------------------------------------------------------------------------------+
| INTEL NPU Device: 0x7d1d   | version: 1.0.0                                                   |
| Firmware version: IVPU_MTL_20240112_v2024.01                                                  |
+===============================================================================================+
|       Power Usage        |      DPU Freq        | NPU DDR Average Bandwidth   |    Tile Conf  |
|                2.5 [W]   |        1400 [Hz]     |               123.45 [MB/s] |             4 |
+===============================================================================================+
|       NPU Temperature    |      NPU Utilization       |      Memory Usage                     |
|              45 [°C]     |                      25%   |                         512.00 [MB]   |
+-----------------------------------------------------------------------------------------------+

CSV export is available for long-term analysis:

sudo python3 npu-monitor-tool.py --csv -i 1000

This generates timestamped CSV files in npu_output/ with the following columns:

timestamp, power, frequency, bandwidth, tile_config, temperature, utilization, memory_usage
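
A CSV file with these columns can be summarized with the Python standard library. The sketch below is illustrative only: it assumes the file starts with a header row using the column names listed above and that values are plain numbers. The two data rows are invented for the example, and `summarize` is a hypothetical helper name.

```python
import csv
import io
from statistics import mean

# Two invented rows in the column layout listed above (not real measurements).
SAMPLE_CSV = """\
timestamp,power,frequency,bandwidth,tile_config,temperature,utilization,memory_usage
2025-01-01T00:00:00,2.5,1400,123.45,4,45,25,512.00
2025-01-01T00:00:01,3.5,1400,130.00,4,46,35,512.00
"""


def summarize(csv_file) -> dict[str, float]:
    """Compute mean power and peak utilization from an open CSV file."""
    rows = list(csv.DictReader(csv_file, skipinitialspace=True))
    if not rows:
        raise ValueError("CSV file contains no samples")
    return {
        "mean_power_w": mean(float(r["power"]) for r in rows),
        "peak_utilization": max(float(r["utilization"]) for r in rows),
    }


print(summarize(io.StringIO(SAMPLE_CSV)))
# {'mean_power_w': 3.0, 'peak_utilization': 35.0}
```

For a real capture, pass an open file from the npu_output/ directory instead of the in-memory sample, and adjust the column names if the actual header differs.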

For complete documentation, see the npu-monitor-tool README.

Core Pinning with DL Streamer Pipeline Server

For AI video analytics workloads using the DL Streamer Pipeline Server, Intel provides built-in support for core pinning via the CORE_PINNING environment variable. This eliminates the need to manually wrap the server with taskset and provides a declarative way to specify core affinity in Docker Compose or Kubernetes deployments.

Using the CORE_PINNING Environment Variable

The CORE_PINNING environment variable accepts two types of values:

  1. Explicit core list or range (taskset-compatible syntax):

    • Comma-delimited list: 10,12,14

    • Range: 10-14

    • Range with step: 10-14:2 (cores 10, 12, 14)

  2. Core type specification (automatic detection):

    • p-cores — Pin to Performance cores

    • e-cores — Pin to Efficient cores

    • lp-cores — Pin to Low Power Efficient cores

The server automatically detects the appropriate cores using the same detection logic as obtain_cores.sh and applies taskset internally.
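
To make the two kinds of value concrete, the following sketch classifies a CORE_PINNING string and expands explicit lists into core IDs. It is not the Pipeline Server's implementation; `classify` and `expand_core_list` are hypothetical names, and the stride separator is accepted as either ":" (the form the taskset manual documents) or "/" as an assumption made for this example.

```python
import re

# Core-type keywords from the list above.
CORE_TYPES = {"p-cores", "e-cores", "lp-cores"}

# N, N-M, or N-M with a stride; both ':' and '/' are accepted here (assumption).
_RANGE = re.compile(r"^(\d+)(?:-(\d+)(?:[:/](\d+))?)?$")


def expand_core_list(value: str) -> list[int]:
    """Expand '10,12,14', '10-14' or '10-14:2' into a sorted list of core IDs."""
    cores: list[int] = []
    for part in value.split(","):
        match = _RANGE.match(part.strip())
        if match is None:
            raise ValueError(f"invalid core specification: {part!r}")
        first = int(match.group(1))
        last = int(match.group(2) or first)
        step = int(match.group(3) or 1)
        if last < first or step < 1:
            raise ValueError(f"invalid range: {part!r}")
        cores.extend(range(first, last + 1, step))
    return sorted(set(cores))


def classify(value: str):
    """Return ('type', keyword) for a core type, else ('list', [core IDs])."""
    if value in CORE_TYPES:
        return ("type", value)
    return ("list", expand_core_list(value))


print(classify("10-14:2"))  # ('list', [10, 12, 14])
print(classify("e-cores"))  # ('type', 'e-cores')
```

Validating the value before deployment in this way catches typos such as a reversed range before the container starts.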

Docker Compose Example

Pin to P-cores:

version: '3.8'
services:
  dlstreamer-pipeline-server:
    image: intel/dlstreamer-pipeline-server:2025.2.0-ubuntu22
    environment:
      CORE_PINNING: p-cores
    devices:
      - /dev/dri:/dev/dri
    ports:
      - "8080:8080"
    volumes:
      - ./pipelines:/home/pipeline-server/pipelines

Pin to E-cores:

version: '3.8'
services:
  dlstreamer-pipeline-server:
    image: intel/dlstreamer-pipeline-server:2025.2.0-ubuntu22
    environment:
      CORE_PINNING: e-cores
    devices:
      - /dev/dri:/dev/dri
    ports:
      - "8080:8080"
    volumes:
      - ./pipelines:/home/pipeline-server/pipelines

Pin to an explicit core range:

version: '3.8'
services:
  dlstreamer-pipeline-server:
    image: intel/dlstreamer-pipeline-server:2025.2.0-ubuntu22
    environment:
      CORE_PINNING: "8-15"  # E-cores 8 through 15
    devices:
      - /dev/dri:/dev/dri
    ports:
      - "8080:8080"
    volumes:
      - ./pipelines:/home/pipeline-server/pipelines

Kubernetes Example

For Kubernetes deployments, set the environment variable in the pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: dlstreamer-pipeline-server
spec:
  containers:
  - name: dlstreamer
    image: intel/dlstreamer-pipeline-server:2025.2.0-ubuntu22
    env:
    - name: CORE_PINNING
      value: "p-cores"
    resources:
      limits:
        gpu.intel.com/i915: 1

CORE_PINNING vs Manual taskset

Recommendation: Use CORE_PINNING for DL Streamer Pipeline Server deployments to simplify configuration and enable portable deployments across different platforms.

| Approach | Pros | Cons |
|---|---|---|
| CORE_PINNING env var | Declarative, container-native, works in Docker Compose/K8s, automatic core detection | Specific to DL Streamer Pipeline Server |
| Manual taskset | Universal (works with any application), explicit control | Requires shell wrapper, harder to manage in orchestration, manual core discovery |

Combining Core Pinning with GPU/NPU Offload

A common optimization pattern for AI pipelines:

  1. Pin the Pipeline Server to E-cores — reduces CPU power consumption

  2. Offload inference to GPU or NPU — leaves more power budget for accelerators

  3. Monitor power distribution — verify GPU/NPU frequencies increase

Example Docker Compose with GPU + E-core pinning:

version: '3.8'
services:
  dlstreamer-pipeline-server:
    image: intel/dlstreamer-pipeline-server:2025.2.0-ubuntu22
    environment:
      CORE_PINNING: e-cores
      DEVICE: GPU  # Offload inference to GPU
    devices:
      - /dev/dri:/dev/dri
    ports:
      - "8080:8080"
    volumes:
      - ./pipelines:/home/pipeline-server/pipelines

Verify the optimization:

# Terminal 1: Monitor package power
sudo ./get_package_power.sh -i 120 -s 1

# Terminal 2: Start the pipeline server
docker-compose up

# Terminal 3: Run a pipeline
curl -X POST http://localhost:8080/pipelines/object_detection/1

Check that package power decreases while GPU power (visible in turbostat GFXWatt) increases or remains stable.

For complete documentation on DL Streamer Pipeline Server core pinning, see the official guide.

Recommendations: Which Cores to Pin?

The optimal core pinning strategy depends on your workload characteristics:

  • E-cores for throughput workloads.

  • P-cores for latency-constrained workloads.

  • LPE-cores for background tasks.

Use E-cores when:

  • Your workload is parallelizable and scales with core count

  • Throughput (tasks/second) matters more than individual task latency

  • You want to leave more power budget for GPU/NPU

  • Examples: video encoding, batch inference, data processing pipelines

Why E-cores?

  • On many hybrid parts, E-cores outnumber P-cores (the exact ratio depends on the SKU)

  • Lower per-core power consumption allows more cores to run simultaneously

  • Leaves power headroom for GPU and NPU to maintain high frequencies

  • Better aggregate throughput per watt

Example:

# Video transcoding pipeline on E-cores
taskset -c 8-21 ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -i input.mp4 -vf 'scale_vaapi=w=1920:h=1080' -c:v h264_vaapi output.mp4

# Batch inference on NPU with E-cores handling preprocessing
taskset -c 8-21 python batch_inference.py --device NPU

# DL Streamer Pipeline Server for high-throughput video analytics
# (assumes the Compose file reads CORE_PINNING from the shell, e.g. CORE_PINNING: ${CORE_PINNING})
CORE_PINNING=e-cores docker-compose up

Use P-cores when:

  • Low latency is critical (interactive applications, real-time control)

  • Single-threaded or lightly-threaded workloads

  • You need maximum per-thread performance

  • Examples: UI rendering, game engines, real-time analytics, control loops

Why P-cores?

  • Higher per-core clock speeds than E-cores (the gap varies by SKU)

  • Larger caches (L2 and shared L3)

  • Better single-threaded performance for latency-critical paths

  • Ideal for “main thread” logic that orchestrates parallel work

Example:

# Real-time object detection with DL Streamer
taskset -c 0-7 gst-launch-1.0 filesrc location=video.mp4 ! \
    qtdemux ! h264parse ! vah264dec ! gvadetect model=yolov5.xml device=GPU ! \
    gvafpscounter ! fakesink

# Industrial control loop on P-cores
taskset -c 0-3 ./motion_control_app --realtime

# DL Streamer Pipeline Server for low-latency inference
# (assumes the Compose file reads CORE_PINNING from the shell, e.g. CORE_PINNING: ${CORE_PINNING})
CORE_PINNING=p-cores docker-compose up

Use LPE-cores when:

  • Tasks are low priority or non-latency-sensitive

  • You want to minimize interference with foreground workloads

  • Power efficiency is paramount

  • Examples: telemetry collection, logging, health checks

Example:

# Background telemetry agent on LPE-cores
taskset -c 22-25 ./telemetry_agent --interval 5s

# DL Streamer Pipeline Server for monitoring/logging pipelines
# (assumes the Compose file reads CORE_PINNING from the shell, e.g. CORE_PINNING: ${CORE_PINNING})
CORE_PINNING=lp-cores docker-compose up

Summary and Best Practices

The tools in Intel’s open-edge-platform repositories give you everything you need to implement effective core pinning strategies: obtain_cores.sh for core detection, get_package_power.sh for package power monitoring, and npu-monitor-tool.py for NPU monitoring, combined with turbostat for detailed power tracking. For containerized AI workloads, the DL Streamer Pipeline Server’s CORE_PINNING environment variable provides a declarative, orchestration-friendly way to apply core affinity. Here are some recommendations on how to proceed:

  1. Profile first, optimize second: Use get_package_power.sh, turbostat, and npu-monitor-tool.py to establish baselines before pinning.

  2. Match workload to core type:

    • Latency-sensitive → P-cores

    • Throughput-oriented → E-cores

    • Background tasks → LPE-cores

  3. Leave cores idle when possible: Don’t spread workloads across all cores. Idle cores consume minimal power and leave more budget for accelerators.

  4. Combine CPU pinning with GPU/NPU offload: For AI pipelines, pin CPU preprocessing to E-cores and run inference on GPU/NPU.

  5. Use CORE_PINNING for containerized workloads: When using DL Streamer Pipeline Server, prefer the CORE_PINNING environment variable over manual taskset wrappers.

  6. Monitor power distribution: Verify that your pinning strategy increases GPU/NPU power availability:

    # Before pinning
    sudo ./get_package_power.sh -i 60 -s 1 > baseline.log &
    ./workload_no_pinning
    
    # After pinning
    sudo ./get_package_power.sh -i 60 -s 1 > optimized.log &
    taskset -c 8-15 ./workload_pinned
    
    # Compare results
    diff baseline.log optimized.log
    
  7. Use CSV export for long-term analysis: Collect metrics over hours or days to understand power trends and workload characteristics.
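
Because individual power samples fluctuate, comparing the averages of two runs is usually more informative than a line-by-line comparison. The sketch below extracts the watt values from log text in the `[source] card# (driver @ pci): power W` format described earlier and reports the difference in mean power. The baseline samples are taken from the example output above; the optimized samples are invented purely for illustration, and `mean_power` is a hypothetical helper name.

```python
import re
from statistics import mean

# Matches the trailing ': <number> W' of a sample line.
_WATTS = re.compile(r":\s*([0-9]+(?:\.[0-9]+)?)\s*W\s*$")


def mean_power(log_text: str) -> float:
    """Average all power samples found in a get_package_power.sh log."""
    samples = [float(m.group(1)) for line in log_text.splitlines()
               if (m := _WATTS.search(line))]
    if not samples:
        raise ValueError("no power samples found")
    return mean(samples)


# Baseline values come from the example output shown earlier.
BASELINE = """\
[ Info ] Monitoring for 60s after a 0s delay
[rapl] card0 (xe @ 0000:00:02.0): 15.23 W
[rapl] card0 (xe @ 0000:00:02.0): 14.87 W
[rapl] card0 (xe @ 0000:00:02.0): 15.45 W
[ Info ] Monitoring complete
"""

# Invented values, for illustration only.
OPTIMIZED = """\
[rapl] card0 (xe @ 0000:00:02.0): 12.10 W
[rapl] card0 (xe @ 0000:00:02.0): 11.90 W
[rapl] card0 (xe @ 0000:00:02.0): 12.00 W
"""

saved = mean_power(BASELINE) - mean_power(OPTIMIZED)
print(f"Mean package power reduced by {saved:.2f} W")
```

With real captures, replace the inline strings with `Path("baseline.log").read_text()` and `Path("optimized.log").read_text()` (using `pathlib.Path`).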


Additional Resources