Heterogeneous AI Inference on Intel Core Ultra processors#

Overview#

Modern Intel Core Ultra processors integrate CPU, GPU and NPU on a single SoC die, sharing a common power budget. Intel DL Streamer lets you assign individual GStreamer pipeline sub-streams to specific devices using the device= property on inference elements — enabling you to distribute workloads across all available compute engines simultaneously.

This page demonstrates one practical configuration: a heavily loaded GPU running multi-stream video analytics, while an additional stream is dispatched to the NPU. The goal is to explain the principles behind workload distribution and show how to build heterogeneous pipelines in DL Streamer. Actual throughput gains depend on your own combination of models, resolutions, and pipeline structure — the results shown here reflect one specific scenario and should not be generalized as universal rules.

Platform#

The Intel Core Ultra SoCs integrate CPU, GPU and NPU on a single die. All three compute engines share a common SoC power budget (TDP envelope).

The Intel integrated GPU on this platform is split into two tiles:

GT0 — render/compute tile: executes OpenVINO inference (OpenCL/SYCL)
GT1 — media tile: handles H.264/H.265 hardware video decode (VA-API) and post-processing

This separation is important: GT0 is busy with inference while GT1 independently handles video decode, leaving the CPU largely free for pipeline orchestration.

Workload Distribution Strategy#

Assigning workloads to the right compute engine is the key design decision. The principles below guided the configuration in this demo and can be applied to your own pipelines.

Route heavy workloads to GPU, lighter workloads to NPU#

The GPU typically offers more raw compute power and is best suited for large, complex models. The NPU is designed for sustained inference at lower power and excels at workloads that are lighter in compute but run continuously. In this demo the large vision model and the license plate detection + OCR chain occupy the GPU, while a single face detection model runs on the NPU.

There is no universal rule that maps model types to devices. The right split depends on how compute-heavy each workload is, how much capacity is available on each device, and how the SoC power budget is shared. Profile your own pipelines and adjust device= assignments accordingly.

Use NPU to relieve GPU contention#

When the GPU is already heavily loaded, adding more inference tasks to it causes competition for compute resources: existing workloads lose throughput, the GPU frequency may throttle, and the whole pipeline degrades. Offloading a suitable workload to the NPU adds compute capacity without competing for GT0. The NPU runs in parallel with the GPU, although both still draw from the shared SoC power budget.

The demo shows the effect concretely: with the LVM and license plate pipelines already occupying GT0, adding face detection on the GPU degrades license plate throughput. Moving face detection to the NPU fully recovers that loss while sustaining the extra stream.

Split independent tasks into separate sub-pipelines#

When multiple models process the same video source, avoid chaining them inline if their processing rates differ significantly. A slow model placed inline stalls faster downstream elements — for example, a large vision model would throttle a downstream detector capable of hundreds of fps.

Instead, use separate filesrc ! decodebin3 paths (or a tee element) for each task. In this demo, the LVM alert pipeline and the license plate pipeline each have their own decode path from the same source file, so the LVM’s low cadence does not limit the license plate detector.
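
As a rough sketch of the tee alternative mentioned above, the fragment below decodes the source once and fans the frames out to both tasks. The `leaky=downstream` and `max-size-buffers=2` settings on the LVM branch are an assumption added here and are not part of the original demo: without them, a full queue in front of the slow model could block the tee and stall the faster branch. `GST_LAUNCH` and `launch_tee_variant` are hypothetical names used only for this illustration.

```shell
#!/bin/sh
# Sketch only: one decode path shared by two inference branches through a tee.
# GST_LAUNCH is a hypothetical override; GST_LAUNCH=echo prints the pipeline
# instead of running it.
GST_LAUNCH="${GST_LAUNCH:-gst-launch-1.0}"
VIDEO="Videos/police_highway_1280_720_60fps_loop10.mp4"

# Assumption: the leaky queue lets the LVM branch drop frames it cannot keep
# up with, so the tee is never blocked by the slow branch.
launch_tee_variant() {
    "$GST_LAUNCH" \
        filesrc location="$VIDEO" ! decodebin3 ! tee name=t \
        t. ! queue leaky=downstream max-size-buffers=2 \
           ! gvagenai model-path=models/InternVL3_5-2B device=GPU \
               prompt="Is there a police car? Answer yes or no." \
               generation-config=max_new_tokens=1,num_beams=4 \
               chunk-size=1 frame-rate=0.5 metrics=true \
           ! queue ! gvafpscounter ! fakesink sync=false \
        t. ! queue \
           ! gvadetect model=models/yolov8_license_plate_detector/FP32/yolov8_license_plate_detector.xml \
               device=GPU batch-size=2 model-instance-id=inf1 \
           ! queue \
           ! gvaclassify model=models/ch_PP-OCRv4_rec_infer/FP32/ch_PP-OCRv4_rec_infer.xml \
               device=GPU model-instance-id=inf2 \
           ! queue ! gvafpscounter ! fakesink sync=false
}
```

Calling `launch_tee_variant` starts the pipeline. The demo itself uses separate decode paths, which avoids the coupling entirely at the cost of a second decode session on GT1.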

Hardware video decode via decodebin3#

decodebin3 automatically selects the best available decoder. On platforms with an Intel iGPU using the xe driver, it resolves to VA-API hardware decode running on GT1 (the GPU media tile). Each active sub-pipeline opens its own decode context; with three parallel sub-pipelines, three concurrent decode sessions run on GT1.

HW decode on GT1 is fast and leaves GT0 free for inference. At high stream counts or high resolutions, GT1 can become a bottleneck — monitor GT1 utilization to detect this condition.

Optimize sub-pipelines with the DL Streamer Optimizer#

The DL Streamer Optimizer can automatically tune inference parameters such as nireq and batch-size for each sub-pipeline and target device. A standalone NPU workload is a particularly good candidate: its parameters can be tuned in isolation without being constrained by adjacent GPU tasks.

Example Scenario#

This demo uses two video files representing different camera feeds in a city surveillance scenario:

Role	Video	Resolution	Frame Rate
Main	Police highway scene	1280×720	60 fps
Extra	Metro crowd scene	1280×720	30 fps

Main video: Pexels #2103099
Extra video: Pexels #18553046

Pipeline Design#

The full pipeline is composed of three independently running sub-pipelines launched in a single gst-launch-1.0 command. Sub-pipelines A and B process the same source video via separate decode paths; sub-pipeline C processes a different video feed.

Sub-pipeline A — Large Vision Model (LVM) alert detection#

Runs a large vision model at a low frame rate (0.5 fps) to answer a natural-language question about the video content. This is the heaviest inference task and the primary driver of GT0 utilization.

Based on the DL Streamer VLM alerts sample.

filesrc location=Videos/police_highway_1280_720_60fps_loop10.mp4 ! decodebin3
  ! gvagenai model-path=models/InternVL3_5-2B device=GPU
      prompt="Is there a police car? Answer yes or no."
      generation-config=max_new_tokens=1,num_beams=4
      chunk-size=1 frame-rate=0.5 metrics=true
  ! queue ! gvafpscounter ! fakesink sync=false

Example output:

LVM alert detection output

Sub-pipeline B — License plate detection + OCR#

Runs at full video frame rate: object detection followed by OCR on cropped plate regions. Runs in parallel with sub-pipeline A on its own decode path — the LVM’s low cadence does not limit its throughput.

filesrc location=Videos/police_highway_1280_720_60fps_loop10.mp4 ! decodebin3
  ! gvadetect model=models/yolov8_license_plate_detector/FP32/yolov8_license_plate_detector.xml
      device=GPU batch-size=2 model-instance-id=inf1
  ! queue
  ! gvaclassify model=models/ch_PP-OCRv4_rec_infer/FP32/ch_PP-OCRv4_rec_infer.xml
      device=GPU model-instance-id=inf2
  ! queue ! gvafpscounter ! fakesink sync=false

Example output:

License plate detection and OCR output

Sub-pipeline C — Face detection (extra workload)#

Processes a separate video feed. This sub-pipeline demonstrates GPU vs NPU dispatch: changing the device= property (and optionally tuning parameters such as nireq) routes the same model to a different compute engine.

Based on the DL Streamer face detection and classification sample.

GPU variant (Case 2):

filesrc location=Videos/metro_1280_720_30fps_loop10.mp4 ! decodebin3
  ! gvadetect model=models/YOLOv8-Face-Detection/INT8/YOLOv8-Face-Detection.xml
      device=GPU batch-size=2 model-instance-id=inf3
  ! queue ! gvafpscounter ! fakesink sync=false

NPU variant (Case 3):

filesrc location=Videos/metro_1280_720_30fps_loop10.mp4 ! decodebin3
  ! gvadetect model=models/YOLOv8-Face-Detection/INT8/YOLOv8-Face-Detection.xml
      device=NPU nireq=2 batch-size=2 model-instance-id=inf3
  ! queue ! gvafpscounter ! fakesink sync=false

Example output (NPU):

Face detection output
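
The three fragments above can be joined into a single gst-launch-1.0 invocation, as the start of this section describes. The script below is a minimal sketch of that, with the device for sub-pipeline C passed as an argument. `GST_LAUNCH` and `launch_demo` are illustrative names rather than part of DL Streamer, and the model and video paths are simply the ones used on this page.

```shell
#!/bin/sh
# Sketch: join sub-pipelines A, B and C into one gst-launch-1.0 invocation.
# GST_LAUNCH is a hypothetical override; GST_LAUNCH=echo gives a dry run.
GST_LAUNCH="${GST_LAUNCH:-gst-launch-1.0}"
MAIN_VIDEO="Videos/police_highway_1280_720_60fps_loop10.mp4"
EXTRA_VIDEO="Videos/metro_1280_720_30fps_loop10.mp4"

# Usage: launch_demo GPU   (Case 2)   or   launch_demo NPU   (Case 3)
launch_demo() {
    extra_device="${1:-NPU}"
    # The NPU variant on this page also sets nireq=2; the GPU variant does not.
    extra_opts=""
    if [ "$extra_device" = "NPU" ]; then
        extra_opts="nireq=2"
    fi

    "$GST_LAUNCH" \
        filesrc location="$MAIN_VIDEO" ! decodebin3 \
        ! gvagenai model-path=models/InternVL3_5-2B device=GPU \
            prompt="Is there a police car? Answer yes or no." \
            generation-config=max_new_tokens=1,num_beams=4 \
            chunk-size=1 frame-rate=0.5 metrics=true \
        ! queue ! gvafpscounter ! fakesink sync=false \
        filesrc location="$MAIN_VIDEO" ! decodebin3 \
        ! gvadetect model=models/yolov8_license_plate_detector/FP32/yolov8_license_plate_detector.xml \
            device=GPU batch-size=2 model-instance-id=inf1 \
        ! queue \
        ! gvaclassify model=models/ch_PP-OCRv4_rec_infer/FP32/ch_PP-OCRv4_rec_infer.xml \
            device=GPU model-instance-id=inf2 \
        ! queue ! gvafpscounter ! fakesink sync=false \
        filesrc location="$EXTRA_VIDEO" ! decodebin3 \
        ! gvadetect model=models/YOLOv8-Face-Detection/INT8/YOLOv8-Face-Detection.xml \
            device="$extra_device" $extra_opts batch-size=2 model-instance-id=inf3 \
        ! queue ! gvafpscounter ! fakesink sync=false
}
```

Running `launch_demo NPU` corresponds to the heterogeneous configuration and `launch_demo GPU` to the all-GPU one. The GPU-only baseline without sub-pipeline C is not covered by this sketch.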

Test Cases#

Three configurations are compared. Sub-pipelines A and B run on the GPU in every case; the cases differ only in whether sub-pipeline C is present and which device runs it.

Case 1 — Main workload only (GPU baseline)#

Sub-pipelines A and B run on GPU. No extra workload.

Compute allocation:

Workload	Device
LVM (gvagenai)	GPU GT0
License plate detection + OCR	GPU GT0
Video decode (2 streams)	GPU GT1

Pipeline architecture:

flowchart LR
    Video1["Police Highway Video"]
    
    Video1 -.-> V1A[filesrc]
    Video1 -.-> V1B[filesrc]
    
    V1A --> D1
    V1B --> D2
    
    subgraph GT1["GPU GT1 Media Tile (Shared HW Decode)"]
        D1[decodebin3]
        D2[decodebin3]
    end
    
    subgraph SubA["Sub-pipeline A: LVM Alerts"]
        D1 --> LVM["gvagenai<br/>device=GPU<br/>InternVL3_5-2B"] --> Q1[queue] --> F1[fakesink]
    end
    
    subgraph SubB["Sub-pipeline B: License Plate Detection + OCR"]
        D2 --> DET["gvadetect<br/>device=GPU<br/>YOLOv8 LP"] --> Q2[queue] --> OCR["gvaclassify<br/>device=GPU<br/>PP-OCRv4"] --> Q3[queue] --> F2[fakesink]
    end
    
    style GT1 fill:#fff4e6,stroke:#ff9800,stroke-width:3px

Observed behavior:

GPU compute tile (GT0) is heavily utilized by the combined LVM and detection workloads
LVM dominates GT0 time; license plate detection runs alongside it, sharing GPU resources
GPU operates at high frequency for this workload configuration
CPU involvement is low — mostly GStreamer pipeline management
NPU is idle

Case 2 — Main + face detection, all on GPU#

Sub-pipelines A, B and C (GPU variant) run together.

Compute allocation:

Workload	Device
LVM (gvagenai)	GPU GT0
License plate detection + OCR	GPU GT0
Face detection (metro video)	GPU GT0
Video decode (3 streams)	GPU GT1

Pipeline architecture:

flowchart LR
    Video1["Police Highway Video"]
    Video2["Metro Crowd Video"]
    
    Video1 -.-> V1A[filesrc]
    Video1 -.-> V1B[filesrc]
    Video2 -.-> V2[filesrc]
    
    V1A --> D1
    V1B --> D2
    V2 --> D3
    
    subgraph GT1["GPU GT1 Media Tile (Shared HW Decode)"]
        D1[decodebin3]
        D2[decodebin3]
        D3[decodebin3]
    end
    
    subgraph SubA["Sub-pipeline A: LVM Alerts"]
        D1 --> LVM["gvagenai<br/>device=GPU<br/>InternVL3_5-2B"] --> Q1[queue] --> F1[fakesink]
    end
    
    subgraph SubB["Sub-pipeline B: License Plate Detection + OCR"]
        D2 --> DET["gvadetect<br/>device=GPU<br/>YOLOv8 LP"] --> Q2[queue] --> OCR["gvaclassify<br/>device=GPU<br/>PP-OCRv4"] --> Q3[queue] --> F2[fakesink]
    end
    
    subgraph SubC["Sub-pipeline C: Face Detection"]
        D3 --> FACE["gvadetect<br/>device=GPU<br/>YOLOv8 Face"] --> Q4[queue] --> F3[fakesink]
    end
    
    style GT1 fill:#fff4e6,stroke:#ff9800,stroke-width:3px

Observed behavior:

GT0 must now share resources among three competing inference tasks
The license plate sub-pipeline loses throughput to the added face detection workload; the degradation is significant
GPU frequency may throttle as the combined compute demand pushes the shared SoC power budget
NPU remains idle

Case 3 — Main on GPU, face detection on NPU (heterogeneous)#

Sub-pipelines A and B remain on GPU. Sub-pipeline C uses the NPU variant.

Compute allocation:

Workload	Device
LVM (gvagenai)	GPU GT0
License plate detection + OCR	GPU GT0
Face detection (metro video)	NPU
Video decode (3 streams)	GPU GT1

Pipeline architecture:

flowchart LR
    Video1["Police Highway Video"]
    Video2["Metro Crowd Video"]
    
    Video1 -.-> V1A[filesrc]
    Video1 -.-> V1B[filesrc]
    Video2 -.-> V2[filesrc]
    
    V1A --> D1
    V1B --> D2
    V2 --> D3
    
    subgraph GT1["GPU GT1 Media Tile (Shared HW Decode)"]
        D1[decodebin3]
        D2[decodebin3]
        D3[decodebin3]
    end
    
    subgraph SubA["Sub-pipeline A: LVM Alerts"]
        D1 --> LVM["gvagenai<br/>device=GPU<br/>InternVL3_5-2B"] --> Q1[queue] --> F1[fakesink]
    end
    
    subgraph SubB["Sub-pipeline B: License Plate Detection + OCR"]
        D2 --> DET["gvadetect<br/>device=GPU<br/>YOLOv8 LP"] --> Q2[queue] --> OCR["gvaclassify<br/>device=GPU<br/>PP-OCRv4"] --> Q3[queue] --> F2[fakesink]
    end
    
    subgraph SubC["Sub-pipeline C: Face Detection"]
        D3 --> FACE["gvadetect<br/>device=NPU<br/>YOLOv8 Face"] --> Q4[queue] --> F3[fakesink]
    end
    
    style GT1 fill:#fff4e6,stroke:#ff9800,stroke-width:3px

Observed behavior:

GT0 is no longer shared with face detection — license plate throughput recovers to Case 1 levels
NPU handles face detection independently, in parallel with GPU inference, adding capacity without affecting the GPU workloads
GPU frequency is slightly lower than Case 1: the active NPU draws part of the shared SoC power budget — this is expected and does not negate the throughput advantage
GT1 utilization increases noticeably compared to Case 2: with the NPU processing frames quickly there is no backpressure on the decoder, which now sustains higher decode throughput
Total throughput across all three streams is the highest of all cases; package power is the lowest

Results Summary#

Metric	Case 1: MAIN/GPU	Case 2: MAIN+EXTRA/GPU	Case 3: MAIN+EXTRA/GPU+NPU
Extra workload	—	GPU	NPU
Total throughput	baseline	moderate gain	highest
LP detection throughput	baseline	significantly lower	fully recovered
Package power	highest	lower than Case 1	lowest
GT0 utilization	high	high	high
GT0 frequency	highest	lower	slightly lower than Case 1
GT1 utilization	low	moderate	high
NPU utilization	idle	idle	active
CPU utilization	low	higher	moderate

Key Takeaways#

The following observations from the test cases reflect the design principles described in the Workload Distribution Strategy section.

1. NPU enables extra compute capacity when GPU is loaded#

When the GPU is carrying a heavy workload, offloading a suitable task to the NPU adds throughput without competing with GPU inference. This demo shows the effect directly: face detection moved from GPU to NPU preserves the license plate pipeline at full speed while sustaining the additional stream. How much benefit you gain depends on how loaded the GPU already is and how well the offloaded workload fits the NPU.

2. GPU handles heavier workloads; NPU handles lighter continuous workloads#

In this demo the GPU runs a large vision model and a detect+classify chain (heavier compute), while the NPU runs a single detection model (lighter compute). This split works for this particular scenario. For different models or resolutions, the balance point shifts — measure your own workloads before deciding on a device assignment.

3. Parallel sub-pipelines decouple tasks with different processing rates#

By running the LVM alert pipeline and the license plate pipeline as separate sub-pipelines (each with its own decode path), the slow LVM cadence does not throttle the faster detection pipeline. This pattern applies whenever you have models with very different processing rates sharing the same video source.

4. GPU and NPU share the SoC power budget#

The GPU and NPU draw from the same SoC power budget, so activating the NPU is not free for the GPU. In Case 3 the GPU frequency is slightly lower than in Case 1 because the active NPU takes part of that budget. This is expected and does not negate the throughput advantage of offloading.

5. Hardware video decode scales with stream count; monitor GT1#

decodebin3 resolves to VA-API hardware decode on GT1, which is efficient and leaves GT0 free for inference. As stream count grows, GT1 utilization rises. High GT1 utilization can indicate a fully utilized, well-balanced system — or a decode bottleneck limiting downstream inference. Observing GT1 alongside GT0 helps distinguish these cases.

6. Optimize individual sub-pipelines with the DL Streamer Optimizer#

Each sub-pipeline can be tuned independently using the DL Streamer Optimizer. A standalone NPU sub-pipeline is a good starting point: its nireq and batch-size can be tuned to maximize NPU throughput without affecting the adjacent GPU workloads.

Notes and Caveats#

GPU tile utilization (GT0/GT1) is measured via gtidle/idle_residency_ms sysfs counters (xe driver). These counters reflect periods when the entire tile is clock-gated idle. Short scheduling gaps between independent pipelines appear as idle time, so reported utilization may be slightly lower than actual compute activity.
NPU frequency is read from sysfs immediately after pipeline completion, before the NPU returns to idle.
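
The GPU tile utilization described in the first note can be approximated by sampling the idle-residency counter twice and treating the remainder of the interval as busy time. The sketch below assumes a sysfs path of the form shown in `GT_IDLE_FILE`; the card index and tile/GT numbering vary between systems, so treat the path as a placeholder to verify locally. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: derive tile utilization from two samples of an idle-residency counter.
# GT_IDLE_FILE is an assumed path; check the card and GT numbering on your
# own system (for example gt1 instead of gt0 for the media tile).
GT_IDLE_FILE="${GT_IDLE_FILE:-/sys/class/drm/card0/device/tile0/gt0/gtidle/idle_residency_ms}"

# gt_util_percent IDLE_START_MS IDLE_END_MS WALL_MS
# Prints the busy share of the interval as an integer percentage.
gt_util_percent() {
    idle_delta=$(( $2 - $1 ))
    busy=$(( $3 - idle_delta ))
    # Clamp: timer skew can make the idle delta exceed the wall-clock window.
    if [ "$busy" -lt 0 ]; then
        busy=0
    fi
    echo $(( busy * 100 / $3 ))
}

# sample_gt_util SECONDS   (whole seconds only)
# Reads the counter, sleeps, reads it again, and prints the utilization.
# The sleep duration stands in for the wall-clock window, so the result
# is approximate.
sample_gt_util() {
    start=$(cat "$GT_IDLE_FILE")
    sleep "$1"
    end=$(cat "$GT_IDLE_FILE")
    gt_util_percent "$start" "$end" $(( $1 * 1000 ))
}
```

For example, `gt_util_percent 1000 1400 1000` prints 60: the tile was idle for 400 ms of a 1000 ms window. As the note says, short scheduling gaps count as idle time, so the figure may slightly understate actual compute activity.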
