# GPU and NPU Device Plugins Usage Guide ## Overview Intel GPU and NPU device plugins expose hardware accelerators to Kubernetes pods without requiring privileged containers. The plugins register devices with the kubelet via the Kubernetes Device Plugin framework, allowing pods to request accelerators through standard resource limits — `gpu.intel.com/xe` for GPU and `npu.intel.com/accel` for NPU. The Edge Node Infrastructure Blueprint pre-installs these plugins during first boot. This guide covers verification, pod scheduling, and common usage patterns. | Resource Name | Plugin | Hardware | Use Case | |---------------|--------|----------|----------| | `gpu.intel.com/xe` | Intel GPU device plugin | Intel Xe-based integrated/discrete GPU (including SR-IOV VFs) | Media processing, inference, rendering | | `npu.intel.com/accel` | Intel NPU device plugin | Intel NPU 2000/3000/4000 series | Low-power AI inference | --- ## Prerequisites - Edge Node Infrastructure Blueprint image deployed with Kubernetes (K3s) host type - Intel GPU and/or NPU hardware present - K3s running with device plugins installed --- ## Step 1: Verify Plugin Installation After first boot, confirm that the device plugin pods are running: ```bash # Check device plugin pods kubectl get pods -n intel-device-plugins ``` Check all pods across namespaces: ```bash sudo kubectl get pods -A ``` Expected healthy output includes the running Intel and Node Feature Discovery components: ```text intel-device-plugins intel-gpu-plugin-xxxxx 1/1 Running intel-device-plugins intel-npu-plugin-xxxxx 1/1 Running node-feature-discovery nfd-master-xxxxx 1/1 Running node-feature-discovery nfd-worker-xxxxx 1/1 Running ``` --- ## Step 2: Verify Allocatable Resources Confirm that the node advertises GPU and NPU resources: ```bash kubectl describe node | grep -A 20 "Allocatable:" ``` Expected output (example with SR-IOV VFs enabled): ```text Allocatable: cpu: xx ephemeral-storage: xxx gpu.intel.com/monitoring: x gpu.intel.com/xe: x memory: xxx npu.intel.com/accel: x pods: xxx ``` The `gpu.intel.com/xe` count reflects the number of SR-IOV Virtual Functions available for allocation. You can also verify the node labels applied by NFD: ```bash kubectl get nodes --show-labels | tr ',' '\n' | grep intel ``` --- ## Step 3: Run a Pod with GPU Access Create a pod that requests one Intel GPU device: ```yaml apiVersion: v1 kind: Pod metadata: name: gpu-test namespace: default spec: restartPolicy: Never containers: - name: gpu-check image: ubuntu:24.04 command: ["sh", "-c", "ls -la /dev/dri/ && echo 'GPU accessible'"] resources: limits: gpu.intel.com/xe: "1" ``` Apply and check output: ```bash kubectl apply -f gpu-test.yaml kubectl wait pod/gpu-test --for=jsonpath='{.status.phase}'=Succeeded --timeout=60s kubectl logs gpu-test ``` Expected output: ```text crw-rw---- 1 root render 226, 129 Jun 17 10:00 renderD129 GPU accessible ``` --- ## Step 4: Run a Pod with NPU Access Create a pod that requests the Intel NPU device: ```yaml apiVersion: v1 kind: Pod metadata: name: npu-test namespace: default spec: restartPolicy: Never containers: - name: npu-check image: ubuntu:24.04 command: ["sh", "-c", "ls -la /dev/accel/ && echo 'NPU accessible'"] resources: limits: npu.intel.com/accel: "1" ``` Apply and check output: ```bash kubectl apply -f npu-test.yaml kubectl wait pod/npu-test --for=jsonpath='{.status.phase}'=Succeeded --timeout=60s kubectl logs npu-test ``` Expected output: ```text crw-rw---- 1 root render 261, 0 Jun 17 10:00 accel0 NPU accessible ``` --- ## Step 5: Run a Pod with Both GPU and NPU Request both accelerators in a single pod for combined inference workloads: ```yaml apiVersion: v1 kind: Pod metadata: name: gpu-npu-test namespace: default spec: restartPolicy: Never containers: - name: accel-check image: ubuntu:24.04 command: ["sh", "-c", "ls /dev/dri/ && ls /dev/accel/ && echo 'Both accelerators accessible'"] resources: limits: gpu.intel.com/xe: "1" npu.intel.com/accel: "1" ``` --- ## Using GPU/NPU in Deployments For production workloads, use a Deployment to manage GPU or NPU pods: ```yaml apiVersion: apps/v1 kind: Deployment metadata: name: inference-app namespace: default spec: replicas: 2 selector: matchLabels: app: inference template: metadata: labels: app: inference spec: containers: - name: inference image: ubuntu:24.04 command: ["sh", "-c", "echo 'GPU device available:' && ls /dev/dri/ && sleep infinity"] resources: limits: gpu.intel.com/xe: "1" ``` The scheduler only places pods on nodes that have available GPU or NPU resources. If the node has 7 SR-IOV VFs, up to 7 pods can each receive one GPU device. --- ## Troubleshooting ### No GPU/NPU Resources on the Node 1. Check that the device plugin pods are running: ```bash kubectl get pods -n intel-device-plugins kubectl logs -n intel-device-plugins -l app=intel-gpu-plugin ``` 2. Verify the hardware is detected by the host: ```bash ls /dev/dri/ # GPU devices ls /dev/accel/ # NPU devices ``` 3. Verify NFD labels are applied: ```bash kubectl get nodes --show-labels | tr ',' '\n' | grep 'gpu.intel.com' kubectl get nodes --show-labels | tr ',' '\n' | grep 'npu.intel.com' ``` ### Pod Stuck in Pending State If a pod requesting GPU/NPU is stuck in `Pending`: ```bash kubectl describe pod ``` Common causes: - **Insufficient resources**: All GPU/NPU devices are already allocated to other pods - **No matching node**: The node does not have the requested hardware - **Plugin not running**: The device plugin pod crashed or was not installed ### GPU Plugin Shows 0 Devices If SR-IOV VFs are expected but not detected: ```bash cat /sys/bus/pci/devices/0000:00:02.0/sriov_numvfs ls /dev/dri/renderD* ``` If VFs are not created, the SR-IOV service may not have run. Check: ```bash sudo systemctl status intel-sriov-vf.service sudo journalctl -u intel-sriov-vf.service --no-pager ``` ### NPU Plugin Shows 0 Devices Verify the NPU driver is loaded: ```bash ls /sys/bus/pci/drivers/intel_vpu/ lsmod | grep intel_vpu ``` If not loaded: ```bash sudo modprobe intel_vpu ``` For kernel 6.17+, verify firmware: ```bash ls /lib/firmware/intel/vpu/ dmesg | grep -i vpu ``` --- ## References - [Intel Device Plugins for Kubernetes](https://github.com/intel/intel-device-plugins-for-kubernetes) - [Intel GPU Plugin Documentation](https://github.com/intel/intel-device-plugins-for-kubernetes/tree/main/cmd/gpu_plugin) - [Intel NPU Plugin Documentation](https://github.com/intel/intel-device-plugins-for-kubernetes/tree/main/cmd/npu_plugin) - [Kubernetes Device Plugin Framework](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/) - [Container Device Interface Guide](container-device-interface-guide.md) — for Docker/Podman usage without Kubernetes