# Optimizing π0.5 Vision–Language–Action Robotic Model on Intel Core Ultra Series 3 Processor

Asaad F Said, Radwan Ibrahim, Harshil Patel, Deepak S, Alex Turk, Amrutha Dhanakumar, Chon Ming Lee, Sergey Shumihin, Anand Bodas, Parual Datta, Deepa Suresh, Vladislav Sovrasov, Daniil Lyakhov, Daan Krol, Alfie Roddan, Samet Akcay, Ashutosh Kumar, Greeshma Pisharody


## Abstract

Vision–Language–Action (VLA) models unify perception, natural language understanding, and control in an end-to-end robotics framework. This enables generalization across tasks and environments with minimal task-specific customization compared with modular pipelines. Deploying VLA models at the edge is challenging as it requires high compute efficiency and substantial memory bandwidth for real-time inference. Intel® Core™ Ultra Series 3 (codenamed Panther Lake) addresses these requirements with a heterogeneous CPU–GPU–NPU architecture. It provides up to 180 TOPS of compute and up to 154 GB/s of memory bandwidth. In this work, we evaluate the edge deployment of the π0.5 VLA model, selected for its exceptional reliability on manipulation tasks in unfamiliar environments. We optimize π0.5 for Panther Lake. We then benchmark its performance against leading embedded AI platforms. We show up to 2.6× higher performance per watt than NVIDIA Jetson AGX Orin (64 GB) and up to 1.3× over NVIDIA® Jetson Thor™. These results position Panther Lake as a strong platform for real-time VLA inference in edge robotics.

## Demo

:::{video} https://docs.openedgeplatform.intel.com/shared_media/publication-optimize-pi0.5-demo.mp4
:playsinline:
:loop:
:width: 70%
:align: center
:::

## Quick Start

- Run π0.5 on Intel XPU:

:::::{tab-set}

::::{tab-item} **Installation**

```bash
pip install physicalai-train
```

::::

::::{tab-item} **Training**

```bash
# config.yaml
model:
  class_path: physicalai.policies.Pi05
  init_args:
    paligemma_variant: gemma_2b
    action_expert_variant: gemma_300m
    dtype: bfloat16
data:
  class_path: physicalai.data.lerobot.LeRobotDataModule
  init_args:
    repo_id: ""

# Train with config file
physicalai fit --config config.yaml
```

```python
from physicalai.data import LeRobotDataModule
from physicalai.policies import Pi05
from physicalai.train import Trainer

datamodule = LeRobotDataModule(repo_id="")
model = Pi05()
trainer = Trainer()
trainer.fit(model=model, datamodule=datamodule)
```

::::

::::{tab-item} **Export**

```bash
physicalai export \
  --ckpt_path "" \
  --backend "openvino" \
  --output_dir "pi05_ov"
```

```python
from physicalai.policies import Pi05

policy = Pi05(pretrained_name_or_path="")
policy.to_openvino(output_path="./pi05_ov")
```

::::

::::{tab-item} **Inference**

```bash
# runtime.yaml
runtime:
  class_path: physicalai.runtime.PolicyRuntime
  init_args:
    fps: 30
    robot:
      class_path: physicalai.robot.trossen.BimanualWidowXAI
      init_args:
        left:
          class_path: physicalai.robot.trossen.WidowXAI
          init_args:
            ip: "192.168.1.10"
            role: "follower"
        right:
          class_path: physicalai.robot.trossen.WidowXAI
          init_args:
            ip: "192.168.1.11"
            role: "follower"
    model:
      class_path: physicalai.inference.InferenceModel
      init_args:
        export_dir: ./exports/pi05_ov
    cameras:
      wrist:
        class_path: physicalai.capture.UVCCamera
        init_args:
          device: /dev/video0
          width: 640
          height: 480
      overhead:
        class_path: physicalai.capture.RealSenseCamera
        init_args:
          serial: "123456789"
    execution:
      class_path: physicalai.runtime.SyncExecution
      init_args:
        mode: chunk

# Run
physicalai run --config runtime.yaml --duration-s 60
```

```python
from physicalai.runtime import PolicyRuntime, SyncExecution
from physicalai.inference import InferenceModel
from physicalai.capture import UVCCamera, RealSenseCamera
from physicalai.robot.trossen import WidowXAI, BimanualWidowXAI

left = WidowXAI(ip="192.168.1.10", role="follower")
right = WidowXAI(ip="192.168.1.11", role="follower")
robot = BimanualWidowXAI(left=left, right=right)

runtime = PolicyRuntime(
    fps=30,
    robot=robot,
    model=InferenceModel.load("./pi05_ov", backend="openvino"),
    cameras={
        "wrist": UVCCamera(device="/dev/video0", width=640, height=480),
        "overhead": RealSenseCamera(serial="123456789"),
    },
    execution=SyncExecution(mode="chunk"),
)
runtime.run(duration_s=60)
```

::::

:::::

## Methods

We decompose π0.5 for heterogeneous execution across the Panther Lake compute hierarchy: the Vision Encoder (VE) and Language Model (LM) run on the iGPU, while the Action Expert (AE) runs on the NPU. The per-layer KV cache is the only cross-device handoff, carried via shared DDR with zero-copy USM-host tensors.

![π0.5 decomposition for heterogeneous execution](./_assets/optimizing-pi0.5-lva-model-figure1.png)

*Figure 1. π0.5 decomposition for heterogeneous execution: VE and LM on the iGPU, AE on the NPU, with the per-layer KV cache as the only cross-device handoff in shared DDR.*

## Results

We benchmark π0.5 (DROID variant, 3 cameras × 224×224, 64-token language context, 10 denoising steps, BF16/FP16) across three edge platforms at both Max TDP and ISO-TDP (40 W).

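The benchmark configuration fixes 10 denoising steps, which suggests that the action expert is evaluated 10 times per inference while the vision-language prefix, and hence the KV cache, is computed once. The toy NumPy sketch below illustrates that call pattern with a plain Euler integration loop. It is not the authors' implementation: `encode_prefix`, `action_expert`, `sample_actions`, the cache layout, the tensor shapes, and the time direction (noise at t = 1, actions at t = 0) are all illustrative assumptions, and the stand-in functions perform no real attention.

```python
import numpy as np

NUM_STEPS = 10  # denoising steps, as in the benchmark configuration


def encode_prefix(num_layers=2, num_tokens=8, dim=4, seed=0):
    """Hypothetical stand-in for VE + LM: runs once, returns a per-layer (K, V) cache."""
    rng = np.random.default_rng(seed)
    return [
        (rng.standard_normal((num_tokens, dim)), rng.standard_normal((num_tokens, dim)))
        for _ in range(num_layers)
    ]


def action_expert(x_t, t, kv_cache, target):
    """Hypothetical stand-in for the AE.

    A real action expert would attend to kv_cache; this toy only checks that the
    cache is present and returns the velocity of a straight path from noise to
    `target`, which a real model would have to predict without seeing `target`.
    """
    assert len(kv_cache) > 0
    return (x_t - target) / t


def sample_actions(kv_cache, target, num_steps=NUM_STEPS, seed=1):
    """Euler integration from t = 1 (noise) towards t = 0 (actions)."""
    rng = np.random.default_rng(seed)
    x_t = rng.standard_normal(target.shape)  # start from pure noise
    dt = -1.0 / num_steps
    calls = 0
    for i in range(num_steps):
        t = 1.0 + i * dt  # 1.0, 0.9, ..., 0.1 for ten steps (never reaches 0)
        v_t = action_expert(x_t, t, kv_cache, target)  # cache is read, not rebuilt
        x_t = x_t + dt * v_t
        calls += 1
    return x_t, calls


kv_cache = encode_prefix()  # computed once per inference
target = np.ones((4, 3))    # toy action chunk, shape chosen arbitrarily
actions, calls = sample_actions(kv_cache, target)
```

In this pattern the prefix cache is produced once and only read inside the loop, which is consistent with the cache being the single handoff between the iGPU and NPU stages described in Methods.
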
### Hardware Configurations

| Spec | NVIDIA Jetson AGX Orin 64GB | NVIDIA Jetson Thor T5000 | Intel® Core™ Ultra Series 3 (X7 358H) |
| --- | --- | --- | --- |
| TDP (W) | 15-60 | 40-130 | 15-65 |
| CPU | 12-core Arm® Cortex®-A78AE | 14-core Arm® Neoverse-V3AE | 16-core (4P + 8E + 4LP-E) |
| Memory (GB) | 64 GB 256-bit LPDDR5 @ 204.8 GB/s | 128 GB 256-bit LPDDR5 @ 273 GB/s | 32 GB (2x16 GB LPDDR5 8533 MT/s) |
| Storage (GB) | 64 GB eMMC 5.1 | 1 TB NVMe SSD | 1 TB NVMe SSD |
| Peak AI (INT8) | 275 TOPS (Sparse) | 1035 TOPS (Sparse) | 180 TOPS (Dense) |
| GPU TOPS (INT8-Dense) | 85 | 517 | 122 |
| Peak BW (GB/s) | 204 | 273 | 153 |
| Capacity (GB) | 64 | 128 | 128 |
| Operating System | Ubuntu 22.04.5 LTS | Ubuntu 24.04.3 LTS | Ubuntu 24.04.4 LTS |
| Kernel version | 5.15.148-tegra | 6.8.12-tegra | 6.17.0-14-generic |
| Jetpack version | 6.2.1 | 7.0 | N/A |
| CUDA version | 12.6 | 13.0 | N/A |
| OpenCL Compute Runtime version | N/A | N/A | 26.01.36711.4 |

### π0.5 Latency: PyTorch XPU vs CUDA

Using stock PyTorch 2.12, the Intel® Core™ Ultra Series 3 X7 358H processor at 40 W records 294 ms. That is within 2.4% of NVIDIA Jetson Thor at full power (287 ms), and 3.5× / 5.2× faster than NVIDIA Jetson AGX Orin at 60 W / 40 W, respectively.

![π0.5 model latency using PyTorch on Panther Lake](./_assets/optimizing-pi0.5-lva-model-figure2.png)

*Figure 2. π0.5 model latency using PyTorch on Core Ultra X7 358H, Jetson AGX Orin, and Jetson Thor.*

### Optimized Performance at Max TDP

With OpenVINO (Intel) and TensorRT (NVIDIA), each platform runs at its maximum power envelope: 65 W for Core Ultra X7 358H, 60 W for Jetson AGX Orin, and 130 W for Jetson Thor. Under these conditions, Core Ultra X7 358H achieves up to 1.5× lower latency than AGX Orin and up to 1.3× better performance per watt than both NVIDIA platforms.

![Competitive analysis of π0.5 on Panther Lake](./_assets/optimizing-pi0.5-lva-model-figure3.png)

*Figure 3. Competitive analysis of π0.5 on Core Ultra X7 358H, Jetson AGX Orin, and Jetson Thor at Max TDP.*

### Optimized Performance at ISO-TDP (40 W)

With all platforms capped at 40 W, Core Ultra X7 358H delivers the strongest results: up to 2.6× lower latency and 2.6× higher performance per watt versus Jetson AGX Orin, and 1.1× lower latency and 1.1× higher performance per watt versus Jetson Thor.

![Competitive analysis of π0.5 on Panther Lake](./_assets/optimizing-pi0.5-lva-model-figure4.png)

*Figure 4. Competitive analysis of π0.5 on Core Ultra X7 358H, Jetson AGX Orin, and Jetson Thor at 40 W TDP.*
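
The latency and performance-per-watt figures in this section are related by simple arithmetic. As a rough sketch, assume that performance means inferences per second (the reciprocal of latency) and that power equals the configured TDP cap; the source does not state how power was measured, so both are assumptions, and the helper names below are hypothetical. The only measured values used are the two PyTorch latencies quoted above; the ISO-power example uses placeholder latencies purely to show the identity.

```python
def perf_per_watt(latency_ms: float, power_w: float) -> float:
    """Inferences per second per watt, assuming power equals the TDP cap."""
    return (1000.0 / latency_ms) / power_w


def perf_per_watt_ratio(lat_a_ms, power_a_w, lat_b_ms, power_b_w):
    """How many times better platform A is than platform B in perf/W."""
    return perf_per_watt(lat_a_ms, power_a_w) / perf_per_watt(lat_b_ms, power_b_w)


# Relative latency gap between the two PyTorch results quoted in the text:
# Core Ultra X7 358H at 40 W (294 ms) vs Jetson Thor at full power (287 ms).
core_ultra_ms, thor_ms = 294.0, 287.0
gap = (core_ultra_ms - thor_ms) / thor_ms
gap_text = f"{gap:.1%}"  # formatted as a percentage with one decimal

# At equal power the watts cancel, so the perf/W ratio equals the latency ratio.
# Placeholder latencies (arbitrary units, not measurements): 1.0 vs 2.6.
iso_ratio = perf_per_watt_ratio(1.0, 40.0, 2.6, 40.0)
```

Under these assumptions, identical latency and performance-per-watt factors are expected whenever both platforms run at the same power cap, which matches the paired 2.6× and 1.1× figures reported at ISO-TDP.
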