Running NVIDIA’s V2X-I PointPillars Dense FP16 Model on Intel GPU
Purpose. This guide describes how to take a model trained with NVIDIA’s CUDA-V2XFusion reference design and deploy it on an Intel GPU via the intermediate-fusion deploy binary.
Audience. Customers who already hold a CUDA-V2XFusion-trained checkpoint (either NVIDIA’s provided reference model dense_epoch_100_.pth, or a checkpoint you produced yourself by following NVIDIA’s reference training flow) and want to run inference on an Intel platform without any retraining or C++ changes on the deploy side.
Scope. Weight conversion only: take a CUDA-V2XFusion .pth, produce the 4-ONNX + INT8 OpenVINO IR artifacts the Intel deploy binary expects, install them, and run the binary end-to-end. No retraining, no mmdet3d edits, no config edits. Pipeline A (split 4-ONNX) only.
1. Overview
dense_epoch_100_.pth  (NVIDIA reference model)
        │
        ▼
[FP32 export]          ─>  export/V2X-I/pp/
  export_all.py              camera.backbone.onnx   (~85 MB)
                             lidar_pfe.onnx         (~18 KB, dynamic V)
                             fuser.onnx             (~48 MB)
                             head.onnx              (~2.4 MB)
        │
        ▼
[Static V=7000 PFE]    ─>  export/V2X-I/pp/
  export-lidar.py            lidar_pfe_v7000.onnx   (~4.8 MB)
        │
        ▼
[INT8 PTQ (NNCF)]      ─>  export/V2X-I/pp/
  quantize_all.py            quantized_camera.{xml,bin}
                             quantized_lidar_pfe.{xml,bin}
                             quantized_fuser.{xml,bin}
                             quantized_head.{xml,bin}
        │
        ▼
[Copy to deploy tree]  ─>  edge-ai-suites/metro-ai-suite/sensor-fusion-for-traffic-management/intermediate-fusion/
                             deploy/data/v2xfusion/pointpillars/
        │
        ▼
[Run on Intel GPU]         cd deploy/build && ./bevfusion <dataset> --preset v2x --int8
The entire left column (export + quantize) happens inside NVIDIA’s bevfusion training repo after you apply the patch bundle this guide ships. The deploy binary already knows how to consume the files produced.
2. Prerequisites
2.1 Set up NVIDIA’s bevfusion training repo
Follow NVIDIA’s own instructions at Lidar_AI_Solution/CUDA-V2XFusion/README.md to:
Clone MIT BEVFusion at the commit NVIDIA pins.
Layer the BEVHeight and CUDA-V2XFusion patches on top as described in NVIDIA’s README.
Install the Python environment: Python 3.8, torch==1.11, mmcv, mmdet3d, torchpack, and the usual MIT BEVFusion dependencies.
Do not attempt to run training — you only need the Python environment and the configs.
2.2 Download NVIDIA’s reference checkpoint
Grab dense_epoch_100_.pth per NVIDIA’s CUDA-V2XFusion README. Note its absolute path — you will pass it to the export and quantize commands.
2.3 Add ONNX export + INT8 PTQ dependencies
In the same Python environment you set up in §2.1, install the extras used by the scripts shipped with this guide:
pip install "nncf>=2.13" "openvino>=2024.4" "onnx" "onnxsim"
That is everything. The export and quantize scripts reuse mmdet3d, mmcv, and torchpack that are already present from §2.1.
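Before moving on, it can help to confirm that the installed versions actually satisfy the minimums in the pip command above. The following is a minimal sketch, not part of the shipped scripts; `check_environment` and its helpers are hypothetical names, and the version parsing is deliberately simple (numeric dotted components only).

```python
from importlib import metadata

# Minimum versions taken from the pip command above.
MINIMUMS = {"nncf": "2.13", "openvino": "2024.4"}


def version_tuple(text):
    """Turn '2024.4.0' into (2024, 4, 0); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare dotted versions numerically, padding the shorter one with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums=MINIMUMS):
    """Return {package: problem} for anything missing or too old."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if not meets_minimum(installed, minimum):
            problems[name] = f"{installed} < {minimum}"
    return problems


if __name__ == "__main__":
    print(check_environment() or "nncf and openvino look OK")
```

Versions are compared as integer tuples rather than strings because a plain string comparison would rank "2.9" above "2.13".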
2.4 Clone the deploy repo and build it
Clone edge-ai-suites and follow its own documentation for the build:
deploy/README.md: top-level build instructions.
deploy/docs/Prerequisites.md: oneAPI + custom OpenVINO installation.
deploy/docs/GSG.md: full getting-started guide with build and run commands.
We deliberately do not duplicate those instructions here. Once you have a working deploy/build/bevfusion binary and its default dataset directory, come back to this guide.
3. Step 1 — Apply the patch bundle
From the root of your NVIDIA bevfusion clone (the directory that contains tools/, mmdet3d/, configs/), run:
cp -r /path/to/this/Guide/nvidia_ckpt_to_intel_gpu_patches/* .
That drops 12 files under export/pointpillars/ (see Appendix A for the exact list). No existing file is touched — this is a pure addition.
Sanity check:
ls export/pointpillars/
# expected:
# __init__.py _calib_data.py
# export_all.py export-camera.py export-lidar.py export-fuser.py export-head.py
# quantize_all.py quantize_camera_backbone.py quantize_lidar_pfe.py quantize_fuser.py quantize_head.py
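The same sanity check can be scripted so that a partially copied bundle is caught early. This is a minimal sketch; `missing_patch_files` is a hypothetical helper, and the file list is simply the 12 names from Appendix A.

```python
from pathlib import Path

# The 12 file names listed in Appendix A of this guide.
EXPECTED_FILES = [
    "__init__.py", "_calib_data.py",
    "export_all.py", "export-camera.py", "export-lidar.py",
    "export-fuser.py", "export-head.py",
    "quantize_all.py", "quantize_camera_backbone.py",
    "quantize_lidar_pfe.py", "quantize_fuser.py", "quantize_head.py",
]


def missing_patch_files(repo_root="."):
    """Return the expected files that are absent under export/pointpillars/."""
    patch_dir = Path(repo_root) / "export" / "pointpillars"
    return [name for name in EXPECTED_FILES if not (patch_dir / name).is_file()]


if __name__ == "__main__":
    # Run from the root of the NVIDIA bevfusion clone.
    missing = missing_patch_files()
    print("patch bundle complete" if not missing else f"missing: {missing}")
```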
4. Step 2 — FP32 ONNX export (4 sub-graphs)
The deploy binary splits the BEVFusion graph into four independently-loaded ONNX sub-graphs. Export all of them at once:
python export/pointpillars/export_all.py \
--config configs/V2X-I/det/centerhead/lssfpn/camera+pointpillar/resnet34/default.yaml \
--ckpt /path/to/dense_epoch_100_.pth \
--out-dir export/V2X-I/pp/
Expected tail output:
[cam-export] reference shapes: feat=(1, 80, 54, 96) depth=(1, 90, 54, 96)
[cam-export] saved to export/V2X-I/pp/camera.backbone.onnx
[lidar-export] saved to export/V2X-I/pp/lidar_pfe.onnx
[fuser-export] saved to export/V2X-I/pp/fuser.onnx
[head-export] saved to export/V2X-I/pp/head.onnx
SUMMARY
[OK] camera export/V2X-I/pp/camera.backbone.onnx (~88 MB)
[OK] lidar export/V2X-I/pp/lidar_pfe.onnx (~18 KB)
[OK] fuser export/V2X-I/pp/fuser.onnx (~50 MB)
[OK] head export/V2X-I/pp/head.onnx (~2.4 MB)
Benign warnings you will see and can ignore
missing keys in source state_dict: encoders.camera.vtransform.cx (cx is a non-learnable buffer used only for the BEV coordinate offset; the model’s __init__ fills in the default value. It is not present in NVIDIA’s ckpt and is harmless for inference.)
unexpected key in source state_dict: fc.weight, fc.bias (these come from the ResNet34 ImageNet-pretrained fc layer, which BEVFusion never uses.)
5. Step 3 — Export the static V=7000 PFE
The Intel deploy binary’s split pipeline hard-codes a maximum of 7000 voxels per frame (see deploy/src/pipeline/split_pipeline_config.cpp — default_int8_pfe_model and default_fp32_pfe_model both pin max_voxels=7000). You therefore need a second PFE ONNX with a fixed batch-voxel dimension of 7000 in addition to the dynamic-V version from Step 2:
python export/pointpillars/export-lidar.py \
--config configs/V2X-I/det/centerhead/lssfpn/camera+pointpillar/resnet34/default.yaml \
--ckpt /path/to/dense_epoch_100_.pth \
-o export/V2X-I/pp/lidar_pfe_v7000.onnx \
--fixed-v 7000 \
--split val
Expected tail output:
[lidar-export] tracing from cfg.data.val
[lidar-export] traced shapes: features=(5137, 100, 4) num_voxels=(5137,) coors=(5137, 4)
[lidar-export] wrapper vs pfe max-abs-diff = 0.000004
[lidar-export] exporting with FIXED V=7000 (measured dataset max V=6295, using safety margin)
[lidar-export] fixed-V sanity OK (no NaN), output (7000, 64)
[lidar-export] saved to export/V2X-I/pp/lidar_pfe_v7000.onnx
Important — do not drop --split val. The --split argument tells the tracer to pull a real frame from cfg.data.val, which determines the activation distribution the INT8 calibrator will see later. Using a trace frame from a mismatched dataset layout is a silent correctness bug.
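To make the consequence of a fixed voxel dimension concrete, here is a minimal sketch of feeding a variable-size frame into a model that only accepts exactly 7000 rows. It assumes short frames are zero-padded and oversized frames are rejected; this guide does not state how the deploy binary actually handles either case, and `pad_voxels` is a hypothetical helper.

```python
FIXED_V = 7000  # matches --fixed-v 7000 in the export command above


def pad_voxels(voxels, fixed_v=FIXED_V):
    """Zero-pad a list of per-voxel feature rows to exactly fixed_v rows.

    Assumption: padding rows are all-zero and as wide as the real rows.
    """
    if len(voxels) > fixed_v:
        raise ValueError(f"{len(voxels)} voxels exceed the fixed V={fixed_v}")
    width = len(voxels[0]) if voxels else 0
    padding = [[0.0] * width for _ in range(fixed_v - len(voxels))]
    return list(voxels) + padding


if __name__ == "__main__":
    # 5137 voxels, as in the traced frame above; 4 values per row for brevity.
    frame = [[1.0, 2.0, 3.0, 4.0] for _ in range(5137)]
    padded = pad_voxels(frame)
    print(len(padded), "rows,", FIXED_V - len(frame), "of them padding")
```

With the measured dataset maximum of 6295 voxels reported in the log above, every frame in that dataset fits under the 7000 limit.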
6. Step 4 — INT8 PTQ quantization
Calibrate and quantize the four ONNX models to INT8 OpenVINO IR:
python export/pointpillars/quantize_all.py \
--config configs/V2X-I/det/centerhead/lssfpn/camera+pointpillar/resnet34/default.yaml \
--ckpt /path/to/dense_epoch_100_.pth \
--onnx-dir export/V2X-I/pp/ \
--out-dir export/V2X-I/pp/ \
--num-samples 300
Run this with the same Python interpreter you used in Steps 2 and 3. The quantize scripts call into mmdet3d and torchpack to build real calibration samples, so they need the same mmdet3d-capable environment, not a separate NNCF-dedicated env.
Expected tail output:
SUMMARY
[OK] camera export/V2X-I/pp/quantized_camera.xml (~420 KB) + .bin (~22 MB)
[OK] lidar_pfe export/V2X-I/pp/quantized_lidar_pfe.xml (~73 KB) + .bin (~2.5 MB)
[OK] fuser export/V2X-I/pp/quantized_fuser.xml (~208 KB) + .bin (~12.5 MB)
[OK] head export/V2X-I/pp/quantized_head.xml (~216 KB) + .bin (~614 KB)
quantize_all.py auto-detects lidar_pfe_v7000.onnx in --onnx-dir and uses it in preference to the dynamic-V PFE, which is what the deploy binary expects for INT8.
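The preference order described above can be sketched as follows. This illustrates the stated behaviour only and is not the actual code in quantize_all.py; `pick_pfe_onnx` is a hypothetical name.

```python
from pathlib import Path


def pick_pfe_onnx(onnx_dir):
    """Prefer the static-V PFE; fall back to the dynamic-V one.

    Hypothetical helper mirroring the behaviour described in this guide.
    """
    onnx_dir = Path(onnx_dir)
    for name in ("lidar_pfe_v7000.onnx", "lidar_pfe.onnx"):
        candidate = onnx_dir / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no PFE ONNX found in {onnx_dir}")


if __name__ == "__main__":
    try:
        print("PFE model to quantize:", pick_pfe_onnx("export/V2X-I/pp"))
    except FileNotFoundError as err:
        print(err)
```

Running this before Step 4 shows which PFE file would be picked up, which is a quick way to notice that Step 3 was skipped.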
7. Step 5 — Install artifacts into the deploy tree
The deploy binary looks for its model files under deploy/data/v2xfusion/pointpillars/ by default for --preset v2x. Copy both the FP32 fallback ONNXs and the INT8 IRs into that directory:
DEPLOY_DIR=/path/to/edge-ai-suites/metro-ai-suite/sensor-fusion-for-traffic-management/intermediate-fusion/deploy/data/v2xfusion/pointpillars
mkdir -p "$DEPLOY_DIR"
# FP32 fallbacks
cp export/V2X-I/pp/camera.backbone.onnx "$DEPLOY_DIR/"
cp export/V2X-I/pp/lidar_pfe.onnx "$DEPLOY_DIR/"
cp export/V2X-I/pp/lidar_pfe_v7000.onnx "$DEPLOY_DIR/"
cp export/V2X-I/pp/fuser.onnx "$DEPLOY_DIR/"
cp export/V2X-I/pp/head.onnx "$DEPLOY_DIR/"
# INT8 IR pairs
cp export/V2X-I/pp/quantized_camera.xml "$DEPLOY_DIR/"
cp export/V2X-I/pp/quantized_camera.bin "$DEPLOY_DIR/"
cp export/V2X-I/pp/quantized_lidar_pfe.xml "$DEPLOY_DIR/"
cp export/V2X-I/pp/quantized_lidar_pfe.bin "$DEPLOY_DIR/"
cp export/V2X-I/pp/quantized_fuser.xml "$DEPLOY_DIR/"
cp export/V2X-I/pp/quantized_fuser.bin "$DEPLOY_DIR/"
cp export/V2X-I/pp/quantized_head.xml "$DEPLOY_DIR/"
cp export/V2X-I/pp/quantized_head.bin "$DEPLOY_DIR/"
If you want to keep multiple model variants side by side, you can put them under any directory and point the deploy binary at it explicitly with --model-dir (see Step 6).
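For repeated installs, or for keeping several variants in separate directories, the copy commands above can be wrapped in a small script. This is a minimal sketch; `install_artifacts` is a hypothetical helper, and the file names are the 13 artifacts copied above. It refuses to copy anything if the source directory is incomplete.

```python
import shutil
from pathlib import Path

FP32_ONNX = ["camera.backbone.onnx", "lidar_pfe.onnx", "lidar_pfe_v7000.onnx",
             "fuser.onnx", "head.onnx"]
INT8_STEMS = ["quantized_camera", "quantized_lidar_pfe",
              "quantized_fuser", "quantized_head"]


def artifact_names():
    """All 13 files: 5 FP32 ONNX models plus 4 INT8 .xml/.bin pairs."""
    names = list(FP32_ONNX)
    for stem in INT8_STEMS:
        names += [stem + ".xml", stem + ".bin"]
    return names


def install_artifacts(src_dir, deploy_dir):
    """Copy every artifact into deploy_dir; fail before copying if any is missing."""
    src_dir, deploy_dir = Path(src_dir), Path(deploy_dir)
    missing = [n for n in artifact_names() if not (src_dir / n).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {src_dir}: {missing}")
    deploy_dir.mkdir(parents=True, exist_ok=True)
    for name in artifact_names():
        shutil.copy2(src_dir / name, deploy_dir / name)
    return len(artifact_names())
```

A call such as `install_artifacts("export/V2X-I/pp", "/path/to/model_dir")` returns the number of files copied.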
8. Step 6 — Run the deploy binary
Source the oneAPI and OpenVINO environments exactly as the deploy repo’s own deploy/README.md / deploy/docs/GSG.md describe, then:
cd /path/to/edge-ai-suites/metro-ai-suite/sensor-fusion-for-traffic-management/intermediate-fusion/deploy/build
./bevfusion /path/to/v2x_dataset --preset v2x --int8 --num-samples 30 --vis --save-video --vis-dir ./viz
Key flags:
| Flag | Meaning |
|---|---|
| | V2X-I geometry, BEV grid 128×128, pc_range Y ∈ [-51.2, 51.2] |
| | Use all four |
| | Toggle INT8 stage-by-stage |
| | Override the default |
| | Process the first N frames |
| | Write KITTI-format per-frame box |
| | Write |
Refer to the deploy repo’s own deploy/docs/GSG.md for the authoritative full flag list and expected performance figures on the target GPU.
9. Troubleshooting
| Symptom | Cause / Fix |
|---|---|
| Warning | Benign. NVIDIA’s ckpt lacks this non-learnable buffer; the model default fills it. No action. |
| Warning | Benign. ResNet34 ImageNet-pretrained fc layer that BEVFusion doesn’t use. No action. |
| | You’re running the quantize scripts in a different Python env than Step 2/3. Use the same mmdet3d-capable environment for all three steps. |
| | Step 2.3 was skipped; install |
| PFE INT8 numerically collapsed (poor detections with | Re-run Step 3 with |
| Deploy binary silently runs FP32 even with | One of the |
| FP32 fuser used despite | Expected behavior. The deploy binary has a known B580-specific INT8 fuser fallback; the other three stages still run INT8. See the deploy repo’s own notes. |
| | The |
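Because each INT8 model is an .xml/.bin pair (Step 5), one quick thing to rule out when INT8 does not seem to take effect is an incomplete pair in the model directory. The following is a minimal preflight sketch; `incomplete_ir_pairs` is a hypothetical helper and does not reproduce any check the deploy binary itself performs.

```python
from pathlib import Path

# Stems of the four INT8 IR pairs produced in Step 4.
INT8_STEMS = ["quantized_camera", "quantized_lidar_pfe",
              "quantized_fuser", "quantized_head"]


def incomplete_ir_pairs(model_dir, stems=INT8_STEMS):
    """Return {stem: [missing file names]} for every pair that is not complete."""
    model_dir = Path(model_dir)
    report = {}
    for stem in stems:
        missing = [stem + ext for ext in (".xml", ".bin")
                   if not (model_dir / (stem + ext)).is_file()]
        if missing:
            report[stem] = missing
    return report


if __name__ == "__main__":
    # Relative to the deploy repo's intermediate-fusion directory.
    problems = incomplete_ir_pairs("deploy/data/v2xfusion/pointpillars")
    print(problems or "all four INT8 IR pairs are present")
```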
Appendix A — What the patch bundle adds
<nv_bevfusion_root>/
└── export/
    └── pointpillars/
        ├── __init__.py                  (empty, makes the folder a package)
        ├── _calib_data.py               (shared PyTorch-side calibration helper)
        ├── export_all.py                (FP32 export orchestrator)
        ├── export-camera.py             (ResNet34 backbone + LSS neck + depthnet → camera.backbone.onnx)
        ├── export-lidar.py              (PillarFeatureNet → lidar_pfe[,_v7000].onnx)
        ├── export-fuser.py              (ConvFuser + decoder → fuser.onnx)
        ├── export-head.py               (CenterHead → head.onnx, 12 output tensors)
        ├── quantize_all.py              (INT8 PTQ orchestrator, auto-picks v7000 PFE)
        ├── quantize_camera_backbone.py  (NNCF PTQ on camera.backbone.onnx)
        ├── quantize_lidar_pfe.py        (NNCF PTQ on lidar_pfe_v7000.onnx)
        ├── quantize_fuser.py            (NNCF PTQ on fuser.onnx)
        └── quantize_head.py             (NNCF PTQ on head.onnx)
Nothing under mmdet3d/, configs/, or tools/ is touched. The patch is purely additive.
Appendix B — Why NVIDIA’s ckpt works without any code/config changes
State dict shape: the NVIDIA checkpoint and a model built from configs/V2X-I/det/centerhead/lssfpn/camera+pointpillar/resnet34/default.yaml agree on every weight tensor’s shape. The only difference is the 3-element encoders.camera.vtransform.cx buffer, which is a non-learnable constant that the model constructor fills with the default.
Pipeline A does not touch LSS’s get_cam_feats(): export-camera.py exports backbone → neck → depthnet directly and does the per-pixel depth softmax inline, so any downstream use_bevpool branching in mmdet3d/models/vtransforms/lss.py is irrelevant.
ResNet34 pretrained URL vs local path: NVIDIA’s config references the remote pretrained URL; the only effect is where the ImageNet init comes from. Those weights are overwritten by the NVIDIA checkpoint anyway, so this mismatch is invisible at inference time.
strict=False: every export and quantize script loads the checkpoint with strict=False, so the cx missing-key and the ResNet34 fc.* extra-key warnings are just logs, not errors.
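The effect of a non-strict load can be illustrated without PyTorch by comparing key sets. This is a minimal sketch of the bookkeeping only: keys the model expects but the source lacks are reported as missing, and keys the source carries but the model has no slot for are reported as unexpected. `diff_state_dict_keys` is a hypothetical helper, and `heads.object.toy.weight` is a made-up key used purely for illustration.

```python
def diff_state_dict_keys(model_keys, source_keys):
    """Mimic what a non-strict load reports: (missing, unexpected)."""
    model_keys, source_keys = set(model_keys), set(source_keys)
    missing = sorted(model_keys - source_keys)      # expected by model, absent in source
    unexpected = sorted(source_keys - model_keys)   # present in source, unused by model
    return missing, unexpected


if __name__ == "__main__":
    # Toy key sets; only the cx and fc.* names come from the warnings in this guide.
    model = {"encoders.camera.vtransform.cx", "heads.object.toy.weight"}
    source = {"heads.object.toy.weight", "fc.weight", "fc.bias"}
    missing, unexpected = diff_state_dict_keys(model, source)
    print("missing keys:", missing)
    print("unexpected keys:", unexpected)
```

A strict load would treat either non-empty list as an error; with strict=False both are only reported, which is why the warnings in Step 2 do not stop the export.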