Running NVIDIA’s V2X-I PointPillars Dense FP16 Model on Intel GPU#

Purpose. This guide describes how to take a model trained with NVIDIA’s CUDA-V2XFusion reference design and deploy it on an Intel GPU using the intermediate-fusion deploy binary.

Audience. Customers who already hold a CUDA-V2XFusion-trained checkpoint (either NVIDIA’s provided reference model dense_epoch_100_.pth, or a checkpoint you produced yourself by following NVIDIA’s reference training flow) and want to run inference on an Intel platform without any retraining or C++ changes on the deploy side.

Scope. Weight conversion only: take a CUDA-V2XFusion .pth, produce the 4-ONNX + INT8 OpenVINO IR artifacts the Intel deploy binary expects, install them, and run the binary end-to-end. No retraining, no mmdet3d edits, no config edits. Pipeline A (split 4-ONNX) only.


1. Overview#

 dense_epoch_100_.pth  (NVIDIA reference model)
          │
          ▼
 [FP32 export]           ─> export/V2X-I/pp/
   export_all.py              camera.backbone.onnx   (~85 MB)
                              lidar_pfe.onnx         (~18 KB, dynamic V)
                              fuser.onnx             (~48 MB)
                              head.onnx              (~2.4 MB)
          │
          ▼
 [Static V=7000 PFE]     ─> export/V2X-I/pp/
   export-lidar.py            lidar_pfe_v7000.onnx   (~4.8 MB)
          │
          ▼
 [INT8 PTQ (NNCF)]       ─> export/V2X-I/pp/
   quantize_all.py            quantized_camera.{xml,bin}
                              quantized_lidar_pfe.{xml,bin}
                              quantized_fuser.{xml,bin}
                              quantized_head.{xml,bin}
          │
          ▼
 [Copy to deploy tree]   ─> edge-ai-suites/metro-ai-suite/sensor-fusion-for-traffic-management/intermediate-fusion/
                              deploy/data/v2xfusion/pointpillars/
          │
          ▼
 [Run on Intel GPU]      cd deploy/build && ./bevfusion <dataset> --preset v2x --int8

Every stage in the left column (export and quantize) runs inside NVIDIA’s bevfusion training repo after you apply the patch bundle shipped with this guide. The deploy binary already knows how to consume the resulting files.


2. Prerequisites#

2.1 Set up NVIDIA’s bevfusion training repo#

Follow NVIDIA’s own instructions at Lidar_AI_Solution/CUDA-V2XFusion/README.md to:

  1. Clone MIT BEVFusion at the commit NVIDIA pins.

  2. Layer the BEVHeight and CUDA-V2XFusion patches on top as described in NVIDIA’s README.

  3. Install the Python environment: Python 3.8, torch==1.11, mmcv, mmdet3d, torchpack, and the usual MIT BEVFusion dependencies.

Do not attempt to run training — you only need the Python environment and the configs.

2.2 Download NVIDIA’s reference checkpoint#

Grab dense_epoch_100_.pth per NVIDIA’s CUDA-V2XFusion README. Note its absolute path — you will pass it to the export and quantize commands.

2.3 Add ONNX export + INT8 PTQ dependencies#

In the same Python environment you set up in §2.1, install the extras used by the scripts shipped with this guide:

pip install "nncf>=2.13" "openvino>=2024.4" "onnx" "onnxsim"

That is everything. The export and quantize scripts reuse mmdet3d, mmcv, and torchpack that are already present from §2.1.
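
Before moving on, it can help to confirm that the four extras actually landed in the environment from §2.1. The following is a minimal sketch, not part of the patch bundle: the helper names (version_tuple, meets_minimum, check_environment) are hypothetical, and the minimum versions are simply copied from the pip command above. It uses only the standard library so it runs on the Python 3.8 environment.

```python
from importlib import metadata  # standard library on Python 3.8+
from typing import Dict, Optional, Tuple

# Minimums copied from the pip command above; None means "any version".
REQUIRED = {"nncf": "2.13", "openvino": "2024.4", "onnx": None, "onnxsim": None}


def version_tuple(text: str) -> Tuple[int, ...]:
    """Turn '2024.4.0-123' into (2024, 4, 0), keeping only leading digits."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: Optional[str]) -> bool:
    """Compare two dotted versions after padding them to the same length."""
    if minimum is None:
        return True
    have, need = version_tuple(installed), version_tuple(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def check_environment(required: Dict[str, Optional[str]] = REQUIRED) -> Dict[str, str]:
    """Return 'ok', 'missing', or 'too old (<version>)' for each package."""
    report = {}
    for name, minimum in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = "too old ({})".format(installed)
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print("{:10s} {}".format(package, status))
```

Run it with the same interpreter you will use for Steps 2 to 4; any line other than "ok" points back to the pip command above.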

2.4 Clone the deploy repo and build it#

Clone edge-ai-suites and follow its own documentation for the build:

  • deploy/README.md — top-level build instructions.

  • deploy/docs/Prerequisites.md — oneAPI + custom OpenVINO installation.

  • deploy/docs/GSG.md — full getting-started guide with build and run commands.

We deliberately do not duplicate those instructions here. Once you have a working deploy/build/bevfusion binary and its default dataset directory, come back to this guide.


3. Step 1 — Apply the patch bundle#

From the root of your NVIDIA bevfusion clone (the directory that contains tools/, mmdet3d/, configs/), run:

cp -r /path/to/this/Guide/nvidia_ckpt_to_intel_gpu_patches/* .

That drops 12 files under export/pointpillars/ (see Appendix A for the exact list). No existing file is touched — this is a pure addition.

Sanity check:

ls export/pointpillars/
# expected:
#   __init__.py  _calib_data.py
#   export_all.py  export-camera.py  export-lidar.py  export-fuser.py  export-head.py
#   quantize_all.py  quantize_camera_backbone.py  quantize_lidar_pfe.py  quantize_fuser.py  quantize_head.py
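
The same sanity check can be scripted. This is a small sketch under the assumption that the 12 file names are exactly those listed above; the function name missing_patch_files is hypothetical and is not something the patch bundle provides.

```python
from pathlib import Path
from typing import List

# The 12 files the patch bundle is expected to add (names from the listing above).
EXPECTED_FILES = [
    "__init__.py",
    "_calib_data.py",
    "export_all.py",
    "export-camera.py",
    "export-lidar.py",
    "export-fuser.py",
    "export-head.py",
    "quantize_all.py",
    "quantize_camera_backbone.py",
    "quantize_lidar_pfe.py",
    "quantize_fuser.py",
    "quantize_head.py",
]


def missing_patch_files(repo_root: str) -> List[str]:
    """Return the expected files that are absent under export/pointpillars/."""
    folder = Path(repo_root) / "export" / "pointpillars"
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


if __name__ == "__main__":
    missing = missing_patch_files(".")
    print("patch bundle complete" if not missing else "missing: " + ", ".join(missing))
```

Run it from the root of the NVIDIA bevfusion clone; an empty result means the copy succeeded.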

4. Step 2 — FP32 ONNX export (4 sub-graphs)#

The deploy binary loads the BEVFusion graph as four independent ONNX sub-graphs. Export all of them at once:

python export/pointpillars/export_all.py \
  --config configs/V2X-I/det/centerhead/lssfpn/camera+pointpillar/resnet34/default.yaml \
  --ckpt /path/to/dense_epoch_100_.pth \
  --out-dir export/V2X-I/pp/

Expected tail output:

[cam-export]   reference shapes: feat=(1, 80, 54, 96) depth=(1, 90, 54, 96)
[cam-export]   saved to export/V2X-I/pp/camera.backbone.onnx
[lidar-export] saved to export/V2X-I/pp/lidar_pfe.onnx
[fuser-export] saved to export/V2X-I/pp/fuser.onnx
[head-export]  saved to export/V2X-I/pp/head.onnx

SUMMARY
  [OK] camera   export/V2X-I/pp/camera.backbone.onnx  (~88 MB)
  [OK] lidar    export/V2X-I/pp/lidar_pfe.onnx       (~18 KB)
  [OK] fuser    export/V2X-I/pp/fuser.onnx           (~50 MB)
  [OK] head     export/V2X-I/pp/head.onnx            (~2.4 MB)

Benign warnings you will see and can ignore#

  • missing keys in source state_dict: encoders.camera.vtransform.cx: cx is a non-learnable buffer used only for the BEV coordinate offset; the model’s __init__ fills in the default value. It is not present in NVIDIA’s checkpoint and is harmless for inference.

  • unexpected key in source state_dict: fc.weight, fc.bias — these come from the ResNet34 ImageNet-pretrained fc layer that BEVFusion never uses.


5. Step 3 — Export the static V=7000 PFE#

The Intel deploy binary’s split pipeline hard-codes a maximum of 7000 voxels per frame (see deploy/src/pipeline/split_pipeline_config.cpp: default_int8_pfe_model and default_fp32_pfe_model both pin max_voxels=7000). You therefore need a second PFE ONNX with a fixed voxel dimension of 7000 in addition to the dynamic-V version from Step 2:

python export/pointpillars/export-lidar.py \
  --config configs/V2X-I/det/centerhead/lssfpn/camera+pointpillar/resnet34/default.yaml \
  --ckpt /path/to/dense_epoch_100_.pth \
  -o export/V2X-I/pp/lidar_pfe_v7000.onnx \
  --fixed-v 7000 \
  --split val

Expected tail output:

[lidar-export] tracing from cfg.data.val
[lidar-export] traced shapes: features=(5137, 100, 4) num_voxels=(5137,) coors=(5137, 4)
[lidar-export] wrapper vs pfe max-abs-diff = 0.000004
[lidar-export] exporting with FIXED V=7000 (measured dataset max V=6295, using safety margin)
[lidar-export] fixed-V sanity OK (no NaN), output (7000, 64)
[lidar-export] saved to export/V2X-I/pp/lidar_pfe_v7000.onnx

Important — do not drop --split val. The --split argument tells the tracer to pull a real frame from cfg.data.val, which determines the activation distribution the INT8 calibrator will see later. Using a trace frame from a mismatched dataset layout is a silent correctness bug.
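
If you export from your own dataset rather than NVIDIA's reference data, it is worth checking that the fixed V=7000 still covers your frames. The sketch below is an illustration only: fixed_v_headroom is a hypothetical helper, and you would have to collect the per-frame voxel counts yourself. In the example, 5137 and 6295 come from the log above, while 4800 is a made-up value.

```python
from typing import Iterable, Tuple


def fixed_v_headroom(voxel_counts: Iterable[int], fixed_v: int = 7000) -> Tuple[int, int]:
    """Return (largest voxel count, remaining headroom under fixed_v).

    A negative headroom means at least one frame has more voxels than the
    fixed dimension allows.
    """
    max_v = max(voxel_counts)
    return max_v, fixed_v - max_v


if __name__ == "__main__":
    # 5137 and 6295 appear in the export log above; 4800 is illustrative.
    max_v, headroom = fixed_v_headroom([5137, 6295, 4800])
    print("max V = {}, headroom = {} voxels".format(max_v, headroom))
```

How the deploy binary treats frames that exceed 7000 voxels is not covered by this guide, so a negative headroom should be investigated rather than ignored.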


6. Step 4 — INT8 PTQ quantization#

Calibrate and quantize the four ONNX models to INT8 OpenVINO IR:

python export/pointpillars/quantize_all.py \
  --config configs/V2X-I/det/centerhead/lssfpn/camera+pointpillar/resnet34/default.yaml \
  --ckpt /path/to/dense_epoch_100_.pth \
  --onnx-dir export/V2X-I/pp/ \
  --out-dir  export/V2X-I/pp/ \
  --num-samples 300

Run this with the same Python interpreter you used in Steps 2 and 3. The quantize scripts call into mmdet3d and torchpack to build real calibration samples, so they need the same mmdet3d-capable environment, not a separate NNCF-dedicated env.

Expected tail output:

SUMMARY
  [OK] camera     export/V2X-I/pp/quantized_camera.xml      (~420 KB) + .bin (~22 MB)
  [OK] lidar_pfe  export/V2X-I/pp/quantized_lidar_pfe.xml   (~73 KB)  + .bin (~2.5 MB)
  [OK] fuser      export/V2X-I/pp/quantized_fuser.xml       (~208 KB) + .bin (~12.5 MB)
  [OK] head       export/V2X-I/pp/quantized_head.xml        (~216 KB) + .bin (~614 KB)

quantize_all.py auto-detects lidar_pfe_v7000.onnx in --onnx-dir and uses it in preference to the dynamic-V PFE, which is what the deploy binary expects for INT8.
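
The preference order described above can be pictured as follows. This is a sketch of the behavior as documented here, not the actual code of quantize_all.py; the function name pick_pfe_onnx is hypothetical.

```python
from pathlib import Path


def pick_pfe_onnx(onnx_dir: str) -> Path:
    """Prefer the static V=7000 PFE and fall back to the dynamic-V one."""
    folder = Path(onnx_dir)
    for name in ("lidar_pfe_v7000.onnx", "lidar_pfe.onnx"):
        candidate = folder / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError("no PFE ONNX found in {}".format(folder))


if __name__ == "__main__":
    try:
        print(pick_pfe_onnx("export/V2X-I/pp/"))
    except FileNotFoundError as err:
        print(err)
```

If this prints lidar_pfe.onnx, Step 3 has not been run yet or wrote its output to a different directory.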


7. Step 5 — Install artifacts into the deploy tree#

The deploy binary looks for its model files under deploy/data/v2xfusion/pointpillars/ by default for --preset v2x. Copy both the FP32 fallback ONNXs and the INT8 IRs into that directory:

DEPLOY_DIR=/path/to/edge-ai-suites/metro-ai-suite/sensor-fusion-for-traffic-management/intermediate-fusion/deploy/data/v2xfusion/pointpillars

mkdir -p "$DEPLOY_DIR"

# FP32 fallbacks
cp export/V2X-I/pp/camera.backbone.onnx    "$DEPLOY_DIR/"
cp export/V2X-I/pp/lidar_pfe.onnx          "$DEPLOY_DIR/"
cp export/V2X-I/pp/lidar_pfe_v7000.onnx    "$DEPLOY_DIR/"
cp export/V2X-I/pp/fuser.onnx              "$DEPLOY_DIR/"
cp export/V2X-I/pp/head.onnx               "$DEPLOY_DIR/"

# INT8 IR pairs
cp export/V2X-I/pp/quantized_camera.xml    "$DEPLOY_DIR/"
cp export/V2X-I/pp/quantized_camera.bin    "$DEPLOY_DIR/"
cp export/V2X-I/pp/quantized_lidar_pfe.xml "$DEPLOY_DIR/"
cp export/V2X-I/pp/quantized_lidar_pfe.bin "$DEPLOY_DIR/"
cp export/V2X-I/pp/quantized_fuser.xml     "$DEPLOY_DIR/"
cp export/V2X-I/pp/quantized_fuser.bin     "$DEPLOY_DIR/"
cp export/V2X-I/pp/quantized_head.xml      "$DEPLOY_DIR/"
cp export/V2X-I/pp/quantized_head.bin      "$DEPLOY_DIR/"

If you want to keep multiple model variants side by side, you can put them under any directory and point the deploy binary at it explicitly with --model-dir (see Step 6).
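
As an alternative to the individual cp commands, the copy can be scripted so that it refuses to run when an artifact is missing. This is a minimal sketch assuming the file names used in this guide; install_artifacts is a hypothetical helper, not part of either repo.

```python
import shutil
from pathlib import Path
from typing import List

FP32_ONNX = [
    "camera.backbone.onnx",
    "lidar_pfe.onnx",
    "lidar_pfe_v7000.onnx",
    "fuser.onnx",
    "head.onnx",
]
INT8_STEMS = ["quantized_camera", "quantized_lidar_pfe", "quantized_fuser", "quantized_head"]


def install_artifacts(src_dir: str, deploy_dir: str) -> List[str]:
    """Copy the 5 ONNX files and 4 IR pairs; fail before copying if any is absent."""
    src, dst = Path(src_dir), Path(deploy_dir)
    names = FP32_ONNX + [stem + ext for stem in INT8_STEMS for ext in (".xml", ".bin")]
    missing = [name for name in names if not (src / name).is_file()]
    if missing:
        raise FileNotFoundError("missing in {}: {}".format(src, ", ".join(missing)))
    dst.mkdir(parents=True, exist_ok=True)
    for name in names:
        shutil.copy2(src / name, dst / name)
    return names
```

For example, install_artifacts("export/V2X-I/pp", "/path/to/.../deploy/data/v2xfusion/pointpillars") copies 13 files, with the destination being the same DEPLOY_DIR as above. Checking before copying avoids a half-populated model directory.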


8. Step 6 — Run the deploy binary#

Source the oneAPI and OpenVINO environments exactly as the deploy repo’s own deploy/README.md / deploy/docs/GSG.md describe, then:

cd /path/to/edge-ai-suites/metro-ai-suite/sensor-fusion-for-traffic-management/intermediate-fusion/deploy/build
./bevfusion /path/to/v2x_dataset --preset v2x --int8 --num-samples 30 --vis --save-video --vis-dir ./viz

Key flags:

| Flag | Meaning |
| --- | --- |
| --preset v2x | V2X-I geometry, BEV grid 128×128, pc_range Y ∈ [-51.2, 51.2] |
| --int8 | Use all four quantized_*.xml IRs (falls back to FP32 ONNX per stage if a file is missing) |
| --int8-camera / --int8-pfe / --int8-fuser / --int8-head | Toggle INT8 stage-by-stage |
| --model-dir DIR | Override the default data/v2xfusion/pointpillars/ location |
| --num-samples N | Process the first N frames |
| --dump-pred --pred-dir DIR | Write KITTI-format per-frame box .txt files |
| --vis --save-video --vis-dir DIR | Write bevfusion.mp4 and optional per-frame PNGs |

Refer to the deploy repo’s own deploy/docs/GSG.md for the authoritative full flag list and expected performance figures on the target GPU.
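
If you launch the binary from a script, for instance to compare several model directories, the command line can be assembled programmatically. The sketch below only uses flags named in this guide; build_bevfusion_cmd and its parameter names are hypothetical.

```python
from typing import List, Optional


def build_bevfusion_cmd(
    dataset: str,
    int8: bool = True,
    num_samples: Optional[int] = None,
    model_dir: Optional[str] = None,
    pred_dir: Optional[str] = None,
    vis_dir: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for ./bevfusion with the V2X preset."""
    cmd = ["./bevfusion", dataset, "--preset", "v2x"]
    if int8:
        cmd.append("--int8")
    if model_dir:
        cmd += ["--model-dir", model_dir]
    if num_samples is not None:
        cmd += ["--num-samples", str(num_samples)]
    if pred_dir:
        cmd += ["--dump-pred", "--pred-dir", pred_dir]
    if vis_dir:
        cmd += ["--vis", "--save-video", "--vis-dir", vis_dir]
    return cmd


if __name__ == "__main__":
    # Reproduces the example command from this step.
    print(" ".join(build_bevfusion_cmd("/path/to/v2x_dataset", num_samples=30, vis_dir="./viz")))
```

The resulting list can be passed to subprocess.run(cmd, cwd="/path/to/.../deploy/build", check=True) once the oneAPI and OpenVINO environments have been sourced in the calling shell.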


9. Troubleshooting#

| Symptom | Cause / Fix |
| --- | --- |
| Warning missing keys in source state_dict: encoders.camera.vtransform.cx | Benign. NVIDIA’s ckpt lacks this non-learnable buffer; the model default fills it. No action. |
| Warning unexpected key in source state_dict: fc.weight, fc.bias | Benign. ResNet34 ImageNet-pretrained fc layer that BEVFusion doesn’t use. No action. |
| ModuleNotFoundError: No module named 'torchpack' during Step 4 | You’re running the quantize scripts in a different Python env than Steps 2 and 3. Use the same mmdet3d-capable environment for all three steps. |
| ModuleNotFoundError: No module named 'nncf' | §2.3 was skipped. Install nncf, openvino, onnx, and onnxsim into the env you are using. |
| PFE INT8 numerically collapsed (poor detections with --int8-pfe) | Re-run Step 3 with --split val, then repeat Step 4 so the INT8 IR is regenerated from the new PFE ONNX. Using a mismatched trace source produces wrong activation scales and the calibrator bakes them in. |
| Deploy binary silently runs FP32 even with --int8 | One of the quantized_*.xml / .bin files is missing in the deploy model directory. Re-check Step 5. The deploy binary falls back to FP32 per stage when the INT8 IR is absent. |
| FP32 fuser used despite --int8 on Intel Arc B580 (Battlemage) | Expected behavior. The deploy binary has a known B580-specific INT8 fuser fallback; the other three stages still run INT8. See the deploy repo’s own notes. |
| onnx.checker failure on camera.backbone.onnx | The onnxsim simplification step during Step 2 may have failed silently if you’re on a very old onnxsim. Upgrade: pip install -U onnxsim onnx. |
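
For the silent FP32 fallback case, a quick check of the model directory shows which stages have a complete INT8 IR pair. This sketch predicts precision from file presence only; it is not the deploy binary's own logic and does not model the B580 fuser fallback. The helper name expected_precision is hypothetical, and the stage keys mirror the --int8-camera / --int8-pfe / --int8-fuser / --int8-head flags.

```python
from pathlib import Path
from typing import Dict

# Stage name -> IR file stem, using the file names from Step 4.
STAGE_IR = {
    "camera": "quantized_camera",
    "pfe": "quantized_lidar_pfe",
    "fuser": "quantized_fuser",
    "head": "quantized_head",
}


def expected_precision(model_dir: str) -> Dict[str, str]:
    """Report 'INT8' when both .xml and .bin exist, else 'FP32 fallback'."""
    folder = Path(model_dir)
    result = {}
    for stage, stem in STAGE_IR.items():
        has_pair = (folder / (stem + ".xml")).is_file() and (folder / (stem + ".bin")).is_file()
        result[stage] = "INT8" if has_pair else "FP32 fallback"
    return result


if __name__ == "__main__":
    for stage, precision in expected_precision("data/v2xfusion/pointpillars").items():
        print("{:7s} {}".format(stage, precision))
```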


Appendix A — What the patch bundle adds#

<nv_bevfusion_root>/
└── export/
    └── pointpillars/
        ├── __init__.py                     (empty, makes the folder a package)
        ├── _calib_data.py                  (shared PyTorch-side calibration helper)
        ├── export_all.py                   (FP32 export orchestrator)
        ├── export-camera.py                (ResNet34 backbone + LSS neck + depthnet → camera.backbone.onnx)
        ├── export-lidar.py                 (PillarFeatureNet → lidar_pfe[,_v7000].onnx)
        ├── export-fuser.py                 (ConvFuser + decoder → fuser.onnx)
        ├── export-head.py                  (CenterHead → head.onnx, 12 output tensors)
        ├── quantize_all.py                 (INT8 PTQ orchestrator, auto-picks v7000 PFE)
        ├── quantize_camera_backbone.py     (NNCF PTQ on camera.backbone.onnx)
        ├── quantize_lidar_pfe.py           (NNCF PTQ on lidar_pfe_v7000.onnx)
        ├── quantize_fuser.py               (NNCF PTQ on fuser.onnx)
        └── quantize_head.py                (NNCF PTQ on head.onnx)

Nothing under mmdet3d/, configs/, or tools/ is touched. The patch is purely additive.

Appendix B — Why NVIDIA’s ckpt works without any code/config changes#

  • State dict shape — the NVIDIA checkpoint and a model built from configs/V2X-I/det/centerhead/lssfpn/camera+pointpillar/resnet34/default.yaml agree on every weight tensor’s shape. The only difference is the 3-element encoders.camera.vtransform.cx buffer, which is a non-learnable constant that the model constructor fills with the default.

  • Pipeline A does not touch LSS’s get_cam_feats(): export-camera.py exports the backbone, neck, and depthnet directly and does the per-pixel depth softmax inline, so any downstream use_bevpool branching in mmdet3d/models/vtransforms/lss.py is irrelevant.

  • ResNet34 pretrained URL vs local path — NVIDIA’s config references the remote pretrained URL; the only effect is where the ImageNet init comes from. Those weights are overwritten by the NVIDIA checkpoint anyway, so this mismatch is invisible at inference time.

  • strict=False — every export and quantize script loads the checkpoint with strict=False, so the cx missing-key and the ResNet34 fc.* extra-keys warnings are just logs, not errors.
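
To make the strict=False point concrete: a non-strict load compares the key sets of the model and the checkpoint and reports the differences instead of raising. The sketch below reproduces only that key comparison in plain Python, without torch. The function name diff_state_dict_keys and the toy key lists are illustrative; only the cx and fc.* names come from the warnings discussed above.

```python
from typing import Iterable, List, Tuple


def diff_state_dict_keys(
    model_keys: Iterable[str], ckpt_keys: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) keys, each sorted alphabetically.

    missing: keys the model expects that the checkpoint lacks.
    unexpected: keys in the checkpoint that the model does not use.
    """
    model_set, ckpt_set = set(model_keys), set(ckpt_keys)
    return sorted(model_set - ckpt_set), sorted(ckpt_set - model_set)


if __name__ == "__main__":
    # Toy key lists; a real state_dict has many more entries.
    model = ["encoders.camera.vtransform.cx", "heads.object.weight"]
    ckpt = ["heads.object.weight", "fc.weight", "fc.bias"]
    missing, unexpected = diff_state_dict_keys(model, ckpt)
    print("missing keys:", missing)
    print("unexpected keys:", unexpected)
```

With a strict load, either list being non-empty would be an error; with strict=False both are only logged, which is why the two warnings in Step 2 are harmless.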