Model Tutorials
Intel OpenVINO supports most TensorFlow and PyTorch models. The table below lists some deep learning models commonly used in Embodied Intelligence solutions, with information on how to run them on Intel platforms:
| Algorithm | Description | Link |
|-----------|-------------|------|
| YOLOv8 | CNN based object detection | |
| YOLOv12 | CNN based object detection | |
| MobileNetV2 | CNN based image classification | |
| SAM | Transformer based segmentation | |
| SAM2 | Extends SAM to video segmentation and object tracking with cross attention to memory | |
| FastSAM | Lightweight substitute for SAM | |
| MobileSAM | Lightweight substitute for SAM (same model architecture as SAM; refer to the OpenVINO SAM tutorials for model export and application) | |
| U-Net | CNN based segmentation and diffusion model | |
| DETR | Transformer based object detection | |
| GroundingDino | Transformer based object detection | |
| CLIP | Transformer based image classification | |
| Qwen2.5VL | Multimodal large language model | |
| Whisper | Automatic speech recognition | |
| FunASR | Automatic speech recognition | Refer to the FunASR Setup in the LLM Robotics sample pipeline |
Attention
When following these tutorials for model conversion, ensure that the OpenVINO version used for model conversion is the same as the runtime version used for inference. Otherwise, unexpected errors may occur, especially if the model was converted with a newer version than the runtime. See the Troubleshooting section for more details.
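The version match above can be checked programmatically before loading a converted model. The sketch below is a minimal, hypothetical helper (not part of the OpenVINO API) that compares the year.minor portion of two OpenVINO version strings; it assumes the release numbering used by recent OpenVINO packages, e.g. "2024.3.0-16041-…".

```python
def versions_match(convert_version: str, runtime_version: str) -> bool:
    """Return True when the year.minor release of both version strings agree.

    Hypothetical helper: OpenVINO version strings look like
    "2024.3.0-16041-1e3b88e4e3f-releases/2024/3"; only "2024.3" is compared.
    """
    def year_minor(version: str) -> tuple:
        # Drop the build suffix after "-", then keep the first two dot fields.
        return tuple(version.split("-")[0].split(".")[:2])

    return year_minor(convert_version) == year_minor(runtime_version)


# At inference time, the runtime version can be obtained from the installed
# package (recent releases expose openvino.get_version()), e.g.:
#   import openvino as ov
#   assert versions_match(recorded_conversion_version, ov.get_version())
```

Recording the conversion-time version alongside the exported IR files (for example, in a sidecar text file) makes this check cheap to run at deployment.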
Tutorials are also available for models covering imitation learning, grasp generation, simultaneous localization and mapping (SLAM), and bird's-eye view (BEV) perception:
Note
Before using these models, please ensure that you have read the AI Content Disclaimer.
- Action Chunking with Transformers - ACT
- Visual Servoing - CNS
- Diffusion Policy
- Improved 3D Diffusion Policy (iDP3)
- Feature Extraction Model: SuperPoint
- Feature Tracking Model: LightGlue
- Bird’s Eye View Perception: Fast-BEV
- Monocular Depth Estimation: Depth Anything V2
- Robotics Diffusion Transformer (RDT-1B)