Model Tutorials
The OpenVINO™ toolkit supports most TensorFlow and PyTorch models. The following table lists deep-learning models commonly used in Embodied Intelligence solutions, with links to tutorials on running them on Intel® platforms:
| Algorithm | Description | Link |
|---|---|---|
| YOLOv8 | CNN-based object detection | https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/yolov8-optimization |
| YOLOv12 | CNN-based object detection | https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/yolov12-optimization |
| MobileNetV2 | CNN-based image classification | https://github.com/openvinotoolkit/open_model_zoo/blob/master/models/public/mobilenet-v2-1.0-224 |
| SAM | Transformer-based segmentation | https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/segment-anything |
| SAM2 | Extends SAM to video segmentation and object tracking, using cross-attention to a memory of previous frames | https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/sam2-image-segmentation |
| FastSAM | Lightweight substitute for SAM | https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/fast-segment-anything |
| MobileSAM | Lightweight substitute for SAM (same model architecture as SAM; see the OpenVINO SAM tutorial for model export and application) | https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/segment-anything |
| U-Net | CNN-based segmentation model, also used as a backbone in diffusion models | https://community.intel.com/t5/Blogs/Products-and-Solutions/Healthcare/Optimizing-Brain-Tumor-Segmentation-BTS-U-Net-model-using-Intel/post/1399037?wapkw=U-Net |
| DETR | Transformer-based object detection | https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/detr-resnet50 |
| GroundingDINO | Transformer-based object detection | https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/grounded-segment-anything |
| CLIP | Transformer-based image classification | https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/clip-zero-shot-image-classification |
| Qwen2.5-VL | Multimodal large language model | https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/qwen2.5-vl |
| Whisper | Automatic speech recognition | https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/whisper-asr-genai |
| FunASR | Automatic speech recognition | See the FunASR Setup section in the LLM Robotics sample pipeline |
Attention: When you follow these tutorials to convert a model, make sure the OpenVINO toolkit version used for conversion matches the runtime version used for inference. A mismatch can cause unexpected errors, especially when the model was converted with a newer version than the runtime. See the Troubleshooting section for details.
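The version check described above can be scripted. The following is a minimal sketch, not an official OpenVINO utility. It assumes that the converted IR `.xml` file records the converting toolkit version in an `rt_info/Runtime_version` element (an assumption about the IR layout that may not hold for every release), and that comparing the `YEAR.MAJOR` part of the version string is strict enough. The helper names `release_of`, `versions_match`, and `ir_version_from_xml` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional, Tuple


def release_of(version: str) -> Tuple[int, int]:
    """Return (year, major) from a version string such as '2024.0.0-14509-...'."""
    match = re.match(r"\s*(\d{4})\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized OpenVINO version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def versions_match(conversion_version: str, runtime_version: str) -> bool:
    """True when both strings refer to the same YEAR.MAJOR release."""
    return release_of(conversion_version) == release_of(runtime_version)


def ir_version_from_xml(xml_text: str) -> Optional[str]:
    """Read the version recorded in an IR .xml document, if present.

    Assumption: the version is stored at rt_info/Runtime_version[@value].
    Returns None when the element is missing.
    """
    root = ET.fromstring(xml_text)
    node = root.find("rt_info/Runtime_version")
    return None if node is None else node.get("value")


def installed_runtime_version() -> str:
    """Version string of the installed OpenVINO runtime (requires openvino)."""
    import openvino as ov  # imported lazily so the helpers above work without it

    return ov.get_version()
```

To check a converted model, read the text of its IR file (for example, a hypothetical `model.xml`), pass it to `ir_version_from_xml`, and compare the result against `installed_runtime_version()` with `versions_match` before loading the model for inference.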
Note: Before using these models, read the AI Content Disclaimer.
The following pages cover models for imitation learning, grasp generation, simultaneous localization and mapping (SLAM), and bird’s-eye view (BEV) perception:
- Action Chunking with Transformers - ACT
- Visual Servoing - CNS
- Diffusion Policy
- Improved 3D Diffusion Policy (iDP3)
- Feature Extraction Model: SuperPoint
- Feature Tracking Model: LightGlue
- Bird’s-Eye View Perception: Fast-BEV
- Monocular Depth Estimation: Depth Anything V2
- Robotics Diffusion Transformer (RDT-1B)