# Transformer Models

This article explains how to prepare models based on the [Hugging Face](https://huggingface.co/welcome) [`transformers`](https://github.com/huggingface/transformers) library for integration with the Deep Learning Streamer pipeline.

Many transformer-based models can be converted to OpenVINO™ IR format using [optimum-cli](https://huggingface.co/docs/optimum-intel/en/openvino/export). DL Streamer supports selected Hugging Face architectures for tasks such as image classification, object detection, audio transcription, and more. See the [Supported Models](https://docs.openedgeplatform.intel.com/dev/edge-ai-libraries/dlstreamer/supported_models.html) table for details.

> **NOTE:** The instructions below are comprehensive, but for convenience, we recommend using the
> [download_hf_models.py](https://github.com/open-edge-platform/dlstreamer/blob/main/scripts/download_models/download_hf_models.py)
> script. It can download a model from the Hugging Face Hub and perform the required conversions automatically.
> See [Model Conversion Scripts](https://github.com/open-edge-platform/dlstreamer/blob/main/scripts/download_models/README.md) for more information.

## Optimum-Intel Supported Models

The list available [here](https://huggingface.co/docs/optimum-intel/en/openvino/models) includes models that can be converted to IR format with a single `optimum-cli` command. If a model architecture is [supported by DL Streamer](https://docs.openedgeplatform.intel.com/dev/edge-ai-libraries/dlstreamer/supported_models.html#supported-architectures), it can typically be prepared as follows:

```bash
optimum-cli export openvino --model provider_id/model_id --weight-format=int8 output_path
```

The directory specified by `output_path` contains all files required to use the model with DL Streamer elements such as `gvaclassify` or `gvagenai`. No further modifications are required. Some Visual Language Models (VLMs) may require additional `optimum-cli` options; see the [OpenVINO™ documentation](https://openvinotoolkit.github.io/openvino.genai/docs/supported-models/#visual-language-models-vlms) for details.

## RT-DETR and RT-DETRv2 Models

Hugging Face models based on the `RTDetrForObjectDetection` and `RtDetrV2ForObjectDetection` architectures must first be exported to ONNX format using [Optimum-ONNX](https://huggingface.co/docs/optimum-onnx/onnx/usage_guides/export_a_model). For example:

```bash
optimum-cli export onnx --model PekingU/rtdetr_v2_r18vd --task object-detection --opset 18 --width 640 --height 640 ./out/rtdetr_v2_r18vd_onnx
```

This command creates the `./out/rtdetr_v2_r18vd_onnx/model.onnx` file, which can then be converted to IR format using the OpenVINO [ovc tool](https://docs.openvino.ai/2026/openvino-workflow/model-preparation/convert-model-onnx.html):

```bash
ovc ./out/rtdetr_v2_r18vd_onnx/model.onnx
```

The `./out/rtdetr_v2_r18vd_onnx/` directory now contains all files required to use the model with the DL Streamer `gvadetect` element.

## CLIP Models

DL Streamer supports using the Vision Transformer (ViT) component of CLIP models to generate image embeddings. However, this component cannot be extracted from the `CLIPModel` architecture by using `optimum-cli`. Instead, use the following Python script to convert the Vision Transformer from **clip-vit-large-patch14**, **clip-vit-base-patch16**, or **clip-vit-base-patch32** to Intel® OpenVINO™ format. Because conversion is best performed with a sample input, prepare an image in a common format and replace `IMG_PATH` with the appropriate value.

```python
from transformers import CLIPProcessor, CLIPVisionModel
import PIL
import openvino as ov
from openvino.runtime import PartialShape, Type
import sys
import os

MODEL='clip-vit-large-patch14'
IMG_PATH = "sample_image.jpg"

img = PIL.Image.open(IMG_PATH)
vision_model = CLIPVisionModel.from_pretrained('openai/'+MODEL)
processor = CLIPProcessor.from_pretrained('openai/'+MODEL)
batch = processor.image_processor(images=img, return_tensors='pt')["pixel_values"]

print("Conversion starting...")
ov_model = ov.convert_model(vision_model, example_input=batch)
print("Conversion finished.")

# Define the input shape explicitly
input_shape = PartialShape([-1, batch.shape[1], batch.shape[2], batch.shape[3]])

# Set the input shape and type explicitly
for input in ov_model.inputs:
   input.get_node().set_partial_shape(PartialShape(input_shape))
   input.get_node().set_element_type(Type.f32)

ov_model.set_rt_info("clip_token", ['model_info', 'model_type'])
ov_model.set_rt_info("68.500,66.632,70.323", ['model_info', 'scale_values'])
ov_model.set_rt_info("122.771,116.746,104.094", ['model_info', 'mean_values'])
ov_model.set_rt_info("True", ['model_info', 'reverse_input_channels'])
ov_model.set_rt_info("crop", ['model_info', 'resize_type'])

ov.save_model(ov_model, MODEL + ".xml")
```

Alternatively, you can use the [download_hf_models.py](https://github.com/open-edge-platform/dlstreamer/blob/main/scripts/download_models/download_hf_models.py) script, to perform the above steps automatically.

## Model Usage

The choice of the DL Streamer element that should be used to perform the inference with a given model depends on the task. The table below maps the tasks and sample model architectures to the appropriate inference elements.

| Task | Example Architecture | Inference element |
| --- | --- | --- |
| Image Classification | ViTForImageClassification | `gvaclassify` |
| Object Detection | RTDetrForObjectDetection | `gvadetect` |
| Speech Recognition | WhisperForConditionalGeneration  | `gvaaudiotranscribe` |
| Image To Text (VLMs) | Phi4MMForCausalLM | `gvagenai` |
| Image Embeddings | CLIPModel | `gvainference` |

See the following samples for detailed examples of DL Streamer pipelines that use transformer-based models:

1. [Using VLM Models With gvagenai Element](https://github.com/open-edge-platform/dlstreamer/tree/main/samples/gstreamer/gst_launch/gvagenai)
2. [Image Embeddings Generation with ViT](https://github.com/open-edge-platform/dlstreamer/blob/main/samples/gstreamer/gst_launch/lvm/)
3. [Face Detection and Classification](https://github.com/open-edge-platform/dlstreamer/tree/main/samples/gstreamer/python/face_detection_and_classification)
4. [Smart Network Video Recorder for Lane Hogging Detection](https://github.com/open-edge-platform/dlstreamer/tree/main/samples/gstreamer/python/smart_nvr)
5. [VLM Alerts](https://github.com/open-edge-platform/dlstreamer/tree/main/samples/gstreamer/python/vlm_alerts)