# Large Vision Models

This page illustrates how to prepare the Vision Transformer from the CLIP models for integration with the Deep Learning Streamer pipeline.

> **NOTE:** The instructions provided below are comprehensive, but for convenience,
> it is recommended to use the
> [download_public_models.sh](https://github.com/open-edge-platform/edge-ai-libraries/tree/main/libraries/dl-streamer/samples/download_public_models.sh)
> script. This script will download all supported models and perform the
> necessary conversions automatically. See [download_public_models](./download_public_models.md) for more information.

## 1. Setup

The instructions assume that the Deep Learning Streamer framework is installed on the local system, along with the Intel® OpenVINO™ model downloader and converter tools, as described in the [Tutorial](../get_started/tutorial.md#setup).

It is also necessary to install the Transformers and Pillow packages:

```bash
pip install transformers
pip install pillow
```

## 2. Model preparation

Below are Python scripts for converting the Vision Transformer from the **clip-vit-large-patch14**, **clip-vit-base-patch16**, and **clip-vit-base-patch32** models to the Intel® OpenVINO™ format. The scripts are identical except for the value of *MODEL*. Since using a sample input is recommended during the conversion, prepare a sample image in one of the common formats and replace *IMG_PATH* with the relevant value:

### clip-vit-large-patch14

```python
from transformers import CLIPProcessor, CLIPVisionModel
from PIL import Image
import openvino as ov

MODEL = 'clip-vit-large-patch14'
IMG_PATH = "sample_image.jpg"

img = Image.open(IMG_PATH)

vision_model = CLIPVisionModel.from_pretrained('openai/' + MODEL)
processor = CLIPProcessor.from_pretrained('openai/' + MODEL)

# Preprocess the sample image to obtain an example input for the conversion
batch = processor.image_processor(images=img, return_tensors='pt')["pixel_values"]

print("Conversion starting...")
ov_model = ov.convert_model(vision_model, example_input=batch)
print("Conversion finished.")

# Define the input shape explicitly, keeping the batch dimension dynamic
input_shape = ov.PartialShape([-1, batch.shape[1], batch.shape[2], batch.shape[3]])

# Set the input shape and type explicitly
for model_input in ov_model.inputs:
    model_input.get_node().set_partial_shape(input_shape)
    model_input.get_node().set_element_type(ov.Type.f32)

# Embed the preprocessing metadata that Deep Learning Streamer reads at runtime
ov_model.set_rt_info("clip_token", ['model_info', 'model_type'])
ov_model.set_rt_info("68.500,66.632,70.323", ['model_info', 'scale_values'])
ov_model.set_rt_info("122.771,116.746,104.094", ['model_info', 'mean_values'])
ov_model.set_rt_info("True", ['model_info', 'reverse_input_channels'])
ov_model.set_rt_info("crop", ['model_info', 'resize_type'])

ov.save_model(ov_model, MODEL + ".xml")
```

### clip-vit-base-patch16

```python
from transformers import CLIPProcessor, CLIPVisionModel
from PIL import Image
import openvino as ov

MODEL = 'clip-vit-base-patch16'
IMG_PATH = "sample_image.jpg"

img = Image.open(IMG_PATH)

vision_model = CLIPVisionModel.from_pretrained('openai/' + MODEL)
processor = CLIPProcessor.from_pretrained('openai/' + MODEL)

# Preprocess the sample image to obtain an example input for the conversion
batch = processor.image_processor(images=img, return_tensors='pt')["pixel_values"]

print("Conversion starting...")
ov_model = ov.convert_model(vision_model, example_input=batch)
print("Conversion finished.")

# Define the input shape explicitly, keeping the batch dimension dynamic
input_shape = ov.PartialShape([-1, batch.shape[1], batch.shape[2], batch.shape[3]])

# Set the input shape and type explicitly
for model_input in ov_model.inputs:
    model_input.get_node().set_partial_shape(input_shape)
    model_input.get_node().set_element_type(ov.Type.f32)

# Embed the preprocessing metadata that Deep Learning Streamer reads at runtime
ov_model.set_rt_info("clip_token", ['model_info', 'model_type'])
ov_model.set_rt_info("68.500,66.632,70.323", ['model_info', 'scale_values'])
ov_model.set_rt_info("122.771,116.746,104.094", ['model_info', 'mean_values'])
ov_model.set_rt_info("True", ['model_info', 'reverse_input_channels'])
ov_model.set_rt_info("crop", ['model_info', 'resize_type'])

ov.save_model(ov_model, MODEL + ".xml")
```

### clip-vit-base-patch32

```python
from transformers import CLIPProcessor, CLIPVisionModel
from PIL import Image
import openvino as ov

MODEL = 'clip-vit-base-patch32'
IMG_PATH = "sample_image.jpg"

img = Image.open(IMG_PATH)

vision_model = CLIPVisionModel.from_pretrained('openai/' + MODEL)
processor = CLIPProcessor.from_pretrained('openai/' + MODEL)

# Preprocess the sample image to obtain an example input for the conversion
batch = processor.image_processor(images=img, return_tensors='pt')["pixel_values"]

print("Conversion starting...")
ov_model = ov.convert_model(vision_model, example_input=batch)
print("Conversion finished.")

# Define the input shape explicitly, keeping the batch dimension dynamic
input_shape = ov.PartialShape([-1, batch.shape[1], batch.shape[2], batch.shape[3]])

# Set the input shape and type explicitly
for model_input in ov_model.inputs:
    model_input.get_node().set_partial_shape(input_shape)
    model_input.get_node().set_element_type(ov.Type.f32)

# Embed the preprocessing metadata that Deep Learning Streamer reads at runtime
ov_model.set_rt_info("clip_token", ['model_info', 'model_type'])
ov_model.set_rt_info("68.500,66.632,70.323", ['model_info', 'scale_values'])
ov_model.set_rt_info("122.771,116.746,104.094", ['model_info', 'mean_values'])
ov_model.set_rt_info("True", ['model_info', 'reverse_input_channels'])
ov_model.set_rt_info("crop", ['model_info', 'resize_type'])

ov.save_model(ov_model, MODEL + ".xml")
```
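Before wiring the converted model into a pipeline, it can be useful to load the IR back with the OpenVINO™ runtime and run a single inference on the sample image. The snippet below is a minimal smoke-test sketch, assuming the **clip-vit-large-patch14** script above has already produced `clip-vit-large-patch14.xml` in the current directory (adjust *MODEL* for the other variants):

```python
from transformers import CLIPProcessor
from PIL import Image
import openvino as ov

MODEL = 'clip-vit-large-patch14'  # assumption: the conversion script above was run for this model
IMG_PATH = "sample_image.jpg"

# Load and compile the converted IR
core = ov.Core()
compiled_model = core.compile_model(core.read_model(MODEL + ".xml"), "CPU")

# Preprocess the sample image the same way as during the conversion
processor = CLIPProcessor.from_pretrained('openai/' + MODEL)
batch = processor.image_processor(images=Image.open(IMG_PATH), return_tensors='np')["pixel_values"]

# Run a single inference and print the output shapes
results = compiled_model([batch])
for i, output in enumerate(compiled_model.outputs):
    print(i, results[output].shape)
```

If the conversion succeeded, the script prints the shapes of the model outputs (for the CLIP vision encoder, the hidden states and the pooled embedding) without errors.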
## 3. Model usage

See the [generate_frame_embeddings.sh](https://github.com/open-edge-platform/edge-ai-libraries/blob/main/libraries/dl-streamer/samples/gstreamer/gst_launch/lvm/generate_frame_embeddings.sh) sample for detailed examples of Deep Learning Streamer pipelines using these models.
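For a quick illustration of how the converted model slots into a pipeline, a `gst-launch-1.0` command along the following lines can be used. This is an illustrative sketch rather than the exact sample pipeline: `input.mp4`, the model path, and `embeddings.json` are placeholders, and the linked script above remains the authoritative reference.

```bash
# Sketch: run the converted CLIP vision encoder on each frame and publish
# the raw tensor data (frame embeddings) to a JSON file.
gst-launch-1.0 filesrc location=input.mp4 ! decodebin3 ! \
  gvainference model=clip-vit-large-patch14.xml device=CPU ! \
  gvametaconvert add-tensor-data=true ! \
  gvametapublish method=file file-path=embeddings.json ! \
  fakesink sync=false
```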