# Supported Models

The Multimodal Embedding Serving microservice supports multiple vision-language models for generating embeddings from text, images, and videos.

## Available Models

### CLIP (Contrastive Language-Image Pretraining)

| Model ID | Architecture | Embedding Dimension |
|----------|--------------|---------------------|
| `CLIP/clip-vit-b-32` | ViT-B-32 | 512 |
| `CLIP/clip-vit-b-16` | ViT-B-16 | 512 |
| `CLIP/clip-vit-l-14` | ViT-L-14 | 768 |
| `CLIP/clip-vit-h-14` | ViT-H-14 | 1024 |

Standard OpenAI CLIP models for general-purpose vision-language understanding.

### CN-CLIP (Chinese CLIP)

| Model ID | Architecture | Embedding Dimension |
|----------|--------------|---------------------|
| `CN-CLIP/cn-clip-vit-b-16` | ViT-B-16 | 512 |
| `CN-CLIP/cn-clip-vit-l-14` | ViT-L-14 | 768 |
| `CN-CLIP/cn-clip-vit-h-14` | ViT-H-14 | 1024 |

Chinese-optimized CLIP models supporting both Chinese and English text.

### MobileCLIP

| Model ID | Architecture | Embedding Dimension |
|----------|--------------|---------------------|
| `MobileCLIP/mobileclip_s0` | MobileCLIP-S0 | 512 |
| `MobileCLIP/mobileclip_s1` | MobileCLIP-S1 | 512 |
| `MobileCLIP/mobileclip_s2` | MobileCLIP-S2 | 512 |
| `MobileCLIP/mobileclip_b` | MobileCLIP-B | 512 |
| `MobileCLIP/mobileclip_blt` | MobileCLIP-BLT | 512 |

Lightweight CLIP models designed for mobile and edge deployment.

### SigLIP

| Model ID | Architecture | Embedding Dimension |
|----------|--------------|---------------------|
| `SigLIP/siglip2-vit-b-16` | ViT-B-16 | 768 |
| `SigLIP/siglip2-vit-l-16` | ViT-L-16 | 1024 |
| `SigLIP/siglip2-so400m-patch16-384` | ViT-So400M | 1152 |

CLIP-style models trained with a sigmoid loss instead of the standard softmax contrastive loss.

### BLIP-2 (Semantic Search / Retrieval)

| Model ID | Architecture | Embedding Dimension | HuggingFace Model | Handler |
|----------|--------------|---------------------|-------------------|---------|
| `Blip2/blip2_transformers` | BLIP-2 + Q-Former | 256 | `Salesforce/blip2-itm-vit-g` | Transformers |

The BLIP-2 handler uses `Blip2ForImageTextRetrieval` from HuggingFace Transformers with projection layers (768D→256D) to generate compact embeddings.

### Qwen Text Embeddings

| Model ID | Hugging Face Repo | Embedding Dimension | Precision | Notes |
|----------|-------------------|---------------------|-----------|-------|
| `QwenText/qwen3-embedding-0.6b` | `Qwen/Qwen3-Embedding-0.6B` | 1024 | INT8 | Text-only, instruction-aware, 32k context length |
| `QwenText/qwen3-embedding-4b` | `Qwen/Qwen3-Embedding-4B` | 2560 | INT8 | Text-only, instruction-aware, 32k context length |
| `QwenText/qwen3-embedding-8b` | `Qwen/Qwen3-Embedding-8B` | 4096 | INT8 | Text-only, instruction-aware, 32k context length |

The Qwen text embedding handler provides high-quality multilingual embeddings optimized with OpenVINO. These models:

- Are **text-only** and do not expose image or video encoders.
- Automatically wrap queries using the recommended instruction template (`"Instruct: {task_description}\nQuery:{query}"`), as sketched below.
- Convert to OpenVINO INT8 format on first use and store compiled artifacts under the configured `EMBEDDING_OV_MODELS_DIR`.
- Require `trust_remote_code=true` (handled by the factory).
- Support Intel GPU execution via OpenVINO.

Use the `/model/capabilities` endpoint to inspect which modalities the currently loaded model supports.
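For illustration, here is a minimal sketch of what the wrapped query looks like after the handler applies the instruction template. The wrapping happens server-side, so clients submit the raw query; the task description below is a hypothetical example value.

```bash
# Illustration only: the Qwen handler performs this wrapping automatically.
# TASK is a hypothetical example task description.
TASK="Given a web search query, retrieve relevant passages that answer the query"
QUERY="how do multimodal embeddings work"

# Prints the wrapped form the model actually embeds.
printf 'Instruct: %s\nQuery:%s\n' "$TASK" "$QUERY"
```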
## Model Configuration

Set your chosen model using environment variables:

```bash
# Example: Using BLIP-2 (Transformers)
export EMBEDDING_MODEL_NAME="Blip2/blip2_transformers"

# Example: Using CLIP
export EMBEDDING_MODEL_NAME="CLIP/clip-vit-b-16"

# Example: Using MobileCLIP
export EMBEDDING_MODEL_NAME="MobileCLIP/mobileclip_s0"

# Example: Using Qwen text embeddings (INT8 OpenVINO)
export EMBEDDING_MODEL_NAME="QwenText/qwen3-embedding-0.6b"
export EMBEDDING_USE_OV=true
export EMBEDDING_DEVICE=GPU  # or CPU/AUTO
export EMBEDDING_OV_MODELS_DIR=/app/ov_models

source setup.sh
```

All models support OpenVINO optimization for Intel hardware acceleration:

```bash
export EMBEDDING_USE_OV=true
export EMBEDDING_DEVICE=CPU  # or GPU
```

## OpenVINO Conversion Support

The service supports automatic OpenVINO conversion for all models. The conversion process detects whether a model has HuggingFace Hub support and uses the appropriate conversion method.

## Supported Input Formats

- **Text**: UTF-8 strings (available for all models)
- **Images**: JPEG, PNG, WebP, and other formats supported by PIL, base64-encoded. _Not available for Qwen text-only models._
- **Videos**: Any format supported by FFmpeg (MP4, AVI, MOV, etc.), base64-encoded. _Not available for Qwen text-only models._

All models are compatible with the OpenAI embeddings API format; see the request sketch at the end of this page.

## API Usage

Query available models:

```bash
curl http://localhost:9777/model/list
```

Get current model information:

```bash
curl http://localhost:9777/model/current
```

Inspect modality support for the active model:

```bash
curl http://localhost:9777/model/capabilities
```

## Related Documentation

- [Overview](./index.md): Architecture and capabilities overview
- [Get Started](./get-started.md): Step-by-step deployment instructions
- [SDK Usage](./sdk-usage.md): Python SDK integration guide
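## Example: OpenAI-Format Embeddings Request

A minimal request sketch, referenced from the API usage section above. The `/embeddings` route and the payload shape are assumptions based on the standard OpenAI embeddings API; consult the service's API reference for the exact path exposed by your deployment.

```bash
# ASSUMPTION: the service mounts an OpenAI-style /embeddings route on the
# same port as the /model endpoints shown above. The model ID and input
# text are example values.
curl http://localhost:9777/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "CLIP/clip-vit-b-16",
        "input": "a photo of a golden retriever"
      }'
```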