# Supported Models
The Multimodal Embedding Serving microservice supports multiple vision-language models for generating embeddings from text, images, and videos.
## Available Models
### CLIP (Contrastive Language-Image Pretraining)

| Model ID | Architecture | Embedding Dimension |
|---|---|---|
| | ViT-B-32 | 512 |
| | ViT-B-16 | 512 |
| | ViT-L-14 | 768 |
| | ViT-H-14 | 1024 |
Standard OpenAI CLIP models for general-purpose vision-language understanding.
### CN-CLIP (Chinese CLIP)

| Model ID | Architecture | Embedding Dimension |
|---|---|---|
| | ViT-B-16 | 512 |
| | ViT-L-14 | 768 |
| | ViT-H-14 | 1024 |
Chinese-optimized CLIP models supporting both Chinese and English text.
### MobileCLIP

| Model ID | Architecture | Embedding Dimension |
|---|---|---|
| | MobileCLIP-S0 | 512 |
| | MobileCLIP-S1 | 512 |
| | MobileCLIP-S2 | 512 |
| | MobileCLIP-B | 512 |
| | MobileCLIP-BLT | 512 |
Lightweight CLIP models designed for mobile and edge deployment.
### SigLIP

| Model ID | Architecture | Embedding Dimension |
|---|---|---|
| | ViT-B-16 | 768 |
| | ViT-L-16 | 1024 |
| | ViT-So400M | 1152 |
CLIP-style models trained with a sigmoid loss function instead of the standard softmax contrastive loss.
### BLIP-2 (Semantic Search / Retrieval)

| Model ID | Architecture | Embedding Dimension | HuggingFace Model | Handler |
|---|---|---|---|---|
| | BLIP-2 + Q-Former | 256 | | Transformers |
The BLIP-2 handler uses `Blip2ForImageTextRetrieval` from HuggingFace Transformers with projection layers (768D → 256D) to generate compact embeddings.
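For intuition, the projection step amounts to a linear map from the 768-D Q-Former features down to 256-D, typically followed by L2 normalization so that dot products behave as cosine similarities. A minimal sketch in plain Python (the weights here are random stand-ins, not the actual trained projection):

```python
import math
import random

random.seed(0)
IN_DIM, OUT_DIM = 768, 256

# Hypothetical stand-ins for the Q-Former output and the projection weights.
features = [random.gauss(0, 1) for _ in range(IN_DIM)]
weights = [[random.gauss(0, 0.01) for _ in range(IN_DIM)] for _ in range(OUT_DIM)]

# Project 768-D -> 256-D with a single linear layer.
embedding = [sum(w * x for w, x in zip(row, features)) for row in weights]

# L2-normalize so the dot product of two embeddings equals their cosine similarity.
norm = math.sqrt(sum(v * v for v in embedding))
embedding = [v / norm for v in embedding]

print(len(embedding))  # 256
```

The payoff of the smaller dimension is cheaper storage and faster similarity search over large collections, at the cost of some representational capacity.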
### Qwen Text Embeddings

| Model ID | Hugging Face Repo | Embedding Dimension | Precision | Notes |
|---|---|---|---|---|
| | | 1024 | INT8 | Text-only, instruction-aware, context length: 32k |
| | | 2560 | INT8 | Text-only, instruction-aware, context length: 32k |
| | | 4096 | INT8 | Text-only, instruction-aware, context length: 32k |
The Qwen text embedding handler provides high-quality multilingual embeddings optimized with OpenVINO. These models:

- Are text-only and do not expose image or video encoders.
- Automatically wrap queries using the recommended instruction template (`"Instruct: {task_description}\nQuery:{query}"`).
- Convert to OpenVINO INT8 format on first use and store compiled artifacts under the configured `EMBEDDING_OV_MODELS_DIR`.
- Require `trust_remote_code=true` (handled by the factory).
- Support Intel GPU execution via OpenVINO.
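To make the instruction wrapping concrete, here is what the template produces for a sample query (the task description below is made up for illustration; the handler applies this wrapping for you):

```python
def wrap_query(task_description: str, query: str) -> str:
    """Apply the recommended Qwen instruction template to a query."""
    return f"Instruct: {task_description}\nQuery:{query}"

wrapped = wrap_query(
    "Given a web search query, retrieve relevant passages",
    "how do I enable OpenVINO INT8?",
)
print(wrapped)
# Instruct: Given a web search query, retrieve relevant passages
# Query:how do I enable OpenVINO INT8?
```

Note that documents being indexed are typically embedded without the instruction prefix; only queries are wrapped.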
Use the `/model/capabilities` endpoint to inspect which modalities the currently loaded model supports.
## Model Configuration
Set your chosen model using environment variables:

```bash
# Example: Using BLIP-2 (Transformers)
export EMBEDDING_MODEL_NAME="Blip2/blip2_transformers"

# Example: Using CLIP
export EMBEDDING_MODEL_NAME="CLIP/clip-vit-b-16"

# Example: Using MobileCLIP
export EMBEDDING_MODEL_NAME="MobileCLIP/mobileclip_s0"

# Example: Using Qwen text embeddings (INT8 OpenVINO)
export EMBEDDING_MODEL_NAME="QwenText/qwen3-embedding-0.6b"
export EMBEDDING_USE_OV=true
export EMBEDDING_DEVICE=GPU  # or CPU/AUTO
export EMBEDDING_OV_MODELS_DIR=/app/ov_models

source setup.sh
```
All models support OpenVINO optimization for Intel hardware acceleration:

```bash
export EMBEDDING_USE_OV=true
export EMBEDDING_DEVICE=CPU  # or GPU
```
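As a rough sketch of how a service or client might resolve these variables at startup (the fallback defaults below are illustrative assumptions, not the microservice's documented behavior):

```python
import os
from dataclasses import dataclass


@dataclass
class EmbeddingConfig:
    model_name: str
    use_ov: bool
    device: str
    ov_models_dir: str


def load_config(env=os.environ) -> EmbeddingConfig:
    """Read embedding settings from environment variables with assumed defaults."""
    return EmbeddingConfig(
        model_name=env.get("EMBEDDING_MODEL_NAME", "CLIP/clip-vit-b-16"),
        use_ov=env.get("EMBEDDING_USE_OV", "false").lower() == "true",
        device=env.get("EMBEDDING_DEVICE", "CPU"),
        ov_models_dir=env.get("EMBEDDING_OV_MODELS_DIR", "/app/ov_models"),
    )


cfg = load_config({"EMBEDDING_USE_OV": "true", "EMBEDDING_DEVICE": "GPU"})
print(cfg.use_ov, cfg.device)  # True GPU
```

Treating the boolean as the string comparison `"true"` mirrors how shell-exported flags usually arrive: everything in the environment is a string until the application parses it.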
## OpenVINO Conversion Support
The service can convert any supported model to OpenVINO format automatically. The conversion process detects whether a model has HuggingFace Hub support and selects the appropriate conversion method.
## Supported Input Formats

- **Text**: UTF-8 strings (available for all models)
- **Images**: JPEG, PNG, WebP, base64-encoded (and other formats supported by PIL). Not available for Qwen text-only models.
- **Videos**: Any format supported by FFmpeg (MP4, AVI, MOV, etc.), base64-encoded. Not available for Qwen text-only models.
All models are compatible with the OpenAI embeddings API format.
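Because image and video inputs are passed base64-encoded, a client typically reads the raw file bytes, encodes them, and places the result in an OpenAI-style request body. A minimal sketch (the `model` value comes from the configuration examples above; the input field names are assumptions, so check the service's API reference for the exact schema):

```python
import base64
import json

# Stand-in image bytes; in practice, read these from a JPEG/PNG/WebP file.
image_bytes = b"\x89PNG\r\n\x1a\n...fake image data..."
image_b64 = base64.b64encode(image_bytes).decode("ascii")

# OpenAI-style embeddings request body (field names are illustrative).
payload = {
    "model": "CLIP/clip-vit-b-16",
    "input": [{"type": "image_base64", "image_base64": image_b64}],
}
body = json.dumps(payload)

# Round-trip check: the encoded image decodes back to the original bytes.
decoded = base64.b64decode(json.loads(body)["input"][0]["image_base64"])
assert decoded == image_bytes
```

Base64 inflates payload size by roughly a third, which is worth keeping in mind when embedding large videos.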
## API Usage

Query available models:

```bash
curl http://localhost:9777/model/list
```

Get current model information:

```bash
curl http://localhost:9777/model/current
```

Inspect modality support for the active model:

```bash
curl http://localhost:9777/model/capabilities
```
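Once embeddings come back from the service, a common downstream step is ranking candidates by cosine similarity over the returned vectors. A self-contained sketch with made-up low-dimensional vectors standing in for real model output:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 4-D embeddings standing in for real 512-D CLIP output.
query_emb = [0.1, 0.3, -0.2, 0.9]
doc_embs = {
    "cat photo": [0.1, 0.25, -0.1, 0.8],
    "invoice scan": [-0.5, 0.1, 0.7, -0.2],
}

# Rank documents by similarity to the query embedding.
best = max(doc_embs, key=lambda k: cosine_similarity(query_emb, doc_embs[k]))
print(best)  # cat photo
```

For models whose embeddings are already L2-normalized, the plain dot product gives the same ranking without the per-vector norm computation.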