Release Notes: Multimodal Embedding Serving
This microservice's features are driven by the requirements of the Video Search and Summarization sample application, which uses it. Refer to the Video Search and Summarization release notes for release details of this microservice.
Version 2026.1.0-rc1
14 May 2026
New
Batched sampled-frame video pipeline: Video embedding requests (URL, base64, local file, RTSP) now process frames through a streaming, batched frame-extraction path instead of extracting all frames upfront. Frame selection priority: frame_indexes > extraction_fps > num_frames > frame_interval. Set num_frames: 0 to process all frames in a video.
PyAV-based video decoder with shared memory transport: Replaced decord with a custom PyAV-based decoder (decoder.py) featuring a shared memory pool (SharedMemoryPool) for zero-copy frame metadata transport between pipeline stages. Supports file, URL, bytes, and RTSP sources with keyframe and uniform sampling strategies.
RTSP multi-stream ingestion: Multiple decoder instances run in parallel to ingest concurrent RTSP and file/bytes streams within a single request.
Async OpenVINO inference with static shape compilation: All model handlers now use AsyncInferQueue-based batched OpenVINO (OV) inference. GPU/iGPU models are compiled to a static batch shape at load time for higher hardware utilization; dynamic-size inputs are handled by padding or splitting.
Parallel image pre-processing: The new ParallelImagePreprocessor runs preprocessing on a thread pool while preserving batch order, decoupling preprocessing latency from inference.
Inference metrics reporting: All model handlers expose an optional metrics_out=True mode on encode_image() that returns timing and throughput metrics for both the OV and native PyTorch execution paths.
Configurable embedding pipeline via environment variables: The following variables are now exposed and seeded by setup.sh.
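The frame selection priority listed under the batched sampled-frame video pipeline reads as a chain of fallbacks. The following Python sketch resolves a list of frame indexes under that priority. select_frame_indexes is a hypothetical helper, and its sampling arithmetic (rounding source_fps / extraction_fps to a step, spreading num_frames uniformly, treating an empty configuration as "all frames") is an assumption, not the service's actual code.

```python
from typing import List, Optional, Sequence


def select_frame_indexes(
    total_frames: int,
    source_fps: float,
    frame_indexes: Optional[Sequence[int]] = None,
    extraction_fps: Optional[float] = None,
    num_frames: Optional[int] = None,
    frame_interval: Optional[int] = None,
) -> List[int]:
    """Hypothetical helper: pick frames using the documented priority
    frame_indexes > extraction_fps > num_frames > frame_interval.
    The sampling math is an illustrative assumption."""
    if frame_indexes is not None:
        # Explicit indexes win; drop duplicates and out-of-range values.
        return sorted({i for i in frame_indexes if 0 <= i < total_frames})
    if extraction_fps is not None and extraction_fps > 0:
        # Keep roughly one frame every source_fps / extraction_fps frames.
        step = max(1, round(source_fps / extraction_fps))
        return list(range(0, total_frames, step))
    if num_frames is not None:
        # num_frames: 0 means "process every frame" (per the release notes);
        # asking for more frames than exist is assumed to mean the same.
        if num_frames == 0 or num_frames >= total_frames:
            return list(range(total_frames))
        # Spread num_frames uniformly across the video.
        return [int(i * total_frames / num_frames) for i in range(num_frames)]
    if frame_interval is not None and frame_interval > 0:
        return list(range(0, total_frames, frame_interval))
    # Assumed default when nothing is configured: every frame.
    return list(range(total_frames))
```

For a 100-frame, 30 fps clip, select_frame_indexes(100, 30.0, extraction_fps=1.0, num_frames=4) returns [0, 30, 60, 90], because extraction_fps outranks num_frames.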
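The idea behind the decoder's shared memory transport can be illustrated with the Python standard library's multiprocessing.shared_memory module. This is a minimal sketch, not the service's SharedMemoryPool: SimpleSharedMemoryPool and read_block are hypothetical names. The producer writes a payload into a preallocated block once and passes only the block name and length to the next stage, which attaches to the block by name instead of receiving a pickled copy.

```python
from multiprocessing import shared_memory
from typing import Dict, List, Tuple


class SimpleSharedMemoryPool:
    """Hypothetical fixed-size pool of preallocated shared memory blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self._blocks: Dict[str, shared_memory.SharedMemory] = {}
        self._free: List[str] = []
        for _ in range(num_blocks):
            shm = shared_memory.SharedMemory(create=True, size=block_size)
            self._blocks[shm.name] = shm
            self._free.append(shm.name)

    def put(self, payload: bytes) -> Tuple[str, int]:
        """Copy payload into a free block; return (block name, length)."""
        if len(payload) > self.block_size:
            raise ValueError("payload larger than block_size")
        if not self._free:
            raise RuntimeError("pool exhausted; release a block first")
        name = self._free.pop()
        self._blocks[name].buf[: len(payload)] = payload
        return name, len(payload)

    def release(self, name: str) -> None:
        """Return a block to the pool once the consumer is done with it."""
        self._free.append(name)

    def close(self) -> None:
        """Unmap and destroy every block owned by the pool."""
        for shm in self._blocks.values():
            shm.close()
            shm.unlink()
        self._blocks.clear()
        self._free.clear()


def read_block(name: str, length: int) -> bytes:
    """Consumer side: attach by name and read the payload."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        # Copied out only so the mapping can be closed before returning.
        return bytes(shm.buf[:length])
    finally:
        shm.close()
```

In a real pipeline stage the consumer would work on shm.buf in place rather than copying; read_block copies with bytes() only so this example can close its mapping safely.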
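Compiling a model to a static batch shape means every inference call must receive exactly that many inputs. The following is a minimal sketch of the padding-and-splitting step, assuming the last chunk is padded by repeating its final item (the service may pad differently). run_static_batches is a hypothetical helper, and infer stands in for a call into a compiled model; the sketch does not use OpenVINO's AsyncInferQueue itself.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_static_batches(
    items: Sequence[T],
    batch_size: int,
    infer: Callable[[List[T]], List[R]],
) -> List[R]:
    """Hypothetical helper: feed a variable-length input through a model
    compiled for a fixed batch size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    results: List[R] = []
    # Split the input into chunks of at most batch_size items.
    for start in range(0, len(items), batch_size):
        chunk = list(items[start:start + batch_size])
        valid = len(chunk)
        # Pad the short final chunk (assumption: repeat its last item) so
        # every call matches the static batch shape.
        chunk.extend([chunk[-1]] * (batch_size - valid))
        outputs = infer(chunk)
        # Discard the outputs that belong to padding rows.
        results.extend(outputs[:valid])
    return results
```

With batch_size=4, ten inputs become three calls of four items each; the two padded rows in the last call are dropped, so the caller still gets exactly ten results.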
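Order-preserving parallel preprocessing can be sketched with concurrent.futures.ThreadPoolExecutor, whose map() returns results in input order even when workers finish out of order. preprocess_in_order is a hypothetical stand-in for ParallelImagePreprocessor, whose actual interface these notes do not describe.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def preprocess_in_order(
    items: Sequence[T],
    preprocess: Callable[[T], R],
    max_workers: int = 4,
) -> List[R]:
    """Hypothetical helper: run preprocess on a thread pool.

    Executor.map yields results in submission order, so the output batch
    lines up with the input batch regardless of completion order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(preprocess, items))
```

Because results come back in submission order, the i-th preprocessed image always lines up with the i-th input frame in the batch handed to inference.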
Improvements
PyTorch fallback for all model handlers: CLIP, SigLIP, MobileCLIP, BLIP2-Transformers, and CN-CLIP handlers transparently fall back to native PyTorch inference when OpenVINO is not configured, controlled via EMBEDDING_USE_OV.
Significant runtime memory reduction (up to 8–10×) and improved end-to-end throughput through the shared memory pipeline and async batched inference.
GPU deployment defaults are now applied automatically when EMBEDDING_DEVICE=GPU: OV_PERFORMANCE_MODE=THROUGHPUT, INFER_BATCH_SIZE=64, VIDEO_FRAME_BATCH_SIZE=256.
OpenVINO dependency bumped to 2026.1.0; Intel GPU driver updated to 26.09.37435.
Detailed logger format now includes filename, function name, and line number for easier debugging.
get-started.md updated with a full environment variable reference, preset configuration examples (GPU, high-throughput, memory-constrained, debug), and a performance tuning guide.
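The GPU defaults and the EMBEDDING_USE_OV switch described above amount to a small configuration-resolution step. The sketch below is hypothetical: resolve_embedding_config and the returned "backend" key are illustrative names, and it assumes that explicitly set variables override the GPU defaults, that an unset EMBEDDING_DEVICE means CPU, and that an unset EMBEDDING_USE_OV selects the PyTorch path. Check get-started.md for the service's actual behavior.

```python
import os
from typing import Dict, Mapping, Optional

# GPU defaults as listed in these release notes.
GPU_DEFAULTS: Dict[str, str] = {
    "OV_PERFORMANCE_MODE": "THROUGHPUT",
    "INFER_BATCH_SIZE": "64",
    "VIDEO_FRAME_BATCH_SIZE": "256",
}


def resolve_embedding_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Hypothetical resolver for the documented GPU defaults and backend switch."""
    source = os.environ if env is None else env
    resolved: Dict[str, str] = {}
    # Assumption: EMBEDDING_DEVICE defaults to CPU when unset.
    if source.get("EMBEDDING_DEVICE", "CPU").upper() == "GPU":
        resolved.update(GPU_DEFAULTS)
    # Assumption: explicitly set variables win over the GPU defaults.
    for key in GPU_DEFAULTS:
        if key in source:
            resolved[key] = source[key]
    # Assumption: OpenVINO is used only when EMBEDDING_USE_OV is truthy;
    # otherwise handlers fall back to native PyTorch.
    use_ov = source.get("EMBEDDING_USE_OV", "false").strip().lower() in {"1", "true", "yes"}
    resolved["backend"] = "openvino" if use_ov else "pytorch"
    return resolved
```

Passing an explicit mapping instead of reading os.environ keeps the resolver easy to test, for example resolve_embedding_config({"EMBEDDING_DEVICE": "GPU", "EMBEDDING_USE_OV": "true"}).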
Breaking Changes
encode_image() no longer accepts a pre-processed torch.Tensor as input; pass a PIL.Image or List[PIL.Image] instead.
decord has been removed as a dependency; replace any direct decord usage with PyAV (the av package).
Validated configuration
Intel® Xeon® 5 + Intel® Arc™ B580 GPU, Intel® Core™ Ultra Processors (Series 2 and 3)
Vanilla Kubernetes Cluster
Version 1.3.2
20 March 2026
New
Support for Intel® Core™ Ultra Processors (Series 3)
Added support for date- and time-based search queries
Validated configuration
Intel® Xeon® 5 + Intel® Arc™ B580 GPU, Intel® Core™ Ultra Processors (Series 2 and 3)
Vanilla Kubernetes Cluster