Deploying VSS with vLLM on Kubernetes Using Helm

Overview

This guide covers deploying the Video Search and Summarization (VSS) application on Kubernetes with vLLM as the LLM inference backend. vLLM exposes an OpenAI-compatible API and provides efficient CPU-based inference on Intel Xeon systems, so no GPU is required.

This is one of several supported deployment configurations. For an overview of all configurations (including OVMS, VLM Microservice, and GPU-based deployment), see Deploy with Helm. For a conceptual overview of how VSS works, see How It Works.


Prerequisites

Hardware Requirements

For best performance, Intel® Xeon® 6 Processors are recommended.

| Component | Specification |
| --- | --- |
| CPU Cores | For optimal performance: 16 cores for vLLM, plus additional cores for ingestion, embedding, vectordb, and other microservices |
| Memory (RAM) | Minimum 256 GB total system memory |
| Disk Space | Minimum 500 GB (SSD recommended for optimal performance) |
| Storage | Dynamic storage provisioning capability (NFS or local storage) |

Software Requirements

| Tool | Version | Installation Guide |
| --- | --- | --- |
| Kubernetes | v1.24 or later | Kubernetes docs |
| kubectl | Latest | kubectl docs |
| Helm | v3.0 or later | Helm docs |

Your cluster must support dynamic provisioning of Persistent Volumes. Confirm a default storage class is configured:

kubectl get storageclass

See the Kubernetes Dynamic Provisioning Guide if none is available.
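
To pick the default class out of that listing without reading it by eye, a small filter can help. This is a minimal sketch: it assumes the usual kubectl table layout, in which the default class is tagged with "(default)" right after its name, and the function name default_storage_class is illustrative only.

```shell
# Print the name of the default StorageClass from `kubectl get storageclass`
# output read on stdin. Assumes kubectl marks the default class with
# "(default)" as the second whitespace-separated field of its row.
default_storage_class() {
  awk 'NR > 1 && $2 == "(default)" { print $1 }'
}

# Usage (requires access to the cluster):
#   kubectl get storageclass | default_storage_class
```

An empty result suggests that no default class is configured.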


Step 1: Acquire the Helm Chart

# Clone the repository (main branch)
git clone https://github.com/open-edge-platform/edge-ai-libraries.git edge-ai-libraries

# Navigate to the chart directory
cd edge-ai-libraries/sample-applications/video-search-and-summarization/chart

Step 2: Configure Required Values

Open user_values_override.yaml in your editor:

nano user_values_override.yaml

Required Parameters

| Key | Description | Example Value |
| --- | --- | --- |
| global.sharedPvcName | Name of the shared PVC for all components | vss-shared-pvc |
| global.huggingfaceToken | Hugging Face API token for model access | hf_xxxxxxxxxxxxxxxxxxxx |
| global.vlmName | Vision Language Model used for video analysis | Qwen/Qwen2.5-VL-3B-Instruct |
| global.env.POSTGRES_USER | PostgreSQL username | vsadmin |
| global.env.POSTGRES_PASSWORD | PostgreSQL password | <secure-password> |
| global.env.MINIO_ROOT_USER | MinIO username (min 3 chars) | minioadmin |
| global.env.MINIO_ROOT_PASSWORD | MinIO password (min 8 chars) | <secure-password> |
| global.env.RABBITMQ_DEFAULT_USER | RabbitMQ username | guest |
| global.env.RABBITMQ_DEFAULT_PASS | RabbitMQ password | <secure-password> |
| global.env.EMBEDDING_MODEL_NAME | Multimodal embedding model | CLIP/clip-vit-b-32 (search) or QwenText/qwen3-embedding-0.6b (unified) |

For the full parameter catalog across all deployment modes, see Deploy with Helm.
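
Because Helm reads dotted keys as nested YAML maps, the required keys listed above correspond to a nested structure in the values file. The sketch below writes such a structure to a scratch file for comparison. The nesting is inferred from the key names alone, every value is a placeholder from the list above, and the helper name write_values_sketch is hypothetical, so check the result against the user_values_override.yaml shipped with the chart.

```shell
# Write an illustrative values file to the path given as $1.
# Assumption: the layout below is derived only from the dotted key names;
# the chart's own user_values_override.yaml may contain further keys.
write_values_sketch() {
  cat > "$1" <<'EOF'
global:
  sharedPvcName: vss-shared-pvc
  huggingfaceToken: hf_xxxxxxxxxxxxxxxxxxxx
  vlmName: Qwen/Qwen2.5-VL-3B-Instruct
  env:
    POSTGRES_USER: vsadmin
    POSTGRES_PASSWORD: "<secure-password>"
    MINIO_ROOT_USER: minioadmin
    MINIO_ROOT_PASSWORD: "<secure-password>"
    RABBITMQ_DEFAULT_USER: guest
    RABBITMQ_DEFAULT_PASS: "<secure-password>"
    EMBEDDING_MODEL_NAME: CLIP/clip-vit-b-32
EOF
}

# Usage: write to a scratch path so the real override file is not touched.
#   write_values_sketch /tmp/values_sketch.yaml
```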

Optional Parameters

| Key | Description | Example Value |
| --- | --- | --- |
| global.keepPvc | Retain PVC on helm uninstall to avoid re-downloading models | true |
| global.proxy.http_proxy | HTTP proxy (if required by your environment) | http://proxy-example.com:000 |
| global.proxy.https_proxy | HTTPS proxy (if required by your environment) | http://proxy-example.com:000 |

vLLM-Specific Parameters

The xeon_vllm_values.yaml override file (included in the chart) pre-configures vLLM with sensible defaults for Intel Xeon. You can override individual values as needed:

| Key | Description | Default | Notes |
| --- | --- | --- | --- |
| vllm.resources.requests.cpu | CPU request for the vLLM pod | 16 | Increase for higher throughput |
| vllm.resources.requests.memory | Memory request for the vLLM pod | 128Gi | Increase for larger models |
| vllm.pvc.size | Model cache PVC size | 80Gi | Increase for larger model footprints |
| vllm.modelCachePath | Model cache mount path in the pod | /cache/vllm | Uses shared PVC |

Model selection: vllm.enabled: true (set by xeon_vllm_values.yaml) automatically disables the VLM Inference Microservice (vlminference.enabled: false). vLLM uses the model specified in global.vlmName; ensure it is compatible with vLLM and available on Hugging Face.
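
Besides editing the values file, individual defaults can be overridden on the command line with Helm's --set flag, which takes precedence over values files. The following sketch only assembles the flags. The helper name vllm_resource_flags and the sample quantities 32 and 192Gi are illustrative assumptions, not recommended settings.

```shell
# Build `--set` flags that change the vLLM CPU and memory requests.
# Arguments: <cpu-cores> <memory-quantity>, for example 32 192Gi (illustrative).
vllm_resource_flags() {
  printf '%s\n' "--set vllm.resources.requests.cpu=$1 --set vllm.resources.requests.memory=$2"
}

# Usage: append the flags to the install command described in Step 5.
#   helm install vss . -f summary_override.yaml -f xeon_vllm_values.yaml \
#     -f user_values_override.yaml $(vllm_resource_flags 32 192Gi) -n ${NAMESPACE}
```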

Step 3: Build Helm Dependencies

From the chart directory, run:

helm dependency update

Verify all dependencies are resolved:

helm dependency list
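
The listing can also be filtered so that only problem entries remain. This is a rough sketch that assumes the usual four-column output (NAME, VERSION, REPOSITORY, STATUS) with no spaces in the first three fields, and that treats the statuses "ok" and "unpacked" as resolved. The function name unresolved_dependencies is illustrative.

```shell
# Read `helm dependency list` output on stdin and print every dependency
# whose status is neither "ok" nor "unpacked".
# Assumption: NAME, VERSION, and REPOSITORY are non-empty and contain no
# spaces, so the status starts at the fourth field.
unresolved_dependencies() {
  awk 'NR > 1 && NF >= 4 {
    status = $4
    for (i = 5; i <= NF; i++) status = status " " $i
    if (status != "ok" && status != "unpacked") print $1 ": " status
  }'
}

# Usage (from the chart directory):
#   helm dependency list | unresolved_dependencies
```

No output means every listed dependency reported a resolved status.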

Step 4: Create a Namespace

export NAMESPACE=vss-deployment
kubectl create namespace ${NAMESPACE}

All subsequent commands assume the NAMESPACE variable is set in your shell session.
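
Since an unset variable would cause later commands to run against the wrong namespace or fail, a small guard can catch this early. A minimal sketch, with require_namespace as a hypothetical helper name:

```shell
# Fail with a clear message when NAMESPACE is unset or empty.
require_namespace() {
  if [ -z "${NAMESPACE:-}" ]; then
    echo "NAMESPACE is not set; run: export NAMESPACE=vss-deployment" >&2
    return 1
  fi
}

# Usage: guard any later command with the check.
#   require_namespace && kubectl get pods -n "${NAMESPACE}"
```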


Step 5: Deploy with vLLM

Choose the deployment mode that fits your use case. In both cases, xeon_vllm_values.yaml enables vLLM and configures resource allocations for Intel Xeon CPUs.

Switching modes: Always uninstall the current release before switching to a different mode:

helm uninstall vss -n ${NAMESPACE}

Option A: Video Summarization Only

Deploys the summarization pipeline with vLLM for text generation.

helm install vss . \
  -f summary_override.yaml \
  -f xeon_vllm_values.yaml \
  -f user_values_override.yaml \
  -n ${NAMESPACE}

Option B: Unified Video Search and Summarization

Deploys both the search and summarization pipelines with vLLM. Before installing, set global.env.TEXT_EMBEDDING_MODEL_NAME in user_values_override.yaml and confirm that global.embedding.preferTextModel is true (unified_summary_search.yaml already sets it).

helm install vss . \
  -f unified_summary_search.yaml \
  -f xeon_vllm_values.yaml \
  -f user_values_override.yaml \
  -n ${NAMESPACE}

Requirement: The chart will raise an error if global.env.TEXT_EMBEDDING_MODEL_NAME is omitted while unified mode is enabled. Review the supported model list in supported-models before choosing model IDs.
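
To catch the missing key before Helm does, the values file can be checked with a simple text search. This is a sketch rather than a YAML parser: it only looks for an uncommented TEXT_EMBEDDING_MODEL_NAME line with some value after the colon, so a quoted empty string would still pass. The function name has_text_embedding_model is illustrative.

```shell
# Succeed only if the file given as $1 sets TEXT_EMBEDDING_MODEL_NAME to a
# non-empty value on an uncommented line. Plain-text check, not a YAML parse.
has_text_embedding_model() {
  grep -Eq '^[[:space:]]*TEXT_EMBEDDING_MODEL_NAME:[[:space:]]*[^[:space:]#]' "$1"
}

# Usage:
#   has_text_embedding_model user_values_override.yaml \
#     || echo "Set global.env.TEXT_EMBEDDING_MODEL_NAME before installing" >&2
```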

Understanding the override files:

| File | Purpose |
| --- | --- |
| summary_override.yaml | Enables the summarization pipeline |
| unified_summary_search.yaml | Enables combined search and summarization; sets preferTextModel: true |
| xeon_vllm_values.yaml | Enables vLLM, disables VLM Microservice, sets Xeon-optimized resource allocations |
| user_values_override.yaml | Your credentials, model selections, and environment-specific overrides |
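
Helm gives priority to the right-most -f file, which is why user_values_override.yaml comes last in both commands. Because the two modes differ only in the first file, the command line can be generated rather than retyped. The helper below is a sketch with a hypothetical name (vss_helm_cmd); it prints the command for review and does not execute anything.

```shell
# Print the helm command for an action (install|upgrade) and a mode
# (summary|unified). Falls back to the namespace name used in Step 4
# when NAMESPACE is unset.
vss_helm_cmd() {
  action=$1
  case "$2" in
    summary) mode_file=summary_override.yaml ;;
    unified) mode_file=unified_summary_search.yaml ;;
    *) echo "unknown mode: $2" >&2; return 1 ;;
  esac
  echo "helm $action vss . -f $mode_file -f xeon_vllm_values.yaml -f user_values_override.yaml -n ${NAMESPACE:-vss-deployment}"
}

# Usage:
#   vss_helm_cmd install unified
#   vss_helm_cmd upgrade summary
```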


Step 6: Verify the Deployment

Monitor pod startup progress:

kubectl get pods -n ${NAMESPACE} -w

After a successful deployment, every pod should report STATUS Running and READY 1/1.

First-time startup: Pods can take 20–30 minutes to reach the Running state, because the models (vLLM, embedding, and object detection, up to ~50 GB in total) are downloaded from Hugging Face and cached. Set global.keepPvc: true to skip model re-downloads on reinstallation.
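
During a long first startup it can be easier to list only the pods that are still pending. The filter below is a sketch that assumes the default kubectl column order (NAME, READY, STATUS, RESTARTS, AGE); the function name pods_not_ready is illustrative. Pods that finish by design, such as completed jobs, would also be listed.

```shell
# Read `kubectl get pods` output on stdin and print the pods that are not
# yet Running with all of their containers ready.
pods_not_ready() {
  awk 'NR > 1 {
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}

# Usage:
#   kubectl get pods -n "${NAMESPACE}" | pods_not_ready
# Alternative built into kubectl, which blocks until ready or timeout:
#   kubectl wait --for=condition=Ready pods --all -n "${NAMESPACE}" --timeout=30m
```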


Step 7: Access the Application

Once all pods are running, retrieve the URL:

NGINX_HOST=$(kubectl get pods -l app=vss-nginx -n ${NAMESPACE} -o jsonpath='{.items[0].status.hostIP}')
NGINX_PORT=$(kubectl get service vss-nginx -n ${NAMESPACE} -o jsonpath='{.spec.ports[0].nodePort}')
echo "http://${NGINX_HOST}:${NGINX_PORT}"

Open the URL in your browser to access the VSS dashboard.


Managing the Deployment

Upgrading

After editing user_values_override.yaml, apply changes with:

helm upgrade vss . \
  -f summary_override.yaml \
  -f xeon_vllm_values.yaml \
  -f user_values_override.yaml \
  -n ${NAMESPACE}

Replace summary_override.yaml with unified_summary_search.yaml for the unified mode.


Troubleshooting

For troubleshooting guidance, see Deploy with Helm — Troubleshooting.


Uninstallation

helm uninstall vss -n ${NAMESPACE}

# Optional: delete the namespace
kubectl delete namespace ${NAMESPACE}

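By default, PVCs are deleted with the Helm release. If you set global.keepPvc: true, PVCs are retained and can be reused by future deployments to avoid re-downloading models. Because PVCs are namespaced resources, deleting the namespace removes them regardless of this setting, so skip the optional namespace deletion if you want to keep the model cache.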