Quick Start: Live Video Captioning#

Get the application up and running with a USB camera or webcam in 5 steps!

Live Video Captioning uses a vision-language model (VLM) to automatically describe what is happening in a live video stream, from a camera or an RTSP feed, and displays those descriptions in real time on a web dashboard.

Note:

Setup time depends mainly on network bandwidth, because the model and the container images are downloaded during first-time setup.

If no USB camera or webcam is attached, you can configure a test RTSP stream by following these instructions.

Before You Begin#

Make sure your machine meets these minimums:

What	Minimum
Processor	Intel® Core™ Ultra (2nd or 3rd gen) with integrated GPU
Memory	16 GB RAM
Disk	64 GB free SSD space
OS	Ubuntu 24.04 or 24.10
Internet	Required for first-time setup
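As a rough preflight, the memory and disk minimums above can be checked from a terminal. The sketch below is an illustration and not part of the application: the function names are hypothetical, it assumes a Linux host with /proc/meminfo, and it treats GB as GiB for simplicity, which makes the disk check slightly stricter than the table.

```shell
# Succeeds when the measured value ($1) is at least the required minimum ($2).
meets_minimum() {
  [ "$1" -ge "$2" ]
}

# Total RAM in GiB, rounded up, read from /proc/meminfo (MemTotal is in kB).
# Rounding up keeps a nominal 16 GB machine from being reported as 15.
total_ram_gib() {
  awk '/^MemTotal:/ { printf "%d\n", ($2 + 1048575) / 1048576 }' /proc/meminfo 2>/dev/null
}

# Free space in GiB on the filesystem that holds the given path.
free_disk_gib() {
  df -P -k "$1" 2>/dev/null | awk 'NR == 2 { printf "%d\n", $4 / 1048576 }'
}

# Prints one OK/WARN line: $1 = label, $2 = measured value, $3 = minimum.
# An empty measurement (for example, no /proc/meminfo) is treated as 0.
report() {
  if meets_minimum "${2:-0}" "$3"; then
    echo "OK   $1: ${2:-0} (minimum $3)"
  else
    echo "WARN $1: ${2:-0} (minimum $3)"
  fi
}

report "RAM (GiB)" "$(total_ram_gib)" 16
report "Free disk (GiB)" "$(free_disk_gib "$PWD")" 64
```

A WARN line only means the measured value is below the documented minimum; the processor and OS rows still need to be checked by hand.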

You also need Docker installed. If you do not have it yet, run the following two commands in a terminal:

curl -fsSL https://get.docker.com | sudo sh
sudo usermod -aG docker $USER

After running those commands, log out and log back in for the changes to take effect.
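To confirm that the group change is active in the current session, one option is to look for docker in the output of id -nG. This is a minimal sketch; the in_group helper is a hypothetical name introduced here for illustration.

```shell
# Succeeds when the group name ($1) appears as a whole word in the
# space-separated group list ($2), such as the output of `id -nG`.
in_group() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

if in_group docker "$(id -nG)"; then
  echo "docker group is active in this session"
else
  echo "docker group is not active yet: log out and back in, then recheck"
fi
```

Matching on whole words avoids a false positive from a group whose name merely contains "docker".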

Step 1 — Get the Code#

Open a terminal and run:

git clone --filter=blob:none --sparse --branch main https://github.com/open-edge-platform/edge-ai-suites.git
cd edge-ai-suites
git sparse-checkout set metro-ai-suite
cd metro-ai-suite/live-video-analysis/live-video-captioning

Step 2 — Set Up Configuration#

Run the setup script — it automatically detects your machine’s IP address and prepares the configuration file:

bash scripts/setup_env.sh

Step 3 — Download the AI Model (one-time, ~5 min)#

This downloads the AI model that powers the captions. It only needs to run once. The --model parameter is configurable, and you are asked to confirm the license agreement before the download starts.

./model_download_scripts/download_models.sh \
  --model OpenGVLab/InternVL2-1B \
  --type vlm \
  --weight-format int8

Specifying the conversion device#

By default, the model is converted to run on CPU. To convert it for a different device, pass the --device flag:

# Replace <CPU|GPU|NPU> below with exactly one target device: CPU, GPU, or NPU
./model_download_scripts/download_models.sh \
  --model OpenGVLab/InternVL2-1B \
  --type vlm \
  --weight-format int8 \
  --device <CPU|GPU|NPU>

Note: NPU currently requires int4 quantization for VLM conversion. If you pass --device NPU with int8 or fp16, the script automatically overrides it to int4.
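The override rule in the note can be written out as a small function. This is only an illustration of the documented behavior, not the download script's actual code, and effective_weight_format is a hypothetical name.

```shell
# Prints the weight format that ends up being used for a device ($1) and a
# requested format ($2), following the note above: on NPU, int8 and fp16
# are replaced by int4; every other combination is left unchanged.
effective_weight_format() {
  if [ "$1" = "NPU" ]; then
    case "$2" in
      int8|fp16)
        echo "int4"
        return 0
        ;;
    esac
  fi
  echo "$2"
}

effective_weight_format CPU int8   # prints int8
effective_weight_format NPU int8   # prints int4
effective_weight_format NPU fp16   # prints int4
```

In practice this means passing --weight-format int4 together with --device NPU states the outcome explicitly instead of relying on the override.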

Step 4 — Start the Application#

docker compose up -d

Docker pulls the required containers and starts all services in the background. The first run may take a few minutes to download images.
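Because the services start in the background, it can help to poll until the dashboard answers instead of guessing how long to wait. The sketch below defines a generic retry helper; retry_until is a hypothetical name, and the commented usage assumes curl is installed and that the dashboard responds on port 4173 as described in Step 5.

```shell
# Runs a command until it succeeds, trying at most $1 times and sleeping
# $2 seconds between attempts. Returns 0 on the first success and 1 if
# every attempt fails.
retry_until() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only when another attempt will follow.
    if [ "$n" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    n=$((n + 1))
  done
  return 1
}

# Usage sketch (replace <YOUR_IP> as described in Step 5):
#   docker compose ps
#   retry_until 30 2 curl -fsS -o /dev/null "http://<YOUR_IP>:4173"
```

docker compose ps lists the state of each service, which is a quick first check before polling the dashboard URL.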

Step 5 — Open the Dashboard#

Once the services are running, open a web browser and go to:

http://<YOUR_IP>:4173

Replace <YOUR_IP> with the IP address shown at the end of Step 2, or find it by running hostname -I in the terminal.
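The URL can also be assembled in the terminal. This sketch assumes that the first address printed by hostname -I is the one to use, which may not hold on machines with several network interfaces; dashboard_url and first_ip are hypothetical helper names.

```shell
# Builds the dashboard URL for a host IP, using port 4173 from this step.
dashboard_url() {
  echo "http://$1:4173"
}

# `hostname -I` can print several addresses; this takes the first one.
first_ip() {
  hostname -I 2>/dev/null | awk '{ print $1 }'
}

ip=$(first_ip)
if [ -n "$ip" ]; then
  dashboard_url "$ip"
else
  echo "could not detect an IP address; use the one shown at the end of Step 2"
fi
```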

Using the Dashboard#

Enter your video source: paste an RTSP camera URL (for example, rtsp://192.168.1.10/stream) or select the USB/webcam device if one is available.
Select the VLM device: choose the hardware device on which the VLM will run (for example, “CPU”, “GPU”, or “NPU”). The available options depend on the devices present on your host system.
Select a model: choose from the available AI models in the drop-down list.
Click Start: captions appear alongside the live video preview.
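Before pasting an RTSP URL into the dashboard, it can be useful to confirm that the camera's host and port are reachable from this machine. The sketch below only extracts the host and port from the URL; it falls back to 554, the default RTSP port, and does not handle IPv6 literals. The helper name rtsp_host_port is hypothetical, and the commented connectivity check assumes nc (netcat) is installed.

```shell
# Prints "host port" for an RTSP URL. Uses port 554 when the URL does not
# name one. Optional user:password@ credentials are dropped.
rtsp_host_port() {
  rest=${1#rtsp://}          # drop the scheme
  authority=${rest%%/*}      # keep everything before the first slash
  authority=${authority##*@} # drop credentials, if any
  case "$authority" in
    *:*) echo "${authority%%:*} ${authority##*:}" ;;
    *)   echo "$authority 554" ;;
  esac
}

rtsp_host_port "rtsp://192.168.1.10/stream"   # prints: 192.168.1.10 554

# Usage sketch for a TCP reachability check:
#   set -- $(rtsp_host_port "rtsp://192.168.1.10/stream")
#   nc -z -w 3 "$1" "$2" && echo "reachable" || echo "not reachable"
```

A successful TCP connection only shows that something is listening on that port; it does not prove that the stream path itself is valid.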

Step 6 — Stop the Application#

When you are done, stop all services with:

docker compose down

Troubleshooting#

Problem	What to try
Dashboard does not load	Wait 30 seconds after `docker compose up -d`, then refresh
No captions appear	Check that the RTSP URL is reachable from this machine
Stream behind a proxy	Add the camera’s IP to `no_proxy` in your shell environment
“permission denied” with Docker	Run `sudo usermod -aG docker $USER`, then log out and back in
“failed to resolve reference docker.io” with Docker	The Docker daemon cannot reach Docker Hub over the network to download the microservice images. This can happen when the organization proxy is not configured in the Docker setup. Follow these instructions to set it up.
Hardware-encoded camera not supported	This application does not support webcams that provide only a hardware-encoded format (for example, H.264). Use a compatible webcam that provides YUYV or MJPEG output.
Model download fails with authentication error	Set the `HUGGINGFACEHUB_API_TOKEN` environment variable and rerun the command.
Model download interrupted or fails due to network	Remove the `ovms_model` folder and the model-specific folder (`ov_models/` for VLMs), then rerun the command. The download container is automatically cleaned up when the helper exits.
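For the hardware-encoded camera row, one way to see which formats a webcam offers is v4l2-ctl from the v4l-utils package. The sketch below filters that output for YUYV or MJPG. It assumes v4l2-ctl prints each pixel format as a quoted FourCC code such as 'YUYV', and /dev/video0 is an assumed device path; has_supported_format is a hypothetical helper name.

```shell
# Reads `v4l2-ctl --list-formats-ext` output on stdin and succeeds when it
# lists a YUYV or MJPG pixel format (matched as a quoted FourCC code).
has_supported_format() {
  grep -Eq "'(YUYV|MJPG)'"
}

# Usage sketch:
#   if v4l2-ctl --device /dev/video0 --list-formats-ext | has_supported_format; then
#     echo "webcam offers YUYV or MJPG"
#   else
#     echo "webcam offers neither YUYV nor MJPG"
#   fi
```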

Next Steps#

Once you are comfortable with the basics:

System Requirements — full hardware and software details
Get Started — complete setup guide with all configuration options
How It Works — understand the architecture behind the application