Optimize Inference#

Quantization#

In machine learning, quantization reduces model size and speeds up inference by converting high-precision parameters (e.g., 32-bit floats) into lower-precision formats (e.g., 8-bit integers). It can be applied after training (post-training quantization) or during training (quantization-aware training) to minimize accuracy loss while improving efficiency, memory usage, and hardware compatibility.
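
As a minimal illustration of the arithmetic involved (independent of any particular toolkit), the sketch below applies affine int8 quantization to a small tensor; the values, scale, and zero point are purely illustrative:

```python
import numpy as np

# Illustrative weights; a real tensor would hold millions of values.
weights = np.array([-1.2, -0.3, 0.0, 0.7, 2.5], dtype=np.float32)

# Derive the quantization parameters from the observed value range.
qmin, qmax = -128, 127
scale = (weights.max() - weights.min()) / (qmax - qmin)
zero_point = int(round(qmin - weights.min() / scale))

# Quantize float32 -> int8, then dequantize to see the rounding error.
q = np.clip(np.round(weights / scale) + zero_point, qmin, qmax).astype(np.int8)
dequantized = (q.astype(np.float32) - zero_point) * scale

print(q)            # int8 storage: 4x smaller than float32
print(dequantized)  # close to the original values, within rounding error
```

Storing the int8 values instead of the floats cuts the footprint by 4x; the dequantized values show the rounding error that quantization introduces.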

NNCF#

The default optimization tool used in OEP is NNCF (Neural Network Compression Framework). It is a set of compression algorithms, organized as a Python package, that make your models smaller and faster. Note that NNCF is not part of the OpenVINO package, so it needs to be installed separately. It supports models in PyTorch, TensorFlow, ONNX, and OpenVINO IR formats, offering the following main optimizations:

Weight Compression: an easy-to-use method for Large Language Model footprint reduction and inference acceleration.
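
A minimal sketch of what this might look like with NNCF on a model already exported to OpenVINO IR; the model path below is hypothetical, and int8 weight compression is the default mode:

```python
import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("llm/openvino_model.xml")  # hypothetical IR path

# Compress weights to int8 (the default); activations stay in floating point,
# so no calibration dataset is required.
compressed_model = nncf.compress_weights(model)

ov.save_model(compressed_model, "llm/openvino_model_int8.xml")
```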

Post-training Quantization: designed to optimize deep learning models by applying 8-bit integer quantization. As the easiest way to optimize a model, it requires no retraining or fine-tuning, but it may cause a drop in accuracy. If the accuracy-performance tradeoff is not acceptable, Training-time Optimization may be a better option.
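
A minimal sketch of post-training quantization with NNCF, assuming a hypothetical FP32 IR model with a 1x3x224x224 input; random arrays stand in for real calibration samples:

```python
import numpy as np
import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("model/fp32_model.xml")  # hypothetical IR path

# Placeholder calibration data: in practice, use a few hundred representative
# samples from your dataset, preprocessed exactly as at inference time.
calibration_data = [
    np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(300)
]
calibration_dataset = nncf.Dataset(calibration_data)

# 8-bit post-training quantization: no retraining or fine-tuning required.
quantized_model = nncf.quantize(model, calibration_dataset)

ov.save_model(quantized_model, "model/int8_model.xml")
```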

Training-time Optimization: involves a suite of advanced methods such as Structured or Unstructured Pruning, as well as Quantization-aware Training. This kind of optimization requires the model's original training framework; for NNCF, that is either PyTorch or TensorFlow.
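
A rough sketch of the quantization-aware training flow with NNCF and PyTorch; the toy model, random data, and short fine-tuning loop are placeholders for a real training setup:

```python
import nncf
import torch
import torch.nn as nn

# Toy stand-ins for a real model and dataset.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten(), nn.Linear(8 * 30 * 30, 10))
data = [(torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))) for _ in range(16)]

# Calibrate and insert quantizers into the PyTorch model.
calibration_dataset = nncf.Dataset(data, transform_func=lambda sample: sample[0])
quantized_model = nncf.quantize(model, calibration_dataset)

# Short fine-tuning loop so the model adapts to the quantization noise.
optimizer = torch.optim.SGD(quantized_model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for images, labels in data:
    optimizer.zero_grad()
    loss = loss_fn(quantized_model(images), labels)
    loss.backward()
    optimizer.step()
```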

If you want to learn more, refer to the NNCF documentation.

Batching#

In inference, batching means grouping several input samples (e.g., images, text queries, or sensor readings) into one tensor and feeding them to the model together. For example:

  • Instead of predicting 1 image → 1 forward pass,

  • you predict 32 images at once → 1 forward pass for all 32.

Batching improves hardware utilization and throughput while reducing per-sample cost and memory transfers. Larger batches increase efficiency but can raise latency and memory use.
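
A minimal sketch of the difference in OpenVINO's Python API, assuming a hypothetical IR classifier with a 3x224x224 input whose batch dimension is made dynamic:

```python
import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("model/classifier.xml")  # hypothetical IR path

# Make the batch dimension dynamic so the compiled model accepts any batch size.
model.reshape([-1, 3, 224, 224])
compiled = core.compile_model(model, "CPU")

# One forward pass per image vs. one forward pass for a batch of 32.
single = np.random.rand(1, 3, 224, 224).astype(np.float32)
batch = np.random.rand(32, 3, 224, 224).astype(np.float32)

result_single = compiled(single)
result_batch = compiled(batch)  # amortizes per-call overhead across all 32 samples
```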

How batching works in OpenVINO#

OpenVINO has an Automatic Batching execution mode that can perform batching on the fly, without programming effort from the user. It can be used directly as a virtual device or enabled as an option for inference on CPU/GPU/NPU (by means of a configuration option or performance hint). To learn more, refer to the OpenVINO docs.
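
A brief sketch of both options in the Python API, assuming a GPU device is available and using a hypothetical model path:

```python
import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()
model = core.read_model("model/classifier.xml")  # hypothetical IR path

# Option 1: compile on the explicit BATCH virtual device with a batch size of 16.
batched = core.compile_model(model, "BATCH:GPU(16)")

# Option 2: ask for throughput-oriented execution and let the plugin
# enable automatic batching implicitly where it helps.
throughput = core.compile_model(
    model, "GPU", {hints.performance_mode: hints.PerformanceMode.THROUGHPUT}
)
```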

How batching works in OVMS#

  • You can set a batch_size parameter when loading a model: if specified, the model will expect batches of that size; if not, the batch size defaults to the model's first input dimension (see the configuration sketch after this list).

  • You can set batch_size = "auto". That makes OVMS dynamically adjust the batch size based on incoming request data.

  • There is support for dynamic batching / demultiplexing (via a DAG‑scheduler) so that input requests of varying batch sizes can be handled without re‑loading the model each time.

  • When serving large language models (LLMs) with continuous batching, the server can aggregate multiple requests and optimize tensor shapes and memory reuse, achieving higher throughput.
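
A minimal sketch of the corresponding model-server configuration, written out from Python for illustration; the model name and base path are placeholders:

```python
import json

# Sketch of an OVMS config.json entry that enables automatic batch size.
config = {
    "model_config_list": [
        {
            "config": {
                "name": "resnet",               # placeholder model name
                "base_path": "/models/resnet",  # placeholder model directory
                "batch_size": "auto",           # reshape to match incoming request batches
            }
        }
    ]
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=4)
```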

To learn more, refer to the OVMS docs.

How batching works in DLStreamer#

  • DL Streamer supports a batch-size property on inference elements (like gvadetect, gvaclassify, gvainference) that groups multiple frames into a single inference call (see the pipeline sketch after this list).

  • If you set batch‑size = N, then the model invocation will collect N frames (or tensors) and run them together, rather than one‑by‑one.

  • In multi‑stream pipelines, where multiple video streams feed the same model, DL Streamer allows cross‑stream batching: frames from different streams can be collated into one batch if they share the same model‑instance‑id.
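
A rough sketch of such a pipeline launched from Python through the GStreamer bindings; the video file, model path, and model-instance-id value are placeholders:

```python
import gi

gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# Batch 8 frames per inference call; model-instance-id lets several pipelines
# share one model instance, which enables cross-stream batching.
pipeline = Gst.parse_launch(
    "filesrc location=input.mp4 ! decodebin ! videoconvert ! "
    "gvadetect model=/models/person-detection.xml device=CPU "
    "batch-size=8 model-instance-id=shared_detector ! "
    "gvafpscounter ! fakesink sync=false"
)

pipeline.set_state(Gst.State.PLAYING)
bus = pipeline.get_bus()
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE, Gst.MessageType.EOS | Gst.MessageType.ERROR)
pipeline.set_state(Gst.State.NULL)
```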

To learn more, refer to the DL Streamer Performance Guide.