Inference Optimization

Once the model is optimized for the given scenario, you should ensure it runs efficiently. In the Open Edge Platform, the OpenVINO toolkit is the backbone of all inference tasks, offering many ways to achieve the best results.

Device selection

Start by selecting the best device for the task, as CPU, GPU, and NPU may fit some usage scenarios better than others. They also differ in the range of supported models. To make an informed decision, you may be interested in the following articles:
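As a starting point, the device choice can be automated with a simple preference order. The sketch below assumes one such order (NPU for low-power sustained inference, GPU for throughput-heavy work, CPU as the universal fallback); adjust it for your own workload. With OpenVINO installed, the list of available devices comes from `ov.Core().available_devices`.

```python
# Sketch: choosing an inference device from a preference order.
# The order below is an assumption for illustration, not a recommendation.
PREFERENCE = ["NPU", "GPU", "CPU"]

def pick_device(available):
    """Return the first preferred device present in `available`."""
    for device in PREFERENCE:
        # OpenVINO device names may carry indices, e.g. "GPU.0", "GPU.1".
        if any(name.split(".")[0] == device for name in available):
            return device
    raise RuntimeError("no supported device found")

# With OpenVINO installed, the real list comes from:
#   import openvino as ov
#   available = ov.Core().available_devices
print(pick_device(["CPU", "GPU.0"]))  # -> GPU
```

Devices also differ in supported layers and precisions, so a model that runs on CPU may not compile for NPU; checking model support per device is part of the same decision.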

Optimizing inference settings

A major choice in every AI application is whether to prioritize Throughput or Latency, as the methods to achieve them are often mutually exclusive. In OpenVINO, you can easily set the general preference to run inference optimized for one or the other, using high-level performance hints:

Low latency means that you get the inference result immediately, without a delay. This approach fits use cases that require real-time input processing, such as video analysis of parking space occupancy - to show a live camera feed with AI annotations, each frame needs to be processed within milliseconds.

High throughput means that you get a large number of results in the allotted time. This approach fits use cases that involve large amounts of data processed at once, such as analyzing a long recording of a CCTV camera to find all frames containing a given car - analyzing such a large number of frames requires maximum efficiency and hardware utilization, but all results may become available together, once all frames are annotated.
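In code, the preference is expressed as a single configuration property at compile time. The `"PERFORMANCE_HINT"` key and its `"LATENCY"`/`"THROUGHPUT"` values are OpenVINO properties; the use-case names and the mapping function below are illustrative assumptions, as is the model path.

```python
# Sketch: choosing an OpenVINO performance hint per use case.
# The use-case names and this mapping are hypothetical examples.
def hint_for(use_case):
    """Map a use case to a performance hint (illustrative mapping)."""
    realtime = {"live-annotation", "parking-occupancy"}
    return "LATENCY" if use_case in realtime else "THROUGHPUT"

config = {"PERFORMANCE_HINT": hint_for("parking-occupancy")}
print(config)  # {'PERFORMANCE_HINT': 'LATENCY'}

# With OpenVINO installed, the hint is applied when compiling the model:
#   import openvino as ov
#   compiled = ov.Core().compile_model("model.xml", "CPU", config)
```

The hint lets the runtime pick low-level settings (stream count, thread pinning, and so on) for the chosen goal, instead of you tuning them per device.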

To see what works best for you and learn more about runtime optimizations in OpenVINO, as well as other tools, check out:

Batching

Batching is an important inference optimization method closely related to both topics mentioned above. It means grouping input samples into larger chunks of data, which increases throughput and reduces both per-sample cost and memory transfers. On the downside, larger batches may increase latency and memory use, so batching is not suited to every scenario. Batching is, of course, available in all relevant platform components:
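The grouping step itself is straightforward; a minimal sketch, independent of any particular inference API:

```python
# Sketch: grouping input samples into fixed-size batches before inference.
# Larger batches improve hardware utilization at the cost of latency.
def batches(samples, batch_size):
    """Yield consecutive chunks of at most `batch_size` samples."""
    for i in range(0, len(samples), batch_size):
        yield samples[i:i + batch_size]

frames = list(range(10))            # stand-in for decoded video frames
grouped = list(batches(frames, 4))
print(grouped)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Note the last batch may be smaller than the rest; depending on the model's input shape, it may need to be padded or submitted separately.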