Model Optimization
Original deep learning model architectures tend to be large and complex. In many cases, a smaller, simplified version does the job equally well while performing much better: an 8-bit model uses a fraction of the memory required by an FP32 one. That is why you should always use a model optimized for your use case.
To do so, use Neural Network Compression Framework (NNCF), a collection of optimization algorithms that make your models smaller and faster. To learn more, check out the NNCF documentation and articles on:
Post-training Quantization

The easiest way to optimize a model, as it requires no retraining or fine-tuning; it simply reduces the model size. Going from an FP32 model to a quantized INT8 one greatly improves file size, memory footprint, throughput, and latency. It may cause a drop in accuracy, though, so you should check whether this accuracy-performance tradeoff is acceptable.
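The arithmetic behind INT8 quantization can be illustrated with a small NumPy sketch. This is not the NNCF API (in NNCF, post-training quantization is exposed through `nncf.quantize`); the function names below are illustrative, showing how an FP32 tensor is mapped to 8-bit integers plus a scale and zero point:

```python
import numpy as np

np.random.seed(0)

def quantize_int8(x):
    # Affine (asymmetric) quantization: map the FP32 range onto [0, 255].
    scale = (x.max() - x.min()) / 255.0
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Recover an FP32 approximation of the original tensor.
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)

# The INT8 copy is 4x smaller than the FP32 original, and the round-trip
# error is bounded by one quantization step - this is the accuracy cost.
assert q.nbytes == weights.nbytes // 4
assert np.max(np.abs(restored - weights)) <= scale
```

The bounded round-trip error is exactly why accuracy may drop: every weight and activation is replaced by its nearest representable 8-bit value.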
Weight Compression

An easy-to-use method targeting Large Language Models. It is a type of quantization that compresses only part of the model: its weights, not its activations. It provides increased performance with relatively little impact on accuracy.
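The weights-only idea can be sketched in NumPy (in NNCF this is exposed through `nncf.compress_weights`; the toy layer and function names below are illustrative, not the library API). Weights are stored in INT8 with a per-output-channel scale, while activations stay in FP32:

```python
import numpy as np

np.random.seed(0)

def compress_weights_int8(w):
    # Per-output-channel symmetric INT8 quantization of the weights only.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.round(w / scale).clip(-127, 127).astype(np.int8)
    return q, scale

# A toy linear layer: weights live in INT8, activations remain FP32.
w = np.random.randn(8, 16).astype(np.float32)
q, scale = compress_weights_int8(w)

x = np.random.randn(16).astype(np.float32)          # FP32 activation
y_ref = w @ x                                       # original layer
y_compressed = (q.astype(np.float32) * scale) @ x   # dequantize on the fly

# Weights take 4x less memory; the output error is bounded by half a
# quantization step per weight, accumulated over the input.
assert q.nbytes == w.nbytes // 4
assert np.max(np.abs(y_ref - y_compressed)) <= 0.5 * scale.max() * np.abs(x).sum() + 1e-4
```

Because activations are untouched, only the stored weights carry quantization error, which is why the accuracy impact tends to be smaller than with full quantization.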
Training-time Optimization

A more complex and time-consuming method, involving multiple algorithms executed while the model is retrained. It also requires the model's original framework; for NNCF, that is either PyTorch or TensorFlow. With features such as Structured and Unstructured Pruning and Quantization-aware Training, it produces just the model that fits your needs, optimally balancing performance and accuracy.
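The core trick of quantization-aware training can be sketched in a few lines of NumPy (this is a conceptual toy, not the NNCF training pipeline): the forward pass uses "fake-quantized" weights so the loss sees INT8 rounding error, while the backward pass treats quantization as the identity (the straight-through estimator), letting gradient descent adapt the weights around that error:

```python
import numpy as np

np.random.seed(0)

def fake_quant(w, num_bits=8):
    # Quantize-dequantize so training "sees" the INT8 rounding error.
    levels = 2 ** (num_bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale).clip(-levels, levels) * scale

# One-neuron least-squares toy problem trained with the
# straight-through estimator (STE).
w = np.random.randn(4).astype(np.float32)
x = np.random.randn(4).astype(np.float32)
target = 1.0
lr = 0.05
for _ in range(200):
    y = fake_quant(w) @ x          # forward pass with quantization noise
    grad = 2 * (y - target) * x    # backward pass ignores the rounding (STE)
    w -= lr * grad

# The trained weights hit the target even after quantization is applied.
assert abs(fake_quant(w) @ x - target) < 0.1
```

Because the weights are optimized with the quantization error already in the loop, the final quantized model typically loses less accuracy than one quantized after training, at the cost of a full retraining run.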