TFLM Quantizations

In this study, we evaluate the performance of various quantization schemes using the TensorFlow Lite for Microcontrollers (TFLM) platform. We chose TFLM because it is the only framework that supports the majority of available quantization methods. (1)

  1. 🙋‍♂️ TFLM supports basic, dynamic, int8, int8 only, 16x8, and 16x8 only quantizations, but does not support float16. In contrast, Edge Impulse and Renesas eAI Translator only support basic and int8 only quantizations, while Ekkono supports only basic quantization.
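For reference, the quantization variants compared in this study roughly correspond to the following TFLite converter settings. This is a sketch only: `model` and `rep_data` (a representative-dataset generator) are assumed to exist, and the flag names come from the `tf.lite` API, so they should be verified against the TensorFlow version in use.

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)

# basic: plain float32 conversion -- no extra flags.

# dynamic: int8 weights, activations computed in float.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# int8 only: int8 weights AND activations, integer-only ops and I/O.
# Requires a representative dataset to calibrate activation ranges.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = rep_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

# 16x8 only: int16 activations with int8 weights.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = rep_data
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8
]

tflite_model = converter.convert()
```

Each scheme would normally use its own converter instance; the settings are shown together only for comparison.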

All FC and CNN models were tested on the NUCLEO-L4R5ZI and RenesasRX65N boards.

CMSIS-NN

The CMSIS-NN library can accelerate the execution of quantized models (1). This library is designed for ARM Cortex-M microcontrollers and was used with the NUCLEO-L4R5ZI board. Renesas has developed its own CMSIS-NN library for RX microcontrollers, but it was not employed in our evaluations.

  1. 🙋‍♂️ "CMSIS-NN supports int8 and int16 activations and int8 weights. It also supports int4 packed weights to some extent." reference
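In the upstream tflite-micro tree, the CMSIS-NN kernels are enabled at build time through a Makefile flag. A hedged sketch (the target and architecture values below are illustrative and should be adapted to the actual board and source tree):

```shell
# Build the TFLM static library with CMSIS-NN optimized kernels
# for a Cortex-M target; adjust TARGET_ARCH for your core.
make -f tensorflow/lite/micro/tools/make/Makefile \
     TARGET=cortex_m_generic \
     TARGET_ARCH=cortex-m4 \
     OPTIMIZED_KERNEL_DIR=cmsis_nn \
     microlite
```

Without `OPTIMIZED_KERNEL_DIR=cmsis_nn`, the portable reference kernels are used instead, which is the configuration the RenesasRX65N measurements correspond to.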


Models

[Figure: FC parameters and MACs]
[Figure: CNN parameters and MACs]

Error

[Figure: FC - NUCLEO-L4R5ZI]
[Figure: FC - RenesasRX65N]
[Figure: CNN - NUCLEO-L4R5ZI]
[Figure: CNN - RenesasRX65N]

Execution Time

[Figure: FC - NUCLEO-L4R5ZI]
[Figure: FC - RenesasRX65N]
[Figure: CNN - NUCLEO-L4R5ZI]
[Figure: CNN - RenesasRX65N]

Flash Size

[Figure: FC - NUCLEO-L4R5ZI]
[Figure: FC - RenesasRX65N]
[Figure: CNN - NUCLEO-L4R5ZI]
[Figure: CNN - RenesasRX65N]

RAM Usage

[Figure: FC - NUCLEO-L4R5ZI]
[Figure: FC - RenesasRX65N]
[Figure: CNN - NUCLEO-L4R5ZI]
[Figure: CNN - RenesasRX65N]

Summary

  • Model Correctness:

    • The float16 quantization scheme is not supported by any of the tested frameworks and should be disregarded.

    • Dynamic quantization also lacks proper support (1), so we recommend avoiding it.

      1. 🙋‍♂️ In most cases, either the models fail to run on the boards, or they have an unacceptable error. Even setting the errors aside, dynamic quantization shows no advantage over the other schemes.
    • The RenesasRX65N board is unable to run some of the models. (1)

      1. 🙋‍♂️ The RenesasRX65N board cannot run the basic, int8 only, and 16x8 only versions of the FC_1 and FC_2 models. The program halts during their execution.
    • The error rates of all other quantization schemes are acceptable. (1)

      1. 🙋‍♂️ basic is error-free; the int8 and 16x8 variants have negligible errors, with 16x8 performing better than int8.
  • Execution Time: The results are mixed, and no single quantization scheme is consistently faster.

    • FC Models: The int8 only and 16x8 only variants are faster than the rest.
      • Small Models: 16x8 only is the best.
      • Large Models: int8 only is the best.
    • CNN Models: The basic model is very slow (1). The int8 and 16x8 variants are close to each other. (2)

      1. 🙋‍♂️ Because of using the CMSIS-NN, the quantized models (especially the CNN quantized models) experience a significant speedup on the NUCLEO-L4R5ZI board. This library is not utilized on the RenesasRX65N board.
      2. 🙋‍♂️ They still roughly follow the same small-model/large-model pattern as the FC models.
  • Flash Size: The basic model's flash footprint grows fastest as model size increases (1); the other schemes are nearly identical. (2)

    1. 🙋‍♂️ All quantization schemes start with relatively similar flash sizes, but as the model scales, the basic version's flash size grows at four bytes per parameter, whereas the other variants grow at roughly one byte per parameter.
    2. 🙋‍♂️ The int8 only and 16x8 only variants are slightly more efficient in terms of flash usage.
  • RAM Usage: Results vary, but int8 only always matches or beats every other scheme.

  • Conclusion: The choice of quantization scheme depends on many factors; however, int8 only quantization is a good choice in most cases. (1)

    1. 🙋‍♂️ The basic and int8 only variants are the two model types that will be used in other studies.
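The error and flash-size observations above can be sketched numerically. The snippet below (illustrative only, not taken from the study) shows affine quantization round-trip error on an int8 grid versus an int16 grid, which is why 16x8 activations err less than int8 activations, and the 4-bytes-vs-1-byte-per-parameter flash arithmetic. All names and the example values are hypothetical.

```python
def quantize(x, scale, zero_point, qmin, qmax):
    """Affine quantization: q = clamp(round(x / scale) + zero_point)."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

# The same range [-1, 1) mapped onto an int8 grid vs an int16 grid.
scale_int8 = 2.0 / 256     # 256 representable levels
scale_int16 = 2.0 / 65536  # 65536 representable levels

x = 0.1234  # an arbitrary activation value
x8 = dequantize(quantize(x, scale_int8, 0, -128, 127), scale_int8, 0)
x16 = dequantize(quantize(x, scale_int16, 0, -32768, 32767), scale_int16, 0)

print(abs(x - x8))   # coarse int8 grid -> larger round-trip error
print(abs(x - x16))  # fine int16 grid -> much smaller error

# Flash arithmetic behind the basic-vs-quantized gap:
params = 100_000          # hypothetical parameter count
flash_basic = params * 4  # float32 weights: 4 bytes each
flash_int8 = params * 1   # int8 weights: 1 byte each
```

Weights are stored as int8 in all quantized variants (including 16x8), which is why their flash sizes stay close to one another while the basic float32 model diverges.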