
TFLM Pruning and Clustering

This study explores the impact of pruning and clustering on the performance of neural networks.

FC and CNN models were evaluated on the NUCLEO-L4R5ZI board, using 50% sparsity for pruning and 16 centroids for clustering.
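The experiments themselves would normally be run through the TensorFlow Model Optimization Toolkit, but the two transforms are easy to illustrate directly. The sketch below (plain numpy, illustrative names and sizes) shows what 50% magnitude pruning and 16-centroid weight clustering each do to a weight tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(128, 64)).astype(np.float32)

def prune_low_magnitude(w, sparsity=0.5):
    """Unstructured pruning: zero the smallest-magnitude fraction of weights."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < threshold, 0.0, w).astype(w.dtype)

def cluster_weights(w, n_centroids=16, n_iters=20):
    """Clustering: snap every weight to one of n_centroids shared values
    (a simple 1-D k-means over the flattened weight tensor)."""
    flat = w.ravel()
    centroids = np.linspace(flat.min(), flat.max(), n_centroids)
    for _ in range(n_iters):
        assign = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for k in range(n_centroids):
            members = flat[assign == k]
            if members.size:
                centroids[k] = members.mean()
    assign = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    return centroids[assign].reshape(w.shape).astype(w.dtype)

pruned = prune_low_magnitude(weights, sparsity=0.5)
clustered = cluster_weights(weights, n_centroids=16)

print("sparsity:", np.mean(pruned == 0.0))          # ~0.5
print("unique values:", np.unique(clustered).size)  # at most 16
```

Note that both outputs keep the original dense shape and dtype, which is exactly what matters for the results below.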


Results

Models

[Figure: FC parameters and MACs]
[Figure: CNN parameters and MACs]

Error

[Figure: FC error]
[Figure: CNN error]

Execution Time

[Figure: FC execution time]
[Figure: CNN execution time]

Flash Size

[Figure: FC flash size]
[Figure: CNN flash size]

RAM Usage

[Figure: FC RAM usage]
[Figure: CNN RAM usage]

Summary

Despite the popularity of pruning and clustering in reducing the size of neural networks, our results demonstrate that these techniques don't improve model performance and, in fact, they increase the error rate. This happens because the pruned or clustered weights still need to be stored in memory (occupying the same space as the original weights), and operations involving these weights still need to be executed (e.g., x * 0 takes just as long as x * y).
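This point is easy to verify directly: a tensor whose weights have been zeroed in place keeps its dense shape, so it occupies exactly the same memory, and a dense kernel performs one multiply-accumulate per stored weight regardless of how many of them are zero. A minimal numpy sketch (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dense = rng.normal(size=(256, 256)).astype(np.float32)

# Unstructured pruning: half the weights become zero, but the tensor
# keeps its dense shape, so it occupies exactly the same flash/RAM.
mask = rng.random(dense.shape) < 0.5
pruned = np.where(mask, 0.0, dense).astype(np.float32)

print(dense.nbytes, pruned.nbytes)  # 262144 262144 -- identical

# A dense kernel still executes every multiply-accumulate:
# x * 0 is scheduled and run just like x * y.
macs_dense = dense.size   # one MAC per stored weight
macs_pruned = pruned.size # unchanged: zeros are not skipped
print(macs_dense, macs_pruned)  # 65536 65536
```

Exploiting the zeros would require a sparse storage format and a sparse kernel, neither of which the default TFLM dense kernels provide.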

One solution is to use structured pruning, which essentially involves designing a new model architecture—such as removing specific neurons or channels. Alternatively, you could use a hardware accelerator optimized for sparse weights or clustering. Another option is to “unfold” the matrix multiplication and eliminate unnecessary operations, though this approach requires a large amount of flash memory, making it impractical for most applications.
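Unlike the unstructured case, structured pruning physically shrinks the tensor, so the savings show up in flash, RAM, and MACs without any special kernel support. A minimal sketch of dropping whole output neurons by L2 norm (the 32-to-16 split is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32)).astype(np.float32)  # 32 output neurons

# Structured pruning: drop the 16 output neurons with the smallest L2 norm.
norms = np.linalg.norm(w, axis=0)
keep = np.sort(np.argsort(norms)[16:])  # indices of the 16 strongest neurons
w_structured = w[:, keep]

print(w.shape, "->", w_structured.shape)    # (64, 32) -> (64, 16)
print(w.nbytes, "->", w_structured.nbytes)  # 8192 -> 4096 bytes
```

The price is that this amounts to a new, narrower architecture, so the model typically needs fine-tuning or retraining to recover accuracy.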

In conclusion, we advise against using unstructured pruning or clustering unless you have access to a hardware accelerator that specifically supports these techniques.