
TFLM Pruning and Clustering

This study explores the impact of pruning and clustering on the performance of neural networks.

FC and CNN models were evaluated on the NUCLEO-L4R5ZI board, using 50% sparsity for pruning and 16 centroids for clustering.
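The experiments themselves would normally be run through the TensorFlow Model Optimization Toolkit, but the two transforms are easy to illustrate directly. The sketch below (plain numpy, illustrative names and sizes) shows what 50% magnitude pruning and 16-centroid weight clustering each do to a weight tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(128, 64)).astype(np.float32)

def prune_low_magnitude(w, sparsity=0.5):
    """Unstructured pruning: zero the smallest-magnitude fraction of weights."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < threshold, 0.0, w).astype(w.dtype)

def cluster_weights(w, n_centroids=16, n_iters=20):
    """Clustering: snap every weight to one of n_centroids shared values
    (a simple 1-D k-means over the flattened weight tensor)."""
    flat = w.ravel()
    centroids = np.linspace(flat.min(), flat.max(), n_centroids)
    for _ in range(n_iters):
        assign = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for k in range(n_centroids):
            members = flat[assign == k]
            if members.size:
                centroids[k] = members.mean()
    assign = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    return centroids[assign].reshape(w.shape).astype(w.dtype)

pruned = prune_low_magnitude(weights, sparsity=0.5)
clustered = cluster_weights(weights, n_centroids=16)

print("sparsity:", np.mean(pruned == 0.0))          # ~0.5
print("unique values:", np.unique(clustered).size)  # at most 16
```

Note that both outputs keep the original dense shape and dtype, which is exactly what matters for the results below.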


Results

Models

[Figure: FC parameters and MACs]
[Figure: CNN parameters and MACs]

Error

[Figure: FC error]
[Figure: CNN error]

Execution Time

[Figure: FC execution time]
[Figure: CNN execution time]

Flash Size

[Figure: FC flash size]
[Figure: CNN flash size]

RAM Usage

[Figure: FC RAM usage]
[Figure: CNN RAM usage]

Summary

Despite the popularity of pruning and clustering in reducing the size of neural networks, our results demonstrate that these techniques don't improve model performance and, in fact, they increase the error rate. This happens because the pruned or clustered weights still need to be stored in memory (occupying the same space as the original weights), and operations involving these weights still need to be executed (e.g., x * 0 takes just as long as x * y).
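This point is easy to verify directly: a tensor whose weights have been zeroed in place keeps its dense shape, so it occupies exactly the same memory, and a dense kernel performs one multiply-accumulate per stored weight regardless of how many of them are zero. A minimal numpy sketch (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dense = rng.normal(size=(256, 256)).astype(np.float32)

# Unstructured pruning: half the weights become zero, but the tensor
# keeps its dense shape, so it occupies exactly the same flash/RAM.
mask = rng.random(dense.shape) < 0.5
pruned = np.where(mask, 0.0, dense).astype(np.float32)

print(dense.nbytes, pruned.nbytes)  # 262144 262144 -- identical

# A dense kernel still executes every multiply-accumulate:
# x * 0 is scheduled and run just like x * y.
macs_dense = dense.size   # one MAC per stored weight
macs_pruned = pruned.size # unchanged: zeros are not skipped
print(macs_dense, macs_pruned)  # 65536 65536
```

Exploiting the zeros would require a sparse storage format and a sparse kernel, neither of which the default TFLM dense kernels provide.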

One solution is to use structured pruning, which essentially involves designing a new model architecture—such as removing specific neurons or channels. Alternatively, you could use a hardware accelerator optimized for sparse weights or clustering. Another option is to “unfold” the matrix multiplication and eliminate unnecessary operations, though this approach requires a large amount of flash memory, making it impractical for most applications.
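Unlike the unstructured case, structured pruning physically shrinks the tensor, so the savings show up in flash, RAM, and MACs without any special kernel support. A minimal sketch of dropping whole output neurons by L2 norm (the 32-to-16 split is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32)).astype(np.float32)  # 32 output neurons

# Structured pruning: drop the 16 output neurons with the smallest L2 norm.
norms = np.linalg.norm(w, axis=0)
keep = np.sort(np.argsort(norms)[16:])  # indices of the 16 strongest neurons
w_structured = w[:, keep]

print(w.shape, "->", w_structured.shape)    # (64, 32) -> (64, 16)
print(w.nbytes, "->", w_structured.nbytes)  # 8192 -> 4096 bytes
```

The price is that this amounts to a new, narrower architecture, so the model typically needs fine-tuning or retraining to recover accuracy.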

In conclusion, we advise against using unstructured pruning or clustering unless you have access to a hardware accelerator that specifically supports these techniques.