Machine Learning Model Compression Techniques - Reducing Size and Improving Performance

Author: Sadman Kabir Soumik

There are four main approaches you can consider for model compression:

  1. Quantization
  2. Pruning
  3. Knowledge Distillation
  4. Low-Rank Factorization

Quantization

Quantization is the most general and commonly used model compression method. Quantization reduces a model’s size by using fewer bits to represent its parameters. By default, most software packages use 32 bits to represent a float number (single precision floating point). If a model has 100M parameters and each requires 32 bits to store, it’ll take up 400 MB.

If we use 16 bits to represent a number, we’ll reduce the memory footprint by half. Using 16 bits to represent a float is called half precision. Instead of using floats, you can have a model entirely in integers; each integer takes only 8 bits to represent. This method is also known as “fixed point.”

In fixed-point quantization, model parameters and activations are represented using a fixed number of bits, rather than the full precision of floating-point numbers. This allows for a trade-off between model size and performance, as using fewer bits can reduce the model's size but may also degrade its accuracy.
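
To make the trade-off concrete, here is a minimal NumPy sketch (not part of any TensorFlow API) of this idea: an affine mapping from float32 values to 8-bit integers using a scale and zero point. The variable names are purely illustrative.

import numpy as np

# Example float32 weights
weights = np.random.randn(4, 4).astype(np.float32)

# Affine (asymmetric) mapping from the observed float range to the int8 range
w_min, w_max = weights.min(), weights.max()
scale = (w_max - w_min) / 255.0              # size of one int8 step in float units
zero_point = np.round(-w_min / scale) - 128  # int8 value that represents 0.0

# Quantize: float32 -> int8 (4x smaller), then dequantize back to approximate floats
q = np.clip(np.round(weights / scale + zero_point), -128, 127).astype(np.int8)
w_hat = (q.astype(np.float32) - zero_point) * scale

print("memory:", weights.nbytes, "bytes ->", q.nbytes, "bytes")
print("max rounding error:", np.abs(weights - w_hat).max())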

To apply quantization to a machine learning model in TensorFlow, you can use the TensorFlow Lite converter (tf.lite.TFLiteConverter), which supports post-training quantization, or the TensorFlow Model Optimization Toolkit for quantization-aware training.

Here is an example of how you might use the TFLiteConverter to apply post-training quantization to a simple TensorFlow model:

import tensorflow as tf

# Build a simple model
inputs = tf.keras.Input(shape=(784,))
x = tf.keras.layers.Dense(128, activation='relu')(inputs)
x = tf.keras.layers.Dense(128, activation='relu')(x)
outputs = tf.keras.layers.Dense(10)(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)

# Convert the model to a quantized version
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_model = converter.convert()

In this example, we use the TFLiteConverter class from the tf.lite module to convert the original model to a quantized version. We enable the default optimization settings, which include quantization, and then call the convert method to generate the quantized model.

Once you have the quantized model, you can deploy it for inference by loading it with the TensorFlow Lite interpreter.
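
As a sketch of what that looks like, the quantized flatbuffer returned by convert can be loaded with the TensorFlow Lite interpreter; the dummy input below is only a placeholder.

import numpy as np
import tensorflow as tf

# Load the quantized flatbuffer produced by converter.convert()
interpreter = tf.lite.Interpreter(model_content=quantized_model)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Run inference on a dummy batch of one 784-dimensional example
sample = np.random.rand(1, 784).astype(np.float32)
interpreter.set_tensor(input_details[0]['index'], sample)
interpreter.invoke()
predictions = interpreter.get_tensor(output_details[0]['index'])
print(predictions.shape)  # (1, 10)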

Pruning

Pruning was originally used for decision trees, where you remove branches of the tree that are non-critical or redundant for classification. As neural networks gained wider adoption, people realized that neural networks are over-parameterized and began to look for ways to reduce the workload caused by the extra parameters.

There are several different ways to perform pruning, but one common approach is called "weight pruning." In weight pruning, the goal is to remove as many connections (i.e., weights) from the model as possible, while maintaining a certain level of performance. This is typically done by first training the model to convergence, and then pruning a certain percentage of the lowest-magnitude weights in each layer. The pruned weights are then set to zero, effectively removing them from the model.
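
To illustrate the idea independently of any framework, here is a small NumPy sketch that zeroes out the lowest-magnitude weights of a single layer at a chosen sparsity level; the function name is made up for illustration.

import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask

layer_weights = np.random.randn(128, 128)
pruned = magnitude_prune(layer_weights, sparsity=0.9)
print("fraction of zeros:", np.mean(pruned == 0))  # roughly 0.9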

Another approach to pruning is called "structured pruning," which involves removing entire units (e.g., neurons or filters) or groups of units from the model. This can be done using a similar approach to weight pruning, where the least important units (for example, those whose weights have the smallest norms) are identified and removed.
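
A minimal sketch of the structured variant, again in plain NumPy: entire output neurons (columns of a dense layer's weight matrix) are ranked by their L2 norm and only the strongest are kept, so the matrix actually shrinks instead of just becoming sparse. In a real network you would also remove the corresponding rows from the next layer's weights.

import numpy as np

def prune_neurons(weights, keep_fraction):
    """Keep only the output neurons (columns) with the largest L2 norm."""
    norms = np.linalg.norm(weights, axis=0)       # one norm per output neuron
    n_keep = int(weights.shape[1] * keep_fraction)
    keep = np.sort(np.argsort(norms)[-n_keep:])   # indices of the strongest neurons
    return weights[:, keep]

layer_weights = np.random.randn(784, 128)
smaller = prune_neurons(layer_weights, keep_fraction=0.5)
print(layer_weights.shape, "->", smaller.shape)   # (784, 128) -> (784, 64)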

To apply pruning to a machine learning model in TensorFlow, you can use the TensorFlow Model Optimization Toolkit (the tensorflow_model_optimization package, commonly imported as tfmot). Its tfmot.sparsity.keras module provides functions and classes for pruning model weights and for managing the resulting pruned models.

Here is an example of how you might use tfmot.sparsity.keras to prune a simple TensorFlow model:

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Build a simple model
inputs = tf.keras.Input(shape=(784,))
x = tf.keras.layers.Dense(128, activation='relu')(inputs)
x = tf.keras.layers.Dense(128, activation='relu')(x)
outputs = tf.keras.layers.Dense(10)(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)

# Apply magnitude-based pruning to the model
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.50,
        final_sparsity=0.90,
        begin_step=2000,
        end_step=4000)
}
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)

In this example, we use the prune_low_magnitude function from tfmot.sparsity.keras to prune the weights in the model. We specify a pruning schedule using the PolynomialDecay class, which defines how the sparsity (i.e., the fraction of weights set to zero) changes over training. In this case, the sparsity starts at 50% and ramps up to 90% between training steps 2000 and 4000.

Once you have a pruned model, you can use it just like any other TensorFlow model, by loading it and using it for inference or further training. You may also need to fine-tune the pruned model to restore its performance to the desired level.
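
With the Model Optimization Toolkit, for example, that fine-tuning step looks roughly like the sketch below: the UpdatePruningStep callback advances the pruning schedule during training, and strip_pruning removes the pruning wrappers before export. The training data variables are placeholders.

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Fine-tune the pruned model; the callback advances the pruning schedule each step
pruned_model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])
pruned_model.fit(x_train, y_train,  # placeholder training data
                 epochs=2,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers, leaving a plain (sparse) Keras model for export
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)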

Knowledge Distillation

Knowledge distillation is a technique for compressing machine learning models by training a smaller model to mimic the behavior of a larger, pre-trained model. The smaller model, also called the student model, is trained to reproduce the outputs of the larger, pre-trained model, known as the teacher model, on a set of training data.

In knowledge distillation, the student model is typically trained using a combination of the true labels for the training data and the output probabilities produced by the teacher model. This allows the student model to learn not only from the true labels, but also from the knowledge encoded in the teacher model's predictions.
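
A minimal sketch of such a combined loss in TensorFlow: the student is pushed toward the teacher's softened probabilities (scaled by a temperature) via KL divergence, blended with the usual cross-entropy on the true labels. The alpha weighting and temperature are illustrative hyperparameters, not values from any particular paper.

import tensorflow as tf

def distillation_loss(y_true, teacher_logits, student_logits,
                      temperature=2.0, alpha=0.1):
    # Hard loss: ordinary cross-entropy against the true labels
    hard_loss = tf.keras.losses.sparse_categorical_crossentropy(
        y_true, student_logits, from_logits=True)

    # Soft loss: KL divergence between softened teacher and student distributions
    teacher_probs = tf.nn.softmax(teacher_logits / temperature)
    student_probs = tf.nn.softmax(student_logits / temperature)
    soft_loss = tf.keras.losses.kl_divergence(teacher_probs, student_probs)

    # temperature**2 keeps the soft-loss gradients on a comparable scale
    return alpha * hard_loss + (1.0 - alpha) * soft_loss * temperature ** 2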

Once the student model is trained, it can be used in place of the teacher model, providing a more efficient and compact alternative. The student model may not perform as well as the teacher model on the training data, but it should be able to generalize to unseen data in a similar way.

One example of a distilled network used in production is DistilBERT, which reduces the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster.

Low-Rank Factorization

Low-rank factorization is a technique for compressing machine learning models by approximating a large, dense weight matrix with the product of much smaller, low-rank factors. This can be done using methods such as singular value decomposition (SVD) or other matrix factorization techniques.

In general, a matrix of rank r can be written as the product of two much smaller matrices: an m × n matrix of rank r factors into an m × r matrix times an r × n matrix. When r is much smaller than m and n, the two factors contain far fewer parameters than the original matrix, so a low-rank approximation can reduce the number of parameters in the model significantly while maintaining a similar level of performance.
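
As a concrete sketch, a truncated SVD in NumPy can replace one large weight matrix with two thin factors; the rank of 64 below is chosen arbitrarily for illustration (a random matrix is not actually low-rank, so the approximation error reported here is only indicative).

import numpy as np

# A large dense weight matrix, e.g. from a fully connected layer
W = np.random.randn(1024, 1024)
rank = 64

# Truncated SVD: keep only the top `rank` singular values
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * S[:rank]   # 1024 x 64
B = Vt[:rank, :]             # 64 x 1024

W_approx = A @ B
print("original parameters:", W.size)            # 1,048,576
print("factored parameters:", A.size + B.size)   # 131,072 (8x fewer)
print("relative error:", np.linalg.norm(W - W_approx) / np.linalg.norm(W))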

For example, by using a number of strategies, including replacing many 3 × 3 convolutions with 1 × 1 convolutions, SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50 times fewer parameters.


References: Book - Designing Machine Learning Systems by Chip Huyen
