Understanding the Role of Data Normalization and Standardization in Machine Learning

Why do we scale features?

Not every dataset in machine learning requires feature scaling; it is only needed when features span very different ranges.

For example, consider a dataset containing two features, age (x1) and income (x2), where age ranges from 0–100 while income ranges from roughly 20,000–500,000. Income is about 1,000 times larger than age in magnitude, so the two features sit on very different scales. When we do further analysis, such as multivariate linear regression, income will influence the result more because of its larger values. But this doesn't necessarily mean it is more important as a predictor.

Because the features do not have similar ranges of values, gradient descent may oscillate back and forth and take a long time before it finally finds its way to the global/local minimum. To overcome this problem, we scale the data: we ensure that the different features take on similar ranges of values so that gradient descent can converge more quickly.
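To see the effect, here is a minimal NumPy sketch (the data, coefficients, and learning rate are all illustrative, not from the original post): with raw age/income columns, any learning rate small enough to keep the income weight stable barely moves the age weight, while min-max scaled features converge with one moderate learning rate.

```python
import numpy as np

# Illustrative data: age in [0, 100], income in [20_000, 500_000]
rng = np.random.default_rng(0)
age = rng.uniform(0, 100, 200)
income = rng.uniform(20_000, 500_000, 200)
y = 0.5 * age + 0.0001 * income + rng.normal(0, 1, 200)

def gradient_descent(X, y, lr, steps=1_000):
    """Plain batch gradient descent on mean squared error."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

X_raw = np.column_stack([age, income])
# Raw features: lr must be tiny (~1e-11) or the loss blows up,
# and at that rate the age weight barely moves in 1,000 steps.
w_raw = gradient_descent(X_raw, y, lr=1e-11)

# Min-max scaled features: one moderate lr works for both weights.
X_scaled = (X_raw - X_raw.min(axis=0)) / np.ptp(X_raw, axis=0)
w_scaled = gradient_descent(X_scaled, y, lr=0.5)
```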

When to use Normalization?

Normalization typically means rescaling the values into a range of [0,1].

Normalization is an excellent technique to use when you do not know the distribution of your data or when you know the distribution is not Gaussian (a bell curve). Normalization is useful when your data has varying scales, and the algorithm you are using does not make assumptions about the distribution of your data, such as k-nearest neighbors and artificial neural networks.
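As a concrete sketch, this is how min-max normalization (x' = (x - min) / (max - min), per feature) looks with scikit-learn's MinMaxScaler; the age/income values are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative age/income columns from the example above
X = np.array([[25,  40_000],
              [50, 120_000],
              [75, 500_000]], dtype=float)

scaler = MinMaxScaler()           # default feature_range=(0, 1)
X_norm = scaler.fit_transform(X)  # each column rescaled to [0, 1]
print(X_norm)
```

In practice, fit the scaler on the training set only and reuse it to transform validation/test data, so no information from the held-out sets leaks into the scaling parameters.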

When to use Standardization?

Standardization typically means rescaling data to have a mean of 0 and a standard deviation of 1 (unit variance).

Standardization assumes that your data has a Gaussian (bell curve) distribution. This does not strictly have to be true, but the technique is more effective if your attribute distribution is Gaussian. Standardization is useful when your data has varying scales, and your algorithm makes assumptions about your data having a Gaussian distribution, such as linear regression, logistic regression, and linear discriminant analysis.
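A matching sketch with scikit-learn's StandardScaler (z = (x - mean) / std, per feature), again with illustrative numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[25,  40_000],
              [50, 120_000],
              [75, 500_000]], dtype=float)

scaler = StandardScaler()
X_std = scaler.fit_transform(X)  # per-feature z-scores
print(X_std.mean(axis=0))        # ~0 for each feature
print(X_std.std(axis=0))         # 1 for each feature
```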

Author: Sadman Kabir Soumik
