Is it okay if I normalize the whole dataset together?

I am trying to train an RNN that uses LSTM cells.

In the data preprocessing step, when I normalize (feature-scale) the dataset, I normalize the whole dataset at once. However, I have serious doubts: some of the input columns are dominant over others, and that could affect the network training. Here is an example of the dataset for better understanding:

[Figure: example part of the dataset]

As you can see from the figure above, the differently colored columns are much greater or smaller than the others.

So, my question is: is it okay to normalize the whole dataset together, or should I normalize each column individually?

Upvotes: 0

Views: 1423

Answers (1)

willk

Reputation: 3817

Feature scaling is done on a per column basis. The operations are applied to one feature at a time because the objective is to get the different features into similar ranges so the unit of the feature does not impact learning (source). You are right that the magnitude of features can affect training and therefore scaling is considered a best practice especially when training neural networks.

Typically this is done in one of two ways:

  • Rescaling: making the values of a feature fall into a range, for example from 0 to 1. Min-Max rescaling accomplishes this by:

    x' = (x - min(x)) / (max(x) - min(x))

  • Standardization: subtracting the mean and dividing by the standard deviation. The new feature will have a mean of 0 and a standard deviation of 1.

    x' = (x - mean(x)) / std(x)
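To make the per-column idea concrete, here is a minimal sketch of both formulas applied with NumPy on a small made-up array (the data values are invented for illustration). Note that `min`, `max`, `mean`, and `std` are all computed along `axis=0`, i.e. separately for each column:

```python
import numpy as np

# Hypothetical data: two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# Min-max rescaling, computed per column (axis=0)
x_min, x_max = X.min(axis=0), X.max(axis=0)
rescaled = (X - x_min) / (x_max - x_min)

# Standardization, also per column
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

print(rescaled[:, 0])            # first column mapped into [0, 1]
print(standardized.mean(axis=0))  # each column now has mean ~0
```

If you instead computed the statistics over the whole array at once (no `axis` argument), the large-valued column would dominate the minimum, maximum, mean, and standard deviation, and the small-valued column would barely change — exactly the problem described in the question.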

Rescaling can be done in Python using Scikit-Learn's MinMaxScaler. Standardization can be done in Python using Scikit-Learn's StandardScaler.
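Both Scikit-Learn scalers already work column-by-column, so you can pass the whole 2-D dataset and each feature is scaled independently. A small sketch with made-up data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: two features on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])

# fit_transform learns the statistics of each column separately,
# then applies the corresponding transform to that column
rescaled = MinMaxScaler().fit_transform(X)       # each column in [0, 1]
standardized = StandardScaler().fit_transform(X)  # each column: mean 0, std 1

print(rescaled)
print(standardized.mean(axis=0), standardized.std(axis=0))
```

In a real training pipeline you would `fit` the scaler on the training set only and reuse the fitted scaler to `transform` the validation and test sets, so that no test-set statistics leak into training.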

Here is a good article on the basics of feature scaling: http://sebastianraschka.com/Articles/2014_about_feature_scaling.html.

Upvotes: 2
