sklearn customized standardization of data

Question

Suppose I have a 2D numpy array:

X = np.array([
    [..., ...],
    [..., ...]])

And I want to standardize the data either with:

X = StandardScaler().fit_transform(X)

or:

X = (X - X.mean())/X.std()

The results are different. Why are they different?

FChm · Accepted Answer

Assume X is a feature matrix of shape (n x m), with n instances and m features. We want to scale each feature so that its values have zero mean and unit variance.

To do this, you need to calculate the mean and standard deviation of each feature (each column of X) over the provided instances and then scale each feature vector with its own statistics. Currently you are calculating the mean and standard deviation of the whole dataset and scaling all the data with these two values. That only agrees with per-feature scaling in a few special cases, e.g. when every column happens to have the same mean and standard deviation.
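To make the difference concrete, here is a small sketch with made-up data (the name X_demo, the two columns and their scales are assumptions chosen for illustration). Scaling with the global statistics leaves each column with a non-zero mean and a non-unit standard deviation, while per-column statistics do not.

```python
import numpy as np

# Hypothetical data: two features on very different scales.
rng = np.random.default_rng(0)
X_demo = np.c_[rng.normal(loc=50.0, scale=20.0, size=100),
               rng.normal(loc=0.0, scale=1.0, size=100)]

# Global statistics: a single mean and std for the whole matrix.
X_global = (X_demo - X_demo.mean()) / X_demo.std()

# Per-feature statistics: one mean and std per column (axis=0).
X_per_feature = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Column means and stds after global scaling: not (0, 0) and (1, 1).
print(X_global.mean(axis=0), X_global.std(axis=0))

# Column means and stds after per-feature scaling: approximately (0, 0) and (1, 1).
print(X_per_feature.mean(axis=0), X_per_feature.std(axis=0))
```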

Practically, to calculate these statistics for each feature, set the axis parameter of the .mean() and .std() methods to 0. This performs the calculation along the columns and returns one value per column: conceptually a (1 x m) array, strictly an (m,) array, where each value is the mean or standard deviation of the corresponding column. NumPy broadcasting then applies these values to every row when you scale the feature matrix.
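The shapes involved can be checked on a tiny matrix. This is only a sketch: X_demo and its values are made up, and keepdims=True is shown just to contrast the (m,) result with an explicit (1 x m) row.

```python
import numpy as np

# Hypothetical matrix: n = 4 instances, m = 3 features.
X_demo = np.arange(12, dtype=float).reshape(4, 3)

col_mean = X_demo.mean(axis=0)                    # one value per column
col_mean_2d = X_demo.mean(axis=0, keepdims=True)  # same values, kept as a row

print(col_mean.shape)     # (3,)
print(col_mean_2d.shape)  # (1, 3)

# Either form broadcasts against the (4, 3) matrix, row by row.
centered = X_demo - col_mean
print(centered.mean(axis=0))  # [0. 0. 0.]
```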

The example below shows how to implement it manually. x1 and x2 are 2 features with 100 training instances each, stored in a feature matrix X.

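import numpy as np
from sklearn.preprocessing import StandardScaler

# two features with 100 instances each, stacked as columns
x1 = np.linspace(0, 100, 100)
x2 = 10 * np.random.normal(size=100)
X = np.c_[x1, x2]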

# scale the data using the sklearn implementation
X_scaled = StandardScaler().fit_transform(X)

# scale the data taking mean and std along columns
X_scaled_manual = (X - X.mean(axis=0)) / X.std(axis=0)

If you print the two you will see they match up to floating-point rounding. Explicitly:

print(np.allclose(X_scaled, X_scaled_manual))

returns True.
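One detail matters when comparing against other tools: StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default, so the manual formula needs no extra arguments. The sketch below inspects the fitted statistics through the scaler's mean_ and scale_ attributes; the data and the name X_demo are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data, similar in shape to the example above.
rng = np.random.default_rng(0)
X_demo = np.c_[np.linspace(0, 100, 100), 10 * rng.normal(size=100)]

scaler = StandardScaler().fit(X_demo)

# The fitted per-column statistics agree with the NumPy ones.
print(np.allclose(scaler.mean_, X_demo.mean(axis=0)))          # True
print(np.allclose(scaler.scale_, X_demo.std(axis=0, ddof=0)))  # True

# The sample standard deviation (ddof=1) is slightly larger and does not match.
print(np.allclose(scaler.scale_, X_demo.std(axis=0, ddof=1)))  # False
```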
