Zhubarb

Reputation: 11915

StandardScaler giving non-uniform standard deviation

My setup is as follows: Python 3.7, pandas 1.0.3, and sklearn 0.22.1. I am applying a StandardScaler to every column of a float matrix, as usual. However, the columns I get out do not have a standard deviation of 1, although their means are (approximately) 0.

I am not sure what is going wrong here. I have checked whether the scaler got confused and standardised the rows instead, but that does not seem to be the case.

from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
np.random.seed(1)
row_size = 5
n_obs = 100
X = pd.DataFrame(np.random.randint(0, 1000, n_obs).reshape((row_size, n_obs // row_size)))

scaler = StandardScaler()
scaler.fit(X)
X_out = scaler.transform(X)
X_out = pd.DataFrame(X_out)

All columns have standard deviation 1.1180... as opposed to 1.

X_out[0].mean()
>>Out[2]: 4.4408920985006264e-17
X_out[0].std()
>>Out[3]: 1.1180339887498947

EDIT: I have realised that as I increase row_size above, e.g. from 5 to 10 and 100, the standard deviation of the columns approaches 1. So maybe this has to do with the bias of the variance estimator getting smaller as n increases(?). However, it does not make sense that I can get unit variance by manually implementing (col[i] - col[i].mean()) / col[i].std() while the StandardScaler struggles...

Upvotes: 4

Views: 1028

Answers (1)

Niklas Mertsch

Reputation: 1489

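NumPy and pandas use different default definitions of standard deviation: NumPy's std divides by n (ddof=0, the biased estimator), while pandas' std divides by n - 1 (ddof=1, the unbiased one). Sklearn uses the NumPy definition, so scaler.transform(X).std(axis=0) gives 1 for every column.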

But then you wrap the standardized values X_out in a pandas DataFrame and ask pandas for the standard deviation of the same values. Pandas divides by n - 1 instead of n, so with n = 5 rows you get sqrt(5/4) ≈ 1.118 instead of 1, which is exactly your observation.
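To make the mismatch concrete, here is a minimal sketch that rebuilds the data from the question and measures the same scaled values under both conventions. The variable names (std_population, std_sample, expected_ratio) are only illustrative, and it assumes the legacy RandomState(1) generator reproduces the question's seeded draw.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same shape as in the question: 5 rows (observations), 20 columns.
rng = np.random.RandomState(1)
X = pd.DataFrame(rng.randint(0, 1000, size=(5, 20)))

X_out = pd.DataFrame(StandardScaler().fit_transform(X))
n = len(X_out)

# NumPy convention (divide by n): what StandardScaler normalises to.
std_population = X_out.std(ddof=0)

# pandas default (divide by n - 1): what the question measured.
std_sample = X_out.std()

# The two differ by a constant factor that depends only on n.
expected_ratio = np.sqrt(n / (n - 1))

print(std_population.round(6).unique())
print(std_sample.round(6).unique())
print(expected_ratio)
```

Since the factor is sqrt(n / (n - 1)), it tends to 1 as the number of rows grows, which matches the behaviour described in the question's EDIT.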

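In most cases you only care that all columns have the same spread, so the difference does not matter. But if you really want unit standard deviation under the unbiased (ddof=1) definition, the StandardScaler from sklearn will not give you that out of the box; you have to standardise manually or rescale its output by sqrt((n - 1)/n).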
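If the ddof=1 convention is what you need, one option is to standardise in pandas directly, as the question's EDIT already hints. This is a sketch under the same assumptions as the question's data; X_unbiased and sample_std are made-up names for illustration.

```python
import numpy as np
import pandas as pd

# Same data shape as in the question: 5 rows, 20 columns.
rng = np.random.RandomState(1)
X = pd.DataFrame(rng.randint(0, 1000, size=(5, 20)))

# Column-wise standardisation using the sample standard deviation (ddof=1).
X_unbiased = (X - X.mean()) / X.std(ddof=1)

# Measured with the same convention, every column now has std 1.
sample_std = X_unbiased.std(ddof=1)
print(sample_std.round(6).unique())
```

Unlike a fitted StandardScaler, this does not store the training mean and scale for later reuse on new data, so you would need to keep X.mean() and X.std(ddof=1) yourself if you want to transform a test set consistently.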

Upvotes: 3
