Reputation: 1
[10 15 18 11]
[15 17 24 16]
[13 13 20 14]
[12 20 10 25]
[12 11 14 11]
I have this data, and I'm trying to scale it using sklearn.preprocessing.StandardScaler:
import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled, columns=df.columns)
scaled_df.head()
This outputs:
array([[-1.32680694, -0.06401844, 0.16552118, -0.85248268],
[ 1.73505523, 0.57616596, 1.40693001, 0.11624764],
[ 0.20412415, -0.70420284, 0.57932412, -0.27124449],
[-0.30618622, 1.53644256, -1.4896906 , 1.85996222],
[-0.30618622, -1.34438724, -0.66208471, -0.85248268]])
I know this is wrong, since the covariance matrix shows a variance of 1.25 when by definition it should be 1. The original data is correctly saved in the df variable. If I standardize the data manually I get the correct result, so I really don't know what's going on with the scaler.
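For concreteness, this is roughly what I mean by standardizing manually (a minimal sketch, assuming the data above is already in df; note that df.std() uses pandas' default ddof=1):

manual = (df - df.mean()) / df.std()  # divide by the sample std (ddof=1)
manual.cov()                          # diagonal of the covariance matrix comes out as 1.0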
Upvotes: 0
Views: 41
Reputation: 46988
Most likely you are checking the result with the pandas method std (or cov), whose delta degrees of freedom (ddof) defaults to 1. StandardScaler standardizes with the population standard deviation, i.e. ddof = 0, the same default that numpy.std uses. If you set ddof = 0 in the pandas call, the variance comes out as 1.
To illustrate:
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = [[10, 15, 18, 11], [15, 17, 24, 16], [13, 13, 20, 14],
        [12, 20, 10, 25], [12, 11, 14, 11]]
df = pd.DataFrame(data)

scaler = StandardScaler()
scaled = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled, columns=df.columns)
scaled_df
0 1 2 3
0 -1.477098 -0.064018 0.165521 -0.852483
1 1.600189 0.576166 1.406930 0.116248
2 0.369274 -0.704203 0.579324 -0.271244
3 -0.246183 1.536443 -1.489691 1.859962
4 -0.246183 -1.344387 -0.662085 -0.852483
scaled_df.std()
0 1.118034
1 1.118034
2 1.118034
3 1.118034
scaled_df.std(ddof=0)
0 1.0
1 1.0
2 1.0
3 1.0
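This also accounts for the 1.25 in the question: with n = 5 rows, a column with unit population variance has sample variance n / (n - 1) = 5 / 4 = 1.25, which is exactly what the pandas cov and var methods report by default. A quick check (reusing scaled and scaled_df from above):

import numpy as np

scaled_df.var()         # ddof=1 by default -> 1.25 for every column
np.std(scaled, axis=0)  # ddof=0 by default -> 1.0 for every column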
Upvotes: 1