sklearn StandardScaler outputting wrong matrix

[10 15 18 11]
[15 17 24 16]
[13 13 20 14]
[12 20 10 25]
[12 11 14 11]

I have this data, and I'm trying to scale it using sklearn.preprocessing.StandardScaler:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# df already holds the data shown above
scaler = StandardScaler()

scaled = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled, columns=df.columns)
scaled_df.head()

This outputs:

array([[-1.32680694, -0.06401844,  0.16552118, -0.85248268],
       [ 1.73505523,  0.57616596,  1.40693001,  0.11624764],
       [ 0.20412415, -0.70420284,  0.57932412, -0.27124449],
       [-0.30618622,  1.53644256, -1.4896906 ,  1.85996222],
       [-0.30618622, -1.34438724, -0.66208471, -0.85248268]])

I know this is wrong because the covariance matrix of the scaled data shows a variance of 1.25 on the diagonal, when by definition it should be 1. The original data is correctly saved in the df variable, and if I standardize the data manually I get the correct result, so I really don't know what's going on with the scaler.
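For reference, this is roughly the manual standardization I'm comparing against (just a sketch, using the pandas defaults):

manual = (df - df.mean()) / df.std()  # pandas std() defaults to ddof=1
manual.std()                          # 1.0 in every column, which is what I expect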

Upvotes: 0

Views: 41

Answers (1)

StupidWolf

Reputation: 46988

Most likely you are checking with the pandas method std, whose degrees of freedom (ddof) default to 1. StandardScaler divides by the population standard deviation, which is what numpy.std computes with its default ddof of 0. If you check with ddof=0, the scaled columns have a standard deviation of exactly 1.

To illustrate:

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = [[10, 15, 18, 11], [15, 17, 24, 16], [13, 13, 20, 14],
        [12, 20, 10, 25], [12, 11, 14, 11]]

df = pd.DataFrame(data)
scaler = StandardScaler()
scaled = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled, columns=df.columns)

scaled_df

          0         1         2         3
0 -1.477098 -0.064018  0.165521 -0.852483
1  1.600189  0.576166  1.406930  0.116248
2  0.369274 -0.704203  0.579324 -0.271244
3 -0.246183  1.536443 -1.489691  1.859962
4 -0.246183 -1.344387 -0.662085 -0.852483

scaled_df.std()

0    1.118034
1    1.118034
2    1.118034
3    1.118034

scaled_df.std(ddof=0)

0    1.0
1    1.0
2    1.0
3    1.0
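You can also reproduce the scaler by hand with the population statistics (a quick sketch, reusing df and scaled from above):

import numpy as np

# StandardScaler divides by the population standard deviation (ddof=0)
manual = (df - df.mean()) / df.std(ddof=0)
np.allclose(manual.values, scaled)   # True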

Upvotes: 1
