Reputation: 87
I used sklearn's StandardScaler (mean removal and variance scaling) to scale a dataframe and compared it to a dataframe where I "manually" subtracted the mean and divided by the standard deviation. The comparison shows consistent small differences. Can anybody explain why? (The dataset I used is this: http://archive.ics.uci.edu/ml/datasets/Wine)
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.read_csv("~/DataSets/WineDataSetItaly/wine.data.txt", names=["Class", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins", "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline"])
cols = list(df.columns)[1:] # I didn't want to scale the "Class" column
std_scal = StandardScaler()
standardized = std_scal.fit_transform(df[cols])
df_standardized_fit = pd.DataFrame(standardized, index=df.index, columns=df.columns[1:])
df_standardized_manual = (df - df.mean()) / df.std()
df_standardized_manual.drop("Class", axis=1, inplace=True)
df_differences = df_standardized_fit - df_standardized_manual
df_differences.iloc[:,:5]
    Alcohol  Malic acid       Ash  Alcalinity of ash  Magnesium
0  0.004272   -0.001582  0.000653          -0.003290   0.005384
1  0.000693   -0.001405 -0.002329          -0.007007   0.000051
2  0.000554    0.000060  0.003120          -0.000756   0.000249
3  0.004758   -0.000976  0.001373          -0.002276   0.002619
4  0.000832    0.000640  0.005177           0.001271   0.003606
5  0.004168   -0.001455  0.000858          -0.003628   0.002421
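For reference, the same kind of offset shows up on a tiny frame with made-up values, so it is not specific to the Wine data:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Arbitrary toy data, just to reproduce the mismatch on something small.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 80.0]})

scaled_sklearn = StandardScaler().fit_transform(toy)    # sklearn scaling
scaled_pandas = (toy - toy.mean()) / toy.std()          # "manual" pandas scaling
print(scaled_sklearn - scaled_pandas.values)            # nonzero in every cell
(The offsets are much larger here than in my table above, presumably because the toy frame only has 4 rows.)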
Upvotes: 5
Views: 6024
Reputation:
scikit-learn uses np.std, which by default computes the population standard deviation (the sum of squared deviations is divided by the number of observations N), whereas pandas computes the sample standard deviation (the denominator is N - 1); see Wikipedia's standard deviation article. The N - 1 denominator is a correction factor that gives an unbiased estimate of the population variance, and it is controlled by the degrees-of-freedom parameter ddof. So by default numpy's and scikit-learn's calculations use ddof=0, whereas pandas uses ddof=1 (docs):
DataFrame.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
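A quick way to see the two defaults side by side (the numbers below are just an arbitrary five-element sample):
import numpy as np
import pandas as pd

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # arbitrary sample
s = pd.Series(values)

print(np.std(values))    # numpy default ddof=0 (population): ~1.4142
print(s.std())           # pandas default ddof=1 (sample):    ~1.5811
print(s.std(ddof=0))     # matches numpy once ddof is aligned: ~1.4142

# The two only differ by the factor sqrt((n - 1) / n):
n = len(values)
print(s.std() * np.sqrt((n - 1) / n))          # ~1.4142 again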
If you change your manual pandas calculation to:
df_standardized_manual = (df - df.mean()) / df.std(ddof=0)
The differences will be practically zero:
Alcohol Malic acid Ash Alcalinity of ash Magnesium
0 -8.215650e-15 -5.551115e-16 3.191891e-15 0.000000e+00 2.220446e-16
1 -8.715251e-15 -4.996004e-16 3.441691e-15 0.000000e+00 0.000000e+00
2 -8.715251e-15 -3.955170e-16 2.886580e-15 -5.551115e-17 1.387779e-17
3 -8.437695e-15 -4.440892e-16 3.164136e-15 -1.110223e-16 1.110223e-16
4 -8.659740e-15 -3.330669e-16 2.886580e-15 5.551115e-17 2.220446e-16
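You can also verify this against the fitted scaler itself. StandardScaler stores the statistics it used in its mean_ and scale_ attributes, and scale_ is the ddof=0 standard deviation of each column (this snippet assumes the df, cols and std_scal objects from your question are still in scope):
import numpy as np

print(np.allclose(std_scal.mean_, df[cols].mean()))         # True
print(np.allclose(std_scal.scale_, df[cols].std(ddof=0)))   # True
print(np.allclose(std_scal.scale_, df[cols].std(ddof=1)))   # False

# With ddof aligned, the two scaled results agree up to floating-point noise.
manual = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)
print(np.allclose(std_scal.transform(df[cols]), manual))    # True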
Upvotes: 16