Reputation: 4864
Consider the following code:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

dog = np.random.rand(10, 10)
frog = pd.DataFrame(dog, columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
slog = StandardScaler()
mog = slog.fit_transform(frog.values)
frog[frog.columns] = mog
OK, now frog should be a dataframe whose values are the standard-scaled array. But:
frog.describe()
gives:
[screenshot: output of frog.describe()]

Note that the standard deviation is 1.05, while

np.std(mog, axis=0)

gives the expected:
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
What gives?
Upvotes: 0
Views: 58
Reputation: 114811
The standard deviation computed by the describe method is the sample standard deviation, while StandardScaler uses the population standard deviation. The only difference between the two is whether the sum of the squared differences from the mean is divided by n-1 (for the sample standard deviation) or by n (for the population standard deviation).
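That difference in denominator is exactly where the 1.05 comes from: if the population standard deviation is 1, the sample standard deviation is larger by a factor of sqrt(n/(n-1)), and with the n = 10 rows used in the question that is sqrt(10/9) ≈ 1.05409255. A minimal sketch verifying the ratio (the data here is illustrative, not the question's array):

import numpy as np

n = 10
x = np.random.rand(n)  # one column of arbitrary data

# population standard deviation: divide the sum of squares by n
pop_std = np.sqrt(((x - x.mean()) ** 2).sum() / n)
# sample standard deviation: divide by n - 1 instead
samp_std = np.sqrt(((x - x.mean()) ** 2).sum() / (n - 1))

print(samp_std / pop_std)    # ratio of the two estimates
print(np.sqrt(n / (n - 1)))  # sqrt(10/9) ≈ 1.05409255 -- the same number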
numpy.std computes the population standard deviation by default, but you can make it compute the sample standard deviation by adding the argument ddof=1, and then the result agrees with the values computed by describe:
In [54]: np.std(mog, axis=0, ddof=1)
Out[54]:
array([1.05409255, 1.05409255, 1.05409255, 1.05409255, 1.05409255,
1.05409255, 1.05409255, 1.05409255, 1.05409255, 1.05409255])
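Going the other way, pandas' describe reports the same sample estimate as DataFrame.std, whose ddof parameter defaults to 1; passing ddof=0 asks for the population value instead. A quick check, assuming the frog dataframe from the question is still in scope:

# describe()/std() default to the sample estimate (ddof=1);
# ddof=0 gives the population standard deviation that StandardScaler targets
frog.std(ddof=0)  # 1.0 for every column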
Upvotes: 1