Igor Rivin

Reputation: 4864

Going from a NumPy array to a pandas DataFrame changes values

Consider the following code:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

dog = np.random.rand(10, 10)
frog = pd.DataFrame(dog, columns=list('abcdefghij'))
slog = StandardScaler()
mog = slog.fit_transform(frog.values)  # standardize each column
frog[frog.columns] = mog  # write the scaled values back into the dataframe

OK, now the dataframe's values should be the standard-scaled array. But:

frog.describe()

gives:

[Image: output of frog.describe()]

Note that the standard deviation is 1.05, not 1.

While

np.std(mog, axis=0)

gives the expected:

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

What gives?

Upvotes: 0

Views: 58

Answers (1)

Warren Weckesser

Reputation: 114811

The describe method reports the sample standard deviation, while StandardScaler scales by the population standard deviation. The only difference between the two is whether the sum of the squared differences from the mean is divided by n-1 (sample standard deviation) or n (population standard deviation). With n = 10 rows, the two differ by a factor of sqrt(10/9) ≈ 1.054, which is the 1.05 that describe shows.
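As a rough illustration of the n versus n-1 difference, here is a minimal sketch that computes both versions by hand for a single column. The seed and the variable names (x, ss, pop_std, sample_std) are made up for this example and are not from the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
x = rng.random(10)              # one column with n = 10 values
n = x.size

# Sum of squared differences from the mean, shared by both formulas.
ss = np.sum((x - x.mean()) ** 2)

pop_std = np.sqrt(ss / n)           # population: divide by n
sample_std = np.sqrt(ss / (n - 1))  # sample: divide by n - 1

# The ratio depends only on n, not on the data.
ratio = sample_std / pop_std
print(ratio, np.sqrt(n / (n - 1)))
```

Both printed numbers should be about 1.054 for n = 10, so data scaled to a population standard deviation of 1 will show a sample standard deviation of about 1.054.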

numpy.std computes the population standard deviation by default, but you can use it to compute the sample standard deviation by adding the argument ddof=1, and the result then agrees with the values computed by describe:

In [54]: np.std(mog, axis=0, ddof=1)
Out[54]: 
array([1.05409255, 1.05409255, 1.05409255, 1.05409255, 1.05409255,
       1.05409255, 1.05409255, 1.05409255, 1.05409255, 1.05409255])
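The same ddof argument also works in the other direction on the pandas side: DataFrame.std defaults to ddof=1, and passing ddof=0 gives the population value. A minimal sketch, assuming a 10x10 random array like the one in the question; to keep it free of scikit-learn, the columns are standardized by hand with the population standard deviation, which is an assumption standing in for StandardScaler's default behavior.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
data = rng.random((10, 10))

# Standardize each column by hand using the population std (ddof=0),
# standing in for StandardScaler here.
scaled = (data - data.mean(axis=0)) / data.std(axis=0)
df = pd.DataFrame(scaled, columns=list("abcdefghij"))

pop = df.std(ddof=0)  # population standard deviation per column
samp = df.std()       # pandas default is ddof=1, the sample standard deviation

print(pop.round(6).tolist())
print(samp.round(6).tolist())
```

The first line should show 1.0 for every column and the second about 1.054093, matching the two results in the thread.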

Upvotes: 1
