luckyluke

Reputation: 654

Pandas variance and standard deviation results differ from manual calculation

I'm trying to find the mean, variance and standard deviation using pandas. However, my manual calculation differs from the pandas output. Is there anything I'm missing when using pandas? I've attached an Excel screenshot for reference: Mean = 394, Variance = 21704, SD = 147.32.

import pandas as pd

dg_df = pd.DataFrame(
            data=[600,470,170,430,300],
            index=['a','b','c','d','e'])

print(dg_df.mean(axis=0)) # 394.0 matches with manual calculation
print(dg_df.var())        # 27130.0 not matching with manual calculation 21704
print(dg_df.std(axis=0))  # 164.71187 not matching with manual calculation 147.32

Upvotes: 4

Views: 11272

Answers (3)

Rudolf Morkovskyi

Reputation: 869

You can also use dg_df.describe(), which returns the following DataFrame. It may be easier to read at a glance:

count      5.00000
mean     394.00000
std      164.71187
min      170.00000
25%      300.00000
50%      430.00000
75%      470.00000
max      600.00000

You can then pull out the statistic you need with .loc, for example dg_df.describe().loc['count'].
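One thing worth noting: the std row reported by describe() is the sample standard deviation (ddof=1), so it will not match an Excel STDEV.P result either. A minimal sketch, assuming the same dg_df as in the question (the column label 0 is simply what pandas assigns when no column names are given):

```python
import pandas as pd

dg_df = pd.DataFrame(data=[600, 470, 170, 430, 300],
                     index=['a', 'b', 'c', 'd', 'e'])

summary = dg_df.describe()

# Pick one statistic for the single column, which is labelled 0 by default.
sample_std = summary.loc['std', 0]     # sample std (ddof=1), as shown by describe()
count = summary.loc['count', 0]

# describe() has no ddof switch, so compute the population std directly.
pop_std = dg_df[0].std(ddof=0)

print(sample_std, pop_std)
```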

Upvotes: 1

jpp

Reputation: 164613

There is more than one definition of standard deviation. You are calculating the equivalent of Excel STDEV.P, which has the description: "Calculates standard deviation based on the entire population...". If you need sample standard deviation in Excel use STDEV.S.

pd.DataFrame.std uses ddof=1 (delta degrees of freedom) by default, which gives the sample standard deviation.

numpy.std uses ddof=0 by default, which gives the population standard deviation.

See Bessel's correction to understand the difference between sample and population.
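To make the difference concrete, here is a rough sketch using only the standard library and the numbers from the question. The variable names are just for illustration. The only change between the two results is the divisor: n for the population version, n - 1 for the sample version (Bessel's correction).

```python
import math
import statistics

data = [600, 470, 170, 430, 300]
n = len(data)

mean = sum(data) / n                          # 394.0
ss = sum((x - mean) ** 2 for x in data)       # sum of squared deviations

pop_var = ss / n            # divide by n     -> 21704.0 (Excel VAR.P / STDEV.P)
sample_var = ss / (n - 1)   # divide by n - 1 -> 27130.0 (pandas default)

pop_sd = math.sqrt(pop_var)
sample_sd = math.sqrt(sample_var)

# Cross-check against the statistics module, which offers both definitions.
assert math.isclose(statistics.pvariance(data), pop_var)
assert math.isclose(statistics.variance(data), sample_var)

print(pop_var, sample_var, pop_sd, sample_sd)
```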

You can also specify ddof=0 with Pandas std / var methods:

dg_df.std(ddof=0)
dg_df.var(ddof=0)
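As a quick side-by-side check of the two defaults, something like the following should work. It uses a Series instead of the DataFrame from the question purely to get scalar results, and the variable names are illustrative:

```python
import numpy as np
import pandas as pd

values = [600, 470, 170, 430, 300]
s = pd.Series(values)

pandas_default = s.std()                 # ddof=1 by default (sample)
numpy_default = np.std(values)           # ddof=0 by default (population)

# Either library can produce the other definition by passing ddof explicitly.
pandas_pop = s.std(ddof=0)
numpy_sample = np.std(values, ddof=1)

print(pandas_default, numpy_sample)      # both are the sample SD
print(numpy_default, pandas_pop)         # both are the population SD
```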

Upvotes: 6

jezrael

Reputation: 862406

Change the default parameter ddof=1 (Delta Degrees of Freedom) to 0 in both DataFrame.var and DataFrame.std. The parameter axis=0 is the default, so it can be omitted:

print(dg_df.mean())
0    394.0
dtype: float64

print(dg_df.var(ddof=0))  
0    21704.0
dtype: float64

print(dg_df.std(ddof=0))
0    147.322775
dtype: float64

Upvotes: 4
