pandas standard error calculation issue

Question

I’m having some problems using pandas to get the right standard error value for some data. Here is how to reproduce the problem.

import pandas as pd

# get the data
data = {'subject_number': [1, 7, 8, 9, 10, 13],
        'condition_number': [1, 2, 1, 1, 1, 2],
        'pre-score': [26.4495, 58.9345, 73.345, 41.081, 36.016, 8.4415],
        'post-score': [49.71, 52.178, 44.0825, 52.711, 13.506, 39.7675]}

dataset = pd.DataFrame(data)

# get means (drop() returns a new frame, so assign the result)
means = dataset.groupby('condition_number').mean()
means = means.drop('subject_number', axis=1)

# get stdevs
stdevs = dataset.groupby('condition_number').std()
stdevs = stdevs.drop('subject_number', axis=1)

# get standard errors
sems = dataset.groupby('condition_number').sem()
sems = sems.drop('subject_number', axis=1)

This all runs without errors, but when I checked the results in Excel I found a discrepancy. The means and standard deviations match, but sem() returns what I take to be the uncorrected value (std / sqrt(n)) instead of the sample-corrected value I expected (std / sqrt(n-1)). Here is the output in Excel:

Could the problem be related to the unequal n per condition? In this dataset, n = 4 for condition 1 and n = 2 for condition 2. [Sorry, the dictionary assignment changed the order of the pandas DataFrame columns.]

Can someone help to explain what is going on here?

Zero · Accepted Answer

To get std / sqrt(n-1), replace sem() with a custom aggregation:

In [846]: (dataset.groupby('condition_number')
                  .agg(lambda x: x.std() / (x.count() - 1) ** 0.5))
Out[846]:
                  post-score  pre-score  subject_number
condition_number
1                  10.405407  11.743628        2.357023
2                   8.775549  35.703943        4.242641
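
For comparison, here is a minimal sketch that puts the built-in sem() next to both formulas from the question. It assumes the default ddof=1, in which case sem() is the sample standard deviation divided by sqrt(n). The variable names (grouped, sem_pandas, sem_sqrt_n, sem_sqrt_n_minus_1, ratio) are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd

data = {'subject_number': [1, 7, 8, 9, 10, 13],
        'condition_number': [1, 2, 1, 1, 1, 2],
        'pre-score': [26.4495, 58.9345, 73.345, 41.081, 36.016, 8.4415],
        'post-score': [49.71, 52.178, 44.0825, 52.711, 13.506, 39.7675]}
dataset = pd.DataFrame(data)

# Only the score columns are of interest here.
grouped = dataset.groupby('condition_number')[['pre-score', 'post-score']]

n = grouped.count()   # observations per group and column
std = grouped.std()   # sample standard deviation (ddof=1 by default)

sem_pandas = grouped.sem()                  # built-in standard error
sem_sqrt_n = std / np.sqrt(n)               # std / sqrt(n)
sem_sqrt_n_minus_1 = std / np.sqrt(n - 1)   # formula used in the answer above

# With the defaults, the built-in result equals std / sqrt(n).
print(np.allclose(sem_pandas.to_numpy(), sem_sqrt_n.to_numpy()))

# The two formulas differ by a factor of sqrt(n / (n - 1)),
# so the gap is larger for the smaller group.
ratio = sem_sqrt_n_minus_1 / sem_pandas
print(ratio)
```

If this reading is right, the unequal group sizes do not cause the discrepancy. They only change its size: the factor is sqrt(4/3) for condition 1 and sqrt(2) for condition 2.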
