Reputation: 1767
I’m having some problems using pandas to get the right standard error value for some data. Here is how to reproduce the problem.
import pandas as pd
# get the data
data = {'subject_number':[1,7,8,9,10,13],
'condition_number':[1,2,1,1,1,2],
'pre-score':[26.4495, 58.9345, 73.345, 41.081, 36.016, 8.4415],
'post-score':[49.71, 52.178, 44.0825, 52.711, 13.506, 39.7675]}
dataset = pd.DataFrame(data)
# get means (drop() returns a new frame, so assign the result)
means = dataset.groupby('condition_number').mean()
means = means.drop('subject_number', axis=1)
# get stdevs
stdevs = dataset.groupby('condition_number').std()
stdevs = stdevs.drop('subject_number', axis=1)
# get standard errors
sems = dataset.groupby('condition_number').sem()
sems = sems.drop('subject_number', axis=1)
This all runs without errors, but when I checked the results in Excel I found a discrepancy. The means and standard deviations match, but sem() calculates the uncorrected value (std / sqrt(n)) instead of the corrected value for a sample (std / sqrt(n-1)). Here is the output in Excel:
I think the problem might have something to do with the unequal n per condition: in the dataset, n = 4 for condition 1 and n = 2 for condition 2. [sorry, the dictionary assignment messed with the order of the pandas df columns...]
Can someone help to explain what is going on here?
Upvotes: 0
Views: 6538
Reputation: 76917
sem() divides by sqrt(n). To divide by sqrt(n-1) instead, compute it yourself with a custom aggregation:
In [846]: (dataset.groupby('condition_number')
                  .agg(lambda x: x.std() / (x.count() - 1) ** 0.5))
Out[846]:
                  post-score  pre-score  subject_number
condition_number
1                  10.405407  11.743628        2.357023
2                   8.775549  35.703943        4.242641
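To see where the two numbers come from, here is a minimal sketch that computes both versions side by side on the data from the question. The names `scores`, `grouped`, `sem_pandas`, `sem_n` and `sem_n_minus_1` are made up for illustration. It assumes the default `ddof=1`, in which case `sem()` should equal the sample standard deviation divided by sqrt(n).

```python
import numpy as np
import pandas as pd

data = {
    'subject_number': [1, 7, 8, 9, 10, 13],
    'condition_number': [1, 2, 1, 1, 1, 2],
    'pre-score': [26.4495, 58.9345, 73.345, 41.081, 36.016, 8.4415],
    'post-score': [49.71, 52.178, 44.0825, 52.711, 13.506, 39.7675],
}
dataset = pd.DataFrame(data)

# Drop the ID column first so it is not aggregated with the scores.
scores = dataset.drop(columns='subject_number')
grouped = scores.groupby('condition_number')

n = grouped.count()      # observations per condition and column
std = grouped.std()      # sample standard deviation (ddof=1)

sem_pandas = grouped.sem()            # what pandas reports by default
sem_n = std / np.sqrt(n)              # std / sqrt(n)
sem_n_minus_1 = std / np.sqrt(n - 1)  # std / sqrt(n - 1), as in the question

print(sem_pandas)
print(sem_n_minus_1)
```

If `sem_pandas` and `sem_n` agree, the gap to the Excel figures comes only from the divisor, not from the unequal group sizes.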
Upvotes: 3