fffrost

Reputation: 1767

pandas standard error calculation issue

I’m having some problems using pandas to get the right standard error value for some data. Here is how to reproduce the problem.

import pandas as pd

# get the data
data = {'subject_number':[1,7,8,9,10,13],
       'condition_number':[1,2,1,1,1,2],
       'pre-score':[26.4495, 58.9345, 73.345, 41.081, 36.016, 8.4415],
       'post-score':[49.71, 52.178, 44.0825, 52.711, 13.506, 39.7675]}

dataset = pd.DataFrame(data)

# get means (assign the result: drop() returns a new DataFrame)
means = dataset.groupby('condition_number').mean()
means = means.drop('subject_number', axis=1)

# get stdevs
stdevs = dataset.groupby('condition_number').std()
stdevs = stdevs.drop('subject_number', axis=1)

# get standard errors
sems = dataset.groupby('condition_number').sem()
sems = sems.drop('subject_number', axis=1)

This all runs without errors, but when I checked the results in Excel I found a discrepancy. The means and standard deviations match, but sem appears to return the uncorrected value (std / sqrt(n)) rather than the sample-corrected value (std / sqrt(n-1)). Here is the output in Excel:

[image: screenshot of the Excel output]

I think the problem might have something to do with the unequal n per condition. As the dataset shows, n = 4 for condition 1 and n = 2 for condition 2. [Sorry, the dictionary assignment changed the order of the pandas DataFrame columns...]

Can someone help to explain what is going on here?

Upvotes: 0

Views: 6538

Answers (1)

Zero

Reputation: 76917

By default, sem() divides the sample standard deviation by sqrt(n). To divide by sqrt(n-1) instead, compute the value yourself with agg:

In [846]: (dataset.groupby('condition_number')
                  .agg(lambda x: x.std()/x.count().add(-1).pow(0.5)))
Out[846]:
                  post-score  pre-score  subject_number
condition_number
1                  10.405407  11.743628        2.357023
2                   8.775549  35.703943        4.242641
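
To see where the two numbers come from, here is a minimal sketch that computes both versions side by side on the question's data. The names scores, pandas_sem and n_minus_1_sem are just illustrative. It assumes the default ddof=1 for both std() and sem(), and it drops subject_number first so only the score columns are summarised.

```python
import numpy as np
import pandas as pd

data = {'subject_number': [1, 7, 8, 9, 10, 13],
        'condition_number': [1, 2, 1, 1, 1, 2],
        'pre-score': [26.4495, 58.9345, 73.345, 41.081, 36.016, 8.4415],
        'post-score': [49.71, 52.178, 44.0825, 52.711, 13.506, 39.7675]}
dataset = pd.DataFrame(data)

# Group only the score columns by condition
scores = dataset.drop(columns='subject_number').groupby('condition_number')

n = scores.count()   # group sizes: 4 for condition 1, 2 for condition 2
std = scores.std()   # sample standard deviation (ddof=1)

# What pandas reports: sample std divided by sqrt(n)
pandas_sem = scores.sem()

# The value described in the question: sample std divided by sqrt(n - 1)
n_minus_1_sem = std / np.sqrt(n - 1)

print(pandas_sem)
print(n_minus_1_sem)
```

The two results differ by a factor of sqrt(n / (n-1)) within each group, so the gap is larger for condition 2 (n=2) than for condition 1 (n=4). The unequal group sizes change the size of the gap, but they do not cause it.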

Upvotes: 3
