Reputation: 45
I'm trying to run a t-test in pandas/statsmodels to compare performance between two groups, but I'm having difficulty getting the data into a format that statsmodels can use.
My pandas dataframe currently looks like this:
Treatment Performance
a 2
b 3
a 2
a 1
b 0
And it's my understanding that to perform a t-test I need the data organized by treatment, like so:
TreatmentA TreatmentB
2 3
2 0
1
This code almost does the trick:
cat1 = df.groupby('Treatment', as_index=False).groups['a']
cat2 = df.groupby('Treatment', as_index=False).groups['b']
print(ttest_ind(cat1, cat2))
But when I print, it looks like it's pulling the indices where that treatment occurred instead of the performance values:
print(cat1)
[0, 2, 4, 5, 9, 10, 11, 16, 18,...131, 133, 142, 147, 152, 153, 156, 157, 158]
It [maybe?] needs to be something more like this:
print(cat1)
[2, 2, 1, ...0, 3, 1, 1, 0, 2, 0, 0, 0]
What is the best way to convert this dataframe into a format that I can perform t-tests on?
Upvotes: 1
Views: 916
Reputation: 3947
I think the simplest way is to do it like this:
a = df.loc[df['Treatment'] == 'a', 'Performance']
b = df.loc[df['Treatment'] == 'b', 'Performance']
ttest_ind(a, b)
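As for why the groupby attempt in the question printed indices: `.groups` maps each group label to the row labels belonging to that group, not to the column values. If you would rather stay with groupby, `.get_group` returns the values themselves. Here is a minimal sketch on the five example rows from the question. It assumes `ttest_ind` comes from `scipy.stats`, since the question does not show its import; `statsmodels.stats.weightstats.ttest_ind` also accepts two samples, though it returns a slightly different result.

```python
import pandas as pd
from scipy.stats import ttest_ind  # assumption: the question does not say where ttest_ind comes from

# Toy data copied from the five-row example in the question
df = pd.DataFrame({
    "Treatment": ["a", "b", "a", "a", "b"],
    "Performance": [2, 3, 2, 1, 0],
})

# Group only the Performance column by Treatment
grouped = df.groupby("Treatment")["Performance"]

# .groups gives the row labels for each treatment (what the question's code printed)
index_a = list(grouped.groups["a"])

# .get_group gives the Performance values for each treatment
perf_a = grouped.get_group("a")
perf_b = grouped.get_group("b")

# The two samples may have different lengths; ttest_ind handles that
result = ttest_ind(perf_a, perf_b)
print(result.statistic, result.pvalue)
```

There is no need to build the two-column TreatmentA/TreatmentB table first: the test only needs two separate sequences of values, and they can have different lengths.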
Hope it helps.
Upvotes: 1