Reputation: 23
I have a large dataframe similar to this one:
In [1]: grades
Out[1]:
                          course1  course2
school  class  student
school1 class1 student1         2        2
               student2         3        2
               student3         1        3
               student4         3        1
               student5         3        1
...                           ...      ...
        class3 student86        3        1
               student87        2        2
               student88        1        1
               student89        3        3
               student90        0        1
[90 rows x 2 columns]
I want to compute the Mann-Whitney rank test on the grades from the sample school and each sub-sample class. How can I do this using pandas and scipy.stats.mannwhitneyu without iterating through the dataframe?
Upvotes: 2
Views: 7694
Reputation: 251498
What you want to do is groupby on the index levels and apply a function that calls mannwhitneyu, passing the two columns course1 and course2. Suppose this is your data:
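import numpy as np
import pandas
from scipy import stats
# Build a three-level (school, class, student) index: 3 * 3 * 10 = 90 rows
index = pandas.MultiIndex.from_product([
    ['school{0}'.format(n) for n in range(3)],
    ['class{0}'.format(n) for n in range(3)],
    ['student{0}'.format(n) for n in range(10)]
])
# Random integer grades in [0, 10) for the two courses
d = pandas.DataFrame({'course1': np.random.randint(0, 10, 90),
                      'course2': np.random.randint(0, 10, 90)},
                     index=index)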
Then to compute Mann-Whitney U by school:
>>> d.groupby(level=0).apply(lambda t: stats.mannwhitneyu(t.course1, t.course2))
school0 (426.5, 0.365937834646)
school1 (445.0, 0.473277409673)
school2 (421.0, 0.335714211748)
dtype: object
And to do it by class:
>>> d.groupby(level=[0, 1]).apply(lambda t: stats.mannwhitneyu(t.course1, t.course2))
school0 class0 (38.5, 0.200247279189)
class1 (37.0, 0.169040187814)
class2 (46.5, 0.409559639829)
school1 class0 (33.5, 0.110329749527)
class1 (47.5, 0.439276896563)
class2 (30.0, 0.0684355963119)
school2 class0 (47.5, 0.439438219083)
class1 (43.0, 0.308851989782)
class2 (34.0, 0.118791221444)
dtype: object
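The results above come back as one tuple per group in an object-dtype Series, which is awkward to filter or sort. As a rough sketch (not part of the original answer), the helper below returns a Series per group so that apply builds a DataFrame with separate columns. The names mwu, U and p are made up for illustration, the data is seeded random data of the same shape, and alternative="two-sided" is passed explicitly on the assumption that a reasonably recent SciPy is installed, since the default has changed between SciPy versions and the numbers shown above may not match.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Illustrative data with the same shape as the example (3 schools x 3 classes x 10 students)
index = pd.MultiIndex.from_product([
    ['school{0}'.format(n) for n in range(3)],
    ['class{0}'.format(n) for n in range(3)],
    ['student{0}'.format(n) for n in range(10)],
])
rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
d = pd.DataFrame({'course1': rng.integers(0, 10, 90),
                  'course2': rng.integers(0, 10, 90)},
                 index=index)


def mwu(group):
    """Hypothetical helper: run the test on one group, return labelled values."""
    u, p = stats.mannwhitneyu(group['course1'], group['course2'],
                              alternative='two-sided')
    return pd.Series({'U': u, 'p': p})


# One row per (school, class) pair, with columns 'U' and 'p'
result = d.groupby(level=[0, 1]).apply(mwu)
print(result)
```

With the statistics in their own columns, something like result[result['p'] < 0.05] selects the groups of interest directly.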
The numbers in the level argument to groupby refer to the levels of your MultiIndex. So grouping by level 0 groups by school, and grouping by levels 0 and 1 groups by school/class combination.
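If the index levels are named, as they appear to be in the question (school, class, student), the level argument also accepts those names, which can be easier to read than positions. The snippet below is a small sketch on made-up data; the names argument to from_product is an addition that the example data in this answer does not have.

```python
import numpy as np
import pandas as pd

# Same shape as the example data, but with named index levels (an assumption
# based on the level names shown in the question)
index = pd.MultiIndex.from_product(
    [
        ['school{0}'.format(n) for n in range(3)],
        ['class{0}'.format(n) for n in range(3)],
        ['student{0}'.format(n) for n in range(10)],
    ],
    names=['school', 'class', 'student'],
)
rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
d = pd.DataFrame({'course1': rng.integers(0, 10, 90),
                  'course2': rng.integers(0, 10, 90)},
                 index=index)

# Grouping by position and grouping by name select the same groups
by_position = d.groupby(level=[0, 1]).size()
by_name = d.groupby(level=['school', 'class']).size()
print(by_name)
```

Either form can be combined with the apply call shown earlier; only the way the levels are referred to changes.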
Upvotes: 7