Reputation: 16508
I have a dataframe that contains the number of observations per income group:
INCAGG
1 6.561681e+08
3 9.712955e+08
5 1.658043e+09
7 1.710781e+09
9 2.356979e+09
I would like to compute the median income group. What do I mean? Let's start with a simpler series:
INCAGG
1 6
3 9
5 16
7 17
9 23
It represents this set of numbers:
1 1 1 1 1 1
3 3 3 3 3 3 3 3 3
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
Which I can reorder to
1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 7 7 7 7 7
7 7 7 7 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
which visually is what I mean: the median here would be 7.
Upvotes: 3
Views: 1725
Reputation: 30434
After glancing at a numpy example here, I think cumsum() provides a good approach. Assuming your column of counts is called 'wt', here's a simple solution that will work most of the time (see below for a more general solution):
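df = df.sort_values('incagg')  # DataFrame.sort() no longer exists in current pandas
df['tmp'] = df.wt.cumsum() < (df.wt.sum() / 2.)  # True while the running count is below half the total
df['med_grp'] = (df.tmp == False) & (df.tmp.shift() == True)  # first row where tmp flips to False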
The second line of code splits the rows into those whose cumulative count is still below half of the total (True) and those that have reached it (False). The median observation will be in the first False group.
incagg wt tmp med_grp
0 1 656168100 True False
1 3 971295500 True False
2 5 1658043000 True False
3 7 1710781000 False True
4 9 2356979000 False False
df.loc[df.med_grp, 'incagg']
3 7
Name: incagg, dtype: int64
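As a more compact variant of the same cumsum() idea, here is a minimal sketch that skips the helper columns and looks up the first row whose running count reaches half of the total. It assumes the column names 'incagg' and 'wt' from above and uses the small counts from the question; the names cum, half and median_group are only illustrative.

```python
import pandas as pd

# Small example from the question: income group and number of observations
df = pd.DataFrame({"incagg": [1, 3, 5, 7, 9], "wt": [6, 9, 16, 17, 23]})
df = df.sort_values("incagg").reset_index(drop=True)

cum = df["wt"].cumsum()        # running count of observations
half = df["wt"].sum() / 2.0    # half of the total count

# idxmax() on a boolean Series returns the label of the first True,
# i.e. the first group whose running count reaches half of the total.
median_group = df.loc[(cum >= half).idxmax(), "incagg"]
print(median_group)
```

Like the flag columns, this picks a single group and does not average two groups when the halfway point lands exactly on a group boundary.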
The flag-based approach works fine when the median is unique and often when it isn't. A problem can only occur if the median is non-unique AND it falls on the edge of a group, that is, when the cumulative count at some group equals exactly half of the total. In this case (with 5 groups and weights in the millions/billions), it's really not a concern, but nevertheless here's a more general solution:
df['tmp1'] = df.wt.cumsum() == (df.wt.sum() / 2.)
df['tmp2'] = df.wt.cumsum() < (df.wt.sum() / 2.)
df['med_grp'] = (df.tmp2==False) & (df.tmp2.shift()==True)
df['med_grp'] = df.med_grp | df.tmp1.shift(fill_value=False)  # fill_value avoids a NaN in the first row
incagg wt tmp1 tmp2 med_grp
0 1 1 False True False
1 3 1 False True False
2 5 1 True False True
3 7 2 False False True
4 9 1 False False False
df.loc[df.med_grp, 'incagg']
2 5
3 7
df.loc[df.med_grp, 'incagg'].mean()
6.0
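The general case can also be wrapped in a small function. This is only a sketch under some assumptions: grouped_median is a hypothetical helper name, every count is assumed to be positive, and the counts are assumed to be integer-valued so that the exact comparison with half of the total is safe. It uses numpy.searchsorted to find the first group whose running count reaches half of the total and averages with the next group on an exact tie.

```python
import numpy as np

def grouped_median(values, counts):
    """Median of data given as (value, count) pairs; hypothetical helper."""
    values = np.asarray(values)
    counts = np.asarray(counts)
    order = np.argsort(values)          # make sure groups are in ascending order
    values, counts = values[order], counts[order]

    cum = np.cumsum(counts)             # running count of observations
    half = cum[-1] / 2.0                # half of the total count
    i = np.searchsorted(cum, half)      # first index with cum[i] >= half

    if cum[i] == half:
        # The two middle observations straddle groups i and i + 1
        return (values[i] + values[i + 1]) / 2.0
    return float(values[i])

print(grouped_median([1, 3, 5, 7, 9], [1, 1, 1, 2, 1]))      # tie case from the table above
print(grouped_median([1, 3, 5, 7, 9], [6, 9, 16, 17, 23]))   # small example from the question
```

Because it only touches the cumulative counts, this never materializes the individual observations, which matters when the counts run into the billions.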
Upvotes: 1
Reputation: 109626
You can use chain from itertools. I used a list comprehension to build one list per aggregation group, with the group repeated the appropriate number of times, and then used chain to flatten them into a single list. Finally, I converted it to a Series and calculated the median:
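from itertools import chain
import pandas as pd
df = pd.DataFrame([6, 9, 16, 17, 23], index=[1, 3, 5, 7, 9], columns=['counts'])
# Repeat each group label by its count, flatten, then take the ordinary median
median = pd.Series(list(chain.from_iterable([k] * v for k, v in zip(df.index, df['counts'])))).median()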
>>> median
7.0
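The same expansion can be sketched with numpy.repeat instead of chain, which avoids building intermediate Python lists. This is a swapped-in alternative, not the method used above, and the name expanded is only illustrative. It assumes integer counts, and it still creates one array element per observation, so it is only practical for small counts like these and not for counts in the billions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([6, 9, 16, 17, 23], index=[1, 3, 5, 7, 9], columns=['counts'])

# Repeat each group label (the index) as many times as its count
expanded = np.repeat(df.index.to_numpy(), df['counts'].to_numpy())

median = float(np.median(expanded))
print(median)
```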
Upvotes: 0