FooBar

Reputation: 16508

Pandas: Weighted median of grouped observations

I have a DataFrame that contains the number of observations per income group:

INCAGG
1         6.561681e+08
3         9.712955e+08
5         1.658043e+09
7         1.710781e+09
9         2.356979e+09

I would like to compute the median income group. What do I mean? Let's start with a simpler series:

INCAGG
1          6
3          9
5         16
7         17
9         23

It represents this set of numbers:

1 1 1 1 1 1
3 3 3 3 3 3 3 3 3
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 
7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9

Which I can reorder to

1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 7 7 7 7 7
7 7 7 7 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9

This shows visually what I mean: the median here would be 7.

Upvotes: 3

Views: 1725

Answers (2)

JohnE

Reputation: 30434

After glancing at a numpy example, I think cumsum() provides a good approach. Assuming your column of counts is called 'wt', here's a simple solution that works most of the time (see further down for a more general solution):

# DataFrame.sort was removed from pandas; sort_values is its replacement
df = df.sort_values('incagg')

df['tmp'] = df.wt.cumsum() < ( df.wt.sum() / 2. )

df['med_grp'] = (df.tmp==False) & (df.tmp.shift()==True)

The second line of code splits the rows into those whose cumulative weight is still below half of the total (True) and the rest (False). The median observation falls in the first False row, which is the row the third line flags.

   incagg          wt    tmp med_grp
0       1   656168100   True   False
1       3   971295500   True   False
2       5  1658043000   True   False
3       7  1710781000  False    True
4       9  2356979000  False   False

df.loc[df.med_grp,'incagg']

3    7
Name: incagg, dtype: int64
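
For reference, the steps above can be collected into one runnable snippet, shown here on the simpler series from the question. This is only a sketch: the column names `incagg` and `wt` follow the answer's assumption, the variable names are made up for illustration, and it relies on `sort_values`, `.loc` and `shift(fill_value=...)` from current pandas.

```python
import pandas as pd

# Small example from the question; 'wt' holds the number of observations per group.
df = pd.DataFrame({"incagg": [1, 3, 5, 7, 9], "wt": [6, 9, 16, 17, 23]})
df = df.sort_values("incagg").reset_index(drop=True)

# True for groups whose cumulative weight is still below half of the total.
below_half = df["wt"].cumsum() < df["wt"].sum() / 2.0

# The median group is the first group that is not below the halfway point.
# fill_value=True treats the (nonexistent) row before the first one as "below",
# so a median that falls in the very first group is still detected.
is_median_group = ~below_half & below_half.shift(fill_value=True)

median_group = df.loc[is_median_group, "incagg"]
print(median_group.tolist())
```

Like the simple solution above, this does not handle the case where the halfway point falls exactly on a group boundary.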

This works whenever the median is unique, and often when it isn't. A problem can only occur if the median is not unique AND it falls on the boundary between two groups, that is, when the cumulative weight of some group is exactly half of the total. With 5 groups and weights in the millions or billions that is really not a concern, but here is a more general solution nevertheless:

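# tmp1: cumulative weight is exactly half of the total (median falls on a group boundary)
df['tmp1']    = df.wt.cumsum() == (df.wt.sum() / 2.)
# tmp2: cumulative weight is still below half of the total
df['tmp2']    = df.wt.cumsum() < (df.wt.sum() / 2.)
df['med_grp'] = (df.tmp2==False) & (df.tmp2.shift()==True)
# in the boundary case, the group right after the boundary also belongs to the median
df['med_grp'] = df.med_grp | df.tmp1.shift(fill_value=False)

For example, with weights 1, 1, 1, 2, 1 (total 6, so the halfway point falls exactly at the end of group 5):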

   incagg  wt   tmp1   tmp2 med_grp
0       1   1  False   True   False
1       3   1  False   True   False
2       5   1   True  False    True
3       7   2  False  False    True
4       9   1  False  False   False

df.loc[df.med_grp,'incagg']
2    5
3    7

df.loc[df.med_grp,'incagg'].mean()
6.0
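
The same logic can also be written as a small standalone function. The following is a minimal sketch, not part of the original answer: `grouped_median` is a hypothetical helper name, it uses `numpy.searchsorted` on the cumulative counts instead of boolean columns, and it assumes every count is strictly positive and small enough to be represented exactly as a float.

```python
import numpy as np

def grouped_median(values, counts):
    """Median of a data set in which values[i] occurs counts[i] times.

    Hypothetical helper for illustration; assumes all counts are > 0.
    """
    values = np.asarray(values, dtype=float)
    counts = np.asarray(counts, dtype=float)

    # Sort the groups by value so the cumulative counts are meaningful.
    order = np.argsort(values)
    values, counts = values[order], counts[order]

    cum = np.cumsum(counts)
    half = cum[-1] / 2.0

    # Index of the first group whose cumulative count reaches the halfway point.
    i = int(np.searchsorted(cum, half, side="left"))

    # If the halfway point falls exactly on a group boundary, the two middle
    # observations sit in neighbouring groups, so average the two values.
    if cum[i] == half and i + 1 < len(values):
        return float((values[i] + values[i + 1]) / 2.0)
    return float(values[i])

print(grouped_median([1, 3, 5, 7, 9], [1, 1, 1, 2, 1]))
print(grouped_median([1, 3, 5, 7, 9], [6, 9, 16, 17, 23]))
```

The first call is the boundary example from the table above, the second is the simpler series from the question. Because only the cumulative counts are touched, nothing has to be expanded in memory, which matters for counts in the billions.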

Upvotes: 1

Alexander

Reputation: 109626

You can use chain from itertools. I used a list comprehension to repeat each aggregation group the appropriate number of times, then used chain to flatten the result into a single list. Finally, I converted it to a Series and calculated the median:

from itertools import chain

import pandas as pd

df = pd.DataFrame([6, 9, 16, 17, 23], index=[1, 3, 5, 7, 9], columns=['counts'])

median = pd.Series(list(chain(*[[k] * v for k, v in zip(df.index, df.counts)]))).median()

>>> median
7.0
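
The same expansion can be sketched with `numpy.repeat` instead of `chain`. This is an alternative to the code above, not the answer's own method, and it assumes the counts are integers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([6, 9, 16, 17, 23], index=[1, 3, 5, 7, 9], columns=['counts'])

# Repeat each index label as many times as its count, then take the plain median.
expanded = np.repeat(df.index.to_numpy(), df['counts'].to_numpy())
median = float(np.median(expanded))
print(median)
```

Note that any approach that expands the data builds one element per observation. That is fine for the small series, but with the counts from the question it would mean billions of elements, so a cumulative-sum approach is likely the more practical choice there.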

Upvotes: 0
