Reputation: 639
I have a DataFrame that looks like so:
import pandas as pd
import numpy as np

rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar', 'baz'] * 10,
                   'B': [rand.choice(['cat', 'dog', 'fish', 'pig', 'cow'])
                         for i in range(30)],
                   'C': 1})
>> df.head(5)
A B C
0 foo pig 1
1 bar cow 1
2 baz cat 1
3 foo dog 1
4 bar pig 1
I then group by the different combinations to get counts, which I order descending by group, like so:
>> d = df.groupby(['A','B']).sum();
>> d = d.groupby(level=0, group_keys=False).apply(lambda x: x.sort_values('C', ascending=False)); d
C
A B
bar dog 4
cow 2
fish 2
cat 1
pig 1
baz cow 4
cat 3
fish 2
dog 1
foo dog 4
cow 3
pig 2
cat 1
What I want now is, for each group in A, to keep the top 2 and summarise the rest as "Other". I have a function summarise() which sort of works:
def summarise(l, n=10, name='Other'):
    h = l.head(n)
    idx = l.index[0]
    if isinstance(idx, (list, tuple)):
        prefix = list(idx[:-1])
    else:
        prefix = []
    return h.append(pd.DataFrame([l.tail(-n).sum()],
                                 columns=l.columns,
                                 index=[tuple(prefix + [name])]))
>> summarise(d, n=2)
C
A B
bar dog 4
cow 2
Other 24
But if I try to use apply to do it for each group, it blows up. It seems that the function gets passed a Series instead?
The output I want is the following:
A B C
bar dog 4
bar cow 2
bar Other 4
baz cow 4
baz cat 3
baz Other 3
foo dog 4
foo cow 3
foo Other 3
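For comparison, a variant of summarise can survive the per-group apply if the "Other" row is built as a proper MultiIndex entry rather than a bare tuple. The sketch below (summarise_group is a renamed variant, not the original) uses pd.concat, since DataFrame.append has since been removed from pandas; the counts are entered literally so it runs standalone:

```python
import pandas as pd

# The sorted per-group counts from above, entered literally.
idx = pd.MultiIndex.from_tuples(
    [('bar', 'dog'), ('bar', 'cow'), ('bar', 'fish'), ('bar', 'cat'), ('bar', 'pig'),
     ('baz', 'cow'), ('baz', 'cat'), ('baz', 'fish'), ('baz', 'dog'),
     ('foo', 'dog'), ('foo', 'cow'), ('foo', 'pig'), ('foo', 'cat')],
    names=['A', 'B'])
d = pd.DataFrame({'C': [4, 2, 2, 1, 1, 4, 3, 2, 1, 4, 3, 2, 1]}, index=idx)

def summarise_group(g, n=2, name='Other'):
    # g is the sub-frame for one value of A; keep the first n rows and
    # collapse the remainder into a single row labelled `name`.
    head = g.head(n)
    rest = g.tail(-n)  # everything after the first n rows
    if rest.empty:
        return head
    other = pd.DataFrame([rest.sum()],
                         index=pd.MultiIndex.from_tuples(
                             [(g.index[0][0], name)], names=g.index.names),
                         columns=g.columns)
    return pd.concat([head, other])

out = d.groupby(level=0, group_keys=False).apply(summarise_group, n=2)
```

The "Other" row carries the same MultiIndex names as the group it summarises, so the concatenated result keeps a clean (A, B) index.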
I'd have thought that d.groupby('A').tail(-2).sum()
would work, but it doesn't do what I expect.
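The reason it doesn't: groupby(...).tail(-2) does drop the first two rows of every group, but the trailing .sum() then runs over the whole remainder at once, giving a single grand total instead of one value per group. A small sketch using the same counts as above:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('bar', 'dog'), ('bar', 'cow'), ('bar', 'fish'), ('bar', 'cat'), ('bar', 'pig'),
     ('baz', 'cow'), ('baz', 'cat'), ('baz', 'fish'), ('baz', 'dog'),
     ('foo', 'dog'), ('foo', 'cow'), ('foo', 'pig'), ('foo', 'cat')],
    names=['A', 'B'])
d = pd.DataFrame({'C': [4, 2, 2, 1, 1, 4, 3, 2, 1, 4, 3, 2, 1]}, index=idx)

# tail(-2) keeps the leftover rows of every group, but the .sum() that
# follows collapses them all into one number (4 + 3 + 3):
grand_total = d.groupby(level='A').tail(-2)['C'].sum()

# For one "Other" value per group, the sum has to happen inside the groupby:
per_group = d.groupby(level='A').apply(lambda g: g.tail(-2)['C'].sum())
```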
Edit: Thanks to the answers, I've come up with the following function that should help people in the future. It's a bit annoying that the one-column and multi-column cases differ, but so be it. It supports a top N per group, but also a cut-off percentage. With this function I can slice and dice my data in many ways easily.
def top_per_group(df, cols, n=None, p=None, name='Other'):
    d = df.groupby(cols).size().sort_values(ascending=False)
    if len(cols) > 1:
        # sort_index replaces the since-removed sortlevel
        d = d.sort_index(level=0, sort_remaining=False)
    d = d.reset_index()  # the count column is named 0
    if n:
        if len(cols) > 1:
            sel_list = d.groupby(cols[:-1]).cumcount() < n
        else:
            sel_list = d.index < n
    else:
        if len(cols) > 1:
            sel_list = d.groupby(cols[:-1])[0].transform(lambda x: x / x.sum()) >= p
        else:
            sel_list = d[0].div(d[0].sum()) >= p
    grouper = d[cols[-1]].where(sel_list, name)
    return d.groupby(cols[:-1] + [grouper], sort=False)[0].sum().reset_index()
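A quick sanity check of the helper on a small hand-built frame. The function is restated here so the block runs standalone on current pandas: sort_index stands in for the removed sortlevel, and transform computes the per-group share.

```python
import pandas as pd

def top_per_group(df, cols, n=None, p=None, name='Other'):
    d = df.groupby(cols).size().sort_values(ascending=False)
    if len(cols) > 1:
        d = d.sort_index(level=0, sort_remaining=False)
    d = d.reset_index()  # the count column is named 0
    if n:
        if len(cols) > 1:
            sel_list = d.groupby(cols[:-1]).cumcount() < n
        else:
            sel_list = d.index < n
    else:
        if len(cols) > 1:
            sel_list = d.groupby(cols[:-1])[0].transform(lambda x: x / x.sum()) >= p
        else:
            sel_list = d[0].div(d[0].sum()) >= p
    grouper = d[cols[-1]].where(sel_list, name)
    return d.groupby(cols[:-1] + [grouper], sort=False)[0].sum().reset_index()

# Toy data with an unambiguous winner per group, so n=1 is deterministic.
toy = pd.DataFrame({
    'A': ['bar'] * 6 + ['foo'] * 4,
    'B': ['dog', 'dog', 'dog', 'cow', 'cow', 'cat',
          'pig', 'pig', 'cat', 'cow'],
})
top_n = top_per_group(toy, ['A', 'B'], n=1)    # keep the top 1 per group
top_p = top_per_group(toy, ['A', 'B'], p=0.5)  # keep shares >= 50%
```

Here dog is 3 of bar's 6 rows (exactly 50%) and pig is 2 of foo's 4, so the n=1 and p=0.5 calls pick the same winners.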
Upvotes: 1
Views: 282
Reputation:
If you reset the index, you can create a grouper on the cumulative count:
d = d.reset_index()
grouper = d['B'].where(d.groupby('A').cumcount()<2, 'Other')
d.groupby(['A', grouper], sort=False).sum()
Out:
C
A B
bar dog 4
cow 2
Other 4
baz cow 4
cat 3
Other 3
foo dog 4
cow 3
Other 3
Or, with reset_index:
d.groupby(['A', grouper], sort=False).sum().reset_index()
Out:
A B C
0 bar dog 4
1 bar cow 2
2 bar Other 4
3 baz cow 4
4 baz cat 3
5 baz Other 3
6 foo dog 4
7 foo cow 3
8 foo Other 3
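On current pandas the same steps still work with one tweak: select C explicitly before summing, since newer versions no longer silently drop the non-numeric B column. A self-contained restatement, starting again from the raw frame in the question:

```python
import pandas as pd
import numpy as np

rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar', 'baz'] * 10,
                   'B': [rand.choice(['cat', 'dog', 'fish', 'pig', 'cow'])
                         for i in range(30)],
                   'C': 1})

# Count per (A, B), order descending within each A group, then flatten.
d = (df.groupby(['A', 'B'])['C'].sum()
       .sort_values(ascending=False)
       .sort_index(level=0, sort_remaining=False)
       .reset_index())

# Relabel everything past the top 2 of each group as "Other" and re-aggregate.
grouper = d['B'].where(d.groupby('A').cumcount() < 2, 'Other')
result = d.groupby(['A', grouper], sort=False)['C'].sum().reset_index()
```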
Upvotes: 2