kleptog

Reputation: 639

In pandas, how do I take the top N rows of each group in a DataFrame and summarise the rest?

I have a DataFrame that looks like so:

import pandas as pd
import numpy as np
rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar', 'baz'] * 10,
                   'B': [rand.choice(['cat', 'dog', 'fish', 'pig', 'cow']) for i in range(30)],
                   'C': 1})

>> df.head(5)
     A    B  C
0  foo  pig  1
1  bar  cow  1
2  baz  cat  1
3  foo  dog  1
4  bar  pig  1

I then group by the different combinations to get counts, which I order descending by group, like so:

>> d = df.groupby(['A','B']).sum();
>> d = d.groupby(level=0, group_keys=False).apply(lambda x: x.sort_values('C', ascending=False)); d

          C
A   B      
bar dog   4
    cow   2
    fish  2
    cat   1
    pig   1
baz cow   4
    cat   3
    fish  2
    dog   1
foo dog   4
    cow   3
    pig   2
    cat   1
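
For what it's worth, the `apply` step can be avoided: sort the whole frame by C first, then re-sort on the A index level only. In practice the level sort keeps ties in their existing order, so the descending counts survive within each group (a sketch; needs pandas 0.20+ for `sort_index` with `level`):

```python
import numpy as np
import pandas as pd

rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar', 'baz'] * 10,
                   'B': [rand.choice(['cat', 'dog', 'fish', 'pig', 'cow']) for i in range(30)],
                   'C': 1})

# Sort everything by count, then re-sort on A only; within each A group
# the rows stay in descending-C order.
d = (df.groupby(['A', 'B']).sum()
       .sort_values('C', ascending=False)
       .sort_index(level=0, sort_remaining=False))
```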

What I want now is, for each group in A, to keep the top 2 and summarise the rest as "Other". I have a function summarise() which sort of works:

def summarise(l, n=10, name='Other'):
    # Top n rows of the group.
    h = l.head(n)

    # Reuse all but the last index level for the "Other" row, so it stays
    # under the same group as the rows it summarises.
    idx = l.index[0]
    if isinstance(idx, (list, tuple)):
        prefix = list(idx[:-1])
    else:
        prefix = []
    # l.tail(-n) is everything below the top n; DataFrame.append is gone
    # in pandas 2.0, so concatenate instead.
    other = pd.DataFrame([l.tail(-n).sum()], columns=l.columns,
                         index=[tuple(prefix + [name])])
    return pd.concat([h, other])

>> summarise(d, n=2)
            C
A   B        
bar dog     4
    cow     2
    Other  24

But if I try to use apply to do it for each group it blows up. It seems that the function gets passed a Series instead?

The output I want is the following:

  A      B  C
bar    dog  4
bar    cow  2
bar  Other  4
baz    cow  4
baz    cat  3
baz  Other  3
foo    dog  4
foo    cow  3
foo  Other  3

I'd have thought that d.groupby('A').tail(-2).sum() would work, but it doesn't do what I expect.
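
For the record, the tail(-2) intuition is close: on pandas 1.2+, GroupBy.tail(-2) does drop the top two rows of each group, but the trailing .sum() then adds everything up across groups into one number. Re-grouping the remainder first gives the per-group "Other" totals (a sketch on a hand-built copy of the table above):

```python
import pandas as pd

# Hand-built copy of the sorted, grouped counts shown above.
idx = pd.MultiIndex.from_tuples(
    [('bar', 'dog'), ('bar', 'cow'), ('bar', 'fish'), ('bar', 'cat'), ('bar', 'pig'),
     ('baz', 'cow'), ('baz', 'cat'), ('baz', 'fish'), ('baz', 'dog'),
     ('foo', 'dog'), ('foo', 'cow'), ('foo', 'pig'), ('foo', 'cat')],
    names=['A', 'B'])
d = pd.DataFrame({'C': [4, 2, 2, 1, 1, 4, 3, 2, 1, 4, 3, 2, 1]}, index=idx)

rest = d.groupby(level=0).tail(-2)   # everything below the top 2 of each A group
# rest['C'].sum() would collapse to a single number (10);
# re-grouping keeps the per-group totals instead.
other = rest.groupby(level=0).sum()
```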

Edit: Thanks to the answers, I've come up with the following function, which should help people in the future. It's a bit annoying that the one-column and multi-column cases are different, but so be it. It supports a top-N per group as well as a cut-off percentage, so I can slice and dice my data in many ways easily.

def top_per_group(df, cols, n=None, p=None, name='Other'):
    # Count each combination, largest counts first within each group.
    d = df.groupby(cols).size().sort_values(ascending=False)
    if len(cols) > 1:
        # sortlevel is gone in pandas 1.0; a stable sort_index on the group
        # level keeps the descending counts within each group.
        d = d.sort_index(level=0, sort_remaining=False, kind='mergesort')

    d = d.reset_index()

    if n:
        # Top-N: keep the first n rows of each group.
        if len(cols) > 1:
            sel_list = d.groupby(cols[:-1]).cumcount() < n
        else:
            sel_list = d.index < n
    else:
        # Cut-off: keep rows holding at least fraction p of their group.
        if len(cols) > 1:
            sel_list = d[0] / d.groupby(cols[:-1])[0].transform('sum') >= p
        else:
            sel_list = d[0].div(d[0].sum()) >= p

    # Everything not selected collapses into a single "Other" bucket;
    # numeric_only=True keeps modern pandas from summing the label column.
    grouper = d[cols[-1]].where(sel_list, name)
    return d.groupby(cols[:-1] + [grouper], sort=False).sum(numeric_only=True).reset_index()
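
The percentage branch may be easier to follow in isolation. This sketch keeps every B that holds at least 30% of its A group (the 0.3 threshold is just for illustration), using transform for the per-group totals:

```python
import numpy as np
import pandas as pd

rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar', 'baz'] * 10,
                   'B': [rand.choice(['cat', 'dog', 'fish', 'pig', 'cow']) for i in range(30)],
                   'C': 1})

# Counts per (A, B), descending within each A group, as a flat frame.
d = df.groupby(['A', 'B']).size().sort_values(ascending=False)
d = d.sort_index(level=0, sort_remaining=False, kind='mergesort').reset_index()

# Share of each A group held by each B; keep combinations at or above 30%.
frac = d[0] / d.groupby('A')[0].transform('sum')
grouper = d['B'].where(frac >= 0.3, 'Other')
out = d.groupby(['A', grouper], sort=False)[0].sum().reset_index()
```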

Upvotes: 1

Views: 283

Answers (1)

user2285236

Reputation:

If you reset the index, you can create a grouper on the cumulative count:

d = d.reset_index()
grouper = d['B'].where(d.groupby('A').cumcount() < 2, 'Other')
d.groupby(['A', grouper], sort=False).sum(numeric_only=True)
Out: 
           C
A   B       
bar dog    4
    cow    2
    Other  4
baz cow    4
    cat    3
    Other  3
foo dog    4
    cow    3
    Other  3

Or, with reset_index:

d.groupby(['A', grouper], sort=False).sum(numeric_only=True).reset_index()
Out: 
     A      B  C
0  bar    dog  4
1  bar    cow  2
2  bar  Other  4
3  baz    cow  4
4  baz    cat  3
5  baz  Other  3
6  foo    dog  4
7  foo    cow  3
8  foo  Other  3

Upvotes: 2
