sfjac
sfjac

Reputation: 7294

Pandas groupby sort within groups retaining multiple aggregates

I want to apply a sort and a limit within the groups returned by groupby, as in this question. However, I have multiple aggregates and I want all of the aggregates to be retained in the result.

Here is a simple example:

products = ["A", "B", "C", "D"]
stores = ["foo", "bar", "baz"]
n = 30

product_list = [products[i] for i in np.random.randint(0, len(products), n)]
store_list = [stores[i] for i in np.random.randint(0, len(stores), n)]
rating_list = np.random.random(n) * 5
sales_list = np.random.random(n) * 10000

df = pd.DataFrame(
    {'store': store_list, 
     'product': product_list, 
     'sales': sales_list, 
     'rating': rating_list})

df = df[['store', 'product', 'sales', 'rating']]

df[:5]

ungrouped dataframe

I want to group by store and product, and both sum and count the sales, while taking the mean of the rating.

This is straightforward:

dfg = df.groupby(['store', 'product']).agg({'sales': ['sum', 'count'], 
                                            'rating': 'mean'})

grouped

Now, I want to only keep the top two rated rows in each group. I can get this as follows (using a multi-level extension of the somewhat unintuitive [to me] incantations from 1):

g = dfg[('rating', 'mean')].groupby(
      level=0, group_keys=False).apply(
        lambda x: x.sort_values(ascending=False).head(2))
g

This returns the following Series:

store  product
bar    B          3.601135
       A          1.867449
baz    B          2.984196
       D          2.780500
foo    B          3.767912
       D          3.129346
Name: (rating, mean), dtype: float64

But I've lost the ('sales', 'sum') and ('sales', 'count') columns.

I suspect I need to extract g.index and use it in some way but have not been able to get this sorted (pun intended).

EDIT: The answers below manage to give the groups that I'm looking for, but what I'd really like is a stable sort where not only do I get the top N mean ratings within each group, but the groups themselves are sorted so that the first group has the highest rating, etc. To some extent this is just icing on the cake as I have the values I want now and would just like the report to be prettier.

Upvotes: 5

Views: 12532

Answers (2)

cs95
cs95

Reputation: 402323

This is how you'd do it, using sort_values + groupby + head -

dfg.sort_values(('rating', 'mean'), ascending=False)\
   .groupby(level=0)\
   .head(2)\
   .sort_index()

                 rating         sales      
                   mean           sum count
store product                              
bar   B        4.388521    636.813757     1
      C        3.931341   1843.772878     1
baz   A        2.501077  23110.162196     4
      B        3.339784  10610.257660     2
foo   B        2.785306  10315.968161     2
      D        2.160556  31084.181719     5

The final sort_index call sorts out the index-out-of-order problem (pun intended).

Upvotes: 3

sfjac
sfjac

Reputation: 7294

I've sorted it. Instead of indexing the grouped table and doing the subsequent groupby and sort_values as above, I needed to apply the sort_values to the un-indexed DataFrame, specifying the column to sort on explicitly:

g = dfg.groupby(level=0, group_keys=False).apply(
      lambda x: x.sort_values(('rating', 'mean'), ascending=False).head(2))

Giving me the desired result:

enter image description here

Upvotes: 5

Related Questions