gogurt
gogurt

Reputation: 831

Pandas: sort within groupby on a particular column

The data set I'm working with is here. It's a data set of Chipotle orders. There are item_name for each type of item sold, and then within item_name there are multiple possible choice_descriptions describing exactly what is in the item.

So let's import the file first and clean it up a bit:

chipo = pd.read_csv('chipotle.tsv', sep='\t')
del chipo['order_id']
del chipo['quantity']
chipo['revenue'] = chipo['item_price'].str.replace('$','').astype(float)
chipo['count'] = 1

I'm trying to tabulate how many of each choice_description is ordered per team item_name. This is easy enough by using a groupby and then summing the columns (including the column of ones),

grouped = chipo.groupby(['item_name','choice_description']).sum()

but now I want to sort the counts within 'item_name'. Oh, also I want to keep the item_price column for possible later calculation, so I can't just convert the DataFrame to a one-column frame. For the life of me I can't figure out how to do this.

I can't just call grouped.sort_values('count') because it sorts over all observations, not just within the item_name groups. How can I just sort by that column but keep the sorting constrained within the group?

Even though there seem to previous answers discussing this (e.g. here), all of them either use deprecated sort commands (e.g. Series.order() or DataFrame.sort()) or don't do exactly what I need. How can I perform this seemingly simple task?

Upvotes: 2

Views: 1132

Answers (1)

unutbu
unutbu

Reputation: 879201

How can I just sort by that column but keep the sorting constrained within the group?

Sort by both item_name and count. Since item_name is an index level, and count is a column, we must either move item_name to a column and call DataFrame.sort_values or move count into the index and call DataFrame.sortlevel.

For example

import pandas as pd
chipo = pd.read_csv('chipotle.tsv', sep='\t')
del chipo['order_id']
del chipo['quantity']
chipo['revenue'] = chipo['item_price'].str.replace('$','').astype(float)
chipo['count'] = 1
result = chipo.groupby(['item_name','choice_description']).sum()

result = (result.set_index('count', append=True)
            .sortlevel(axis=0, level=['item_name', 'count'], ascending=False)
            .reset_index('count', drop=False))

yields

                                                                          revenue  \
                                                                      revenue  count
item_name             choice_description                                                
Veggie Salad Bowl     [Fresh Tomato Salsa, [Fajita Vegetables, Lettuc...    11.25      1
                      [Fresh Tomato Salsa, [Fajita Vegetables, Rice, ...    11.25      1
                      [Fresh Tomato Salsa, [Fajita Vegetables, Rice, ...    11.25      1
                      [Fresh Tomato Salsa, [Rice, Lettuce, Guacamole,...    11.25      1
...                      
Barbacoa Bowl         [Roasted Chili Corn Salsa, [Fajita Vegetables, ...    23.50      2
                      [Fresh Tomato Salsa, [Fajita Vegetables, Rice, ...     9.25      1
                      [Fresh Tomato Salsa, [Fajita Vegetables, Rice, ...     9.25      1
                      [Tomatillo Red Chili Salsa, [Fajita Vegetables,...    11.75      1
                      [Tomatillo Red Chili Salsa, [Rice, Black Beans,...     9.25      1
                      [Tomatillo Red Chili Salsa, [Rice, Cheese, Lett...     9.25      1
                      [Tomatillo Red Chili Salsa, [Rice, Pinto Beans,...     9.25      1
                      [[Tomatillo-Green Chili Salsa (Medium), Roasted...    11.48      1
                      [[Tomatillo-Red Chili Salsa (Hot), Tomatillo-Gr...     8.99      1
6 Pack Soft Drink     [Diet Coke]                                           19.47      3
                      [Coke]                                                 6.49      1
                      [Sprite]                                               6.49      1

[314 rows x 2 columns]

Of the two options above, using sortlevel is a bit quicker than sort_values. Here is a benchmark:

In [73]: %timeit using_sortlevel(totals)
10 loops, best of 3: 148 ms per loop

In [74]: %timeit using_sort_values(totals)
10 loops, best of 3: 174 ms per loop

using this setup:

import pandas as pd
N = 10**6
chipo = pd.DataFrame(np.random.randint(1000, size=(N,4)), columns=list('ABCD'))
totals = chipo.groupby(['A','B']).sum()

def using_sortlevel(df):
    return (df.set_index('C', append=True)
            .sortlevel(axis=0, level=['A', 'C'], ascending=False)
            .reset_index('C', drop=False))

def using_sort_values(df):
    return (df.reset_index('A')
            .sort_values(by=['A', 'C'], ascending=False)
            .set_index('A', append=True)
            .swaplevel(0, 1, axis=0))

Upvotes: 2

Related Questions