Reputation: 831
The data set I'm working with is here. It's a data set of Chipotle orders. There is an `item_name` for each type of item sold, and within each `item_name` there are multiple possible `choice_description`s describing exactly what is in the item.
So let's import the file first and clean it up a bit:
import pandas as pd

chipo = pd.read_csv('chipotle.tsv', sep='\t')
del chipo['order_id']
del chipo['quantity']
# strip the literal '$' prefix before converting to float
chipo['revenue'] = chipo['item_price'].str.replace('$', '', regex=False).astype(float)
chipo['count'] = 1
I'm trying to tabulate how many of each `choice_description` is ordered per `item_name`. This is easy enough with a groupby and then summing the columns (including the column of ones),
grouped = chipo.groupby(['item_name','choice_description']).sum()
but now I want to sort the counts within `item_name`. Oh, also I want to keep the `item_price` column for possible later calculation, so I can't just convert the DataFrame to a one-column frame. For the life of me I can't figure out how to do this.
I can't just call `grouped.sort_values('count')`, because that sorts over all observations, not just within the `item_name` groups. How can I sort by that column while keeping the sort constrained within each group?
Even though there seem to be previous answers discussing this (e.g. here), all of them either use deprecated sort commands (e.g. `Series.order()` or `DataFrame.sort()`) or don't do exactly what I need. How can I perform this seemingly simple task?
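To make the goal concrete, here's a toy sketch (made-up data, not the real Chipotle file) of the behavior I'm after:

```python
import pandas as pd

# made-up stand-in for the grouped Chipotle data
df = pd.DataFrame({
    'item_name': ['A', 'A', 'A', 'B', 'B'],
    'choice_description': ['x', 'y', 'z', 'x', 'y'],
    'count': [1, 5, 3, 4, 2],
})
grouped = df.groupby(['item_name', 'choice_description']).sum()

# a plain sort interleaves the groups:
print(grouped.sort_values('count'))
# what I want instead: all of A's rows together (sorted by count),
# then all of B's rows together (sorted by count), and so on.
```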
Upvotes: 2
Views: 1132
Reputation: 879201
How can I just sort by that column but keep the sorting constrained within the group?
Sort by both `item_name` and `count`. Since `item_name` is an index level and `count` is a column, we must either move `item_name` into a column and call `DataFrame.sort_values`, or move `count` into the index and call `DataFrame.sortlevel`.
For example,
import pandas as pd

chipo = pd.read_csv('chipotle.tsv', sep='\t')
del chipo['order_id']
del chipo['quantity']
# strip the literal '$' prefix before converting to float
chipo['revenue'] = chipo['item_price'].str.replace('$', '', regex=False).astype(float)
chipo['count'] = 1

result = chipo.groupby(['item_name', 'choice_description']).sum()
# move `count` into the index, sort the two levels, then restore `count` as a column
result = (result.set_index('count', append=True)
                .sortlevel(axis=0, level=['item_name', 'count'], ascending=False)
                .reset_index('count', drop=False))
yields
                                                                       revenue  count
item_name          choice_description
Veggie Salad Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Lettuc... 11.25 1
[Fresh Tomato Salsa, [Fajita Vegetables, Rice, ... 11.25 1
[Fresh Tomato Salsa, [Fajita Vegetables, Rice, ... 11.25 1
[Fresh Tomato Salsa, [Rice, Lettuce, Guacamole,... 11.25 1
...
Barbacoa Bowl [Roasted Chili Corn Salsa, [Fajita Vegetables, ... 23.50 2
[Fresh Tomato Salsa, [Fajita Vegetables, Rice, ... 9.25 1
[Fresh Tomato Salsa, [Fajita Vegetables, Rice, ... 9.25 1
[Tomatillo Red Chili Salsa, [Fajita Vegetables,... 11.75 1
[Tomatillo Red Chili Salsa, [Rice, Black Beans,... 9.25 1
[Tomatillo Red Chili Salsa, [Rice, Cheese, Lett... 9.25 1
[Tomatillo Red Chili Salsa, [Rice, Pinto Beans,... 9.25 1
[[Tomatillo-Green Chili Salsa (Medium), Roasted... 11.48 1
[[Tomatillo-Red Chili Salsa (Hot), Tomatillo-Gr... 8.99 1
6 Pack Soft Drink [Diet Coke] 19.47 3
[Coke] 6.49 1
[Sprite] 6.49 1
[314 rows x 2 columns]
Of the two options above, using `sortlevel` is a bit quicker than `sort_values`.
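(A note beyond the original answer: `sortlevel` was deprecated in pandas 0.20 and removed in later releases; `DataFrame.sort_index(level=...)` is its replacement. A minimal sketch of the equivalent call on toy data, assuming a pandas version with that API:)

```python
import numpy as np
import pandas as pd

np.random.seed(0)
toy = pd.DataFrame(np.random.randint(10, size=(50, 4)), columns=list('ABCD'))
totals = toy.groupby(['A', 'B']).sum()

# sort_index(level=...) is the modern replacement for the deprecated sortlevel
result = (totals.set_index('C', append=True)
                .sort_index(level=['A', 'C'], ascending=False)
                .reset_index('C', drop=False))
```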
Here is a benchmark:
In [73]: %timeit using_sortlevel(totals)
10 loops, best of 3: 148 ms per loop
In [74]: %timeit using_sort_values(totals)
10 loops, best of 3: 174 ms per loop
using this setup:
import numpy as np
import pandas as pd

N = 10**6
chipo = pd.DataFrame(np.random.randint(1000, size=(N, 4)), columns=list('ABCD'))
totals = chipo.groupby(['A', 'B']).sum()

def using_sortlevel(df):
    # move C into the index, sort levels A and C (descending), restore C
    return (df.set_index('C', append=True)
              .sortlevel(axis=0, level=['A', 'C'], ascending=False)
              .reset_index('C', drop=False))

def using_sort_values(df):
    # move A out of the index, sort by the columns, then rebuild the MultiIndex
    return (df.reset_index('A')
              .sort_values(by=['A', 'C'], ascending=False)
              .set_index('A', append=True)
              .swaplevel(0, 1, axis=0))
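As a quick sanity check, not part of the original benchmark: the `sort_values` helper's output can be verified to be sorted within each `A` group (`sortlevel` is gone from current pandas, so only the second helper is exercised here). A sketch with a smaller `N`:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
N = 10**4  # smaller than the benchmark's 10**6; enough for a correctness check
chipo = pd.DataFrame(np.random.randint(1000, size=(N, 4)), columns=list('ABCD'))
totals = chipo.groupby(['A', 'B']).sum()

out = (totals.reset_index('A')
             .sort_values(by=['A', 'C'], ascending=False)
             .set_index('A', append=True)
             .swaplevel(0, 1, axis=0))

# within every A group, C should be non-increasing
within_sorted = (out.groupby(level='A', sort=False)['C']
                    .apply(lambda s: s.is_monotonic_decreasing)
                    .all())
print(within_sorted)  # prints True
```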
Upvotes: 2