Faster way to perform sort by index on pandas group

Question

I have a dataframe with name(person_name), color(shirt_color) as columns Each person wears a shirt with a certain color on a particular day (number of days can be arbitrary)

eg input:

name    color
----------------
John    White
John    White
John    Blue
John    Blue
John    White
Tom     Blue
Tom     Blue
Tom     Green
Tom     Black
Jerry   Black
Jerry   Blue
Jerry   Black

I need to find the best colored shirt worn by each person, according to best_color_order eg result:

name    color
-------------
Jerry   Blue
John    White
Tom     Green

I am doing the following :

best_color_order = ['White', 'Green', 'Blue', 'Black']

best_color_list = [[name, list(group['color']).sort(key=best_color_order.index)[0]]
                    for name, group in df.groupby(by='name', sort=False, as_index=False)]

best_color_df = pd.DataFrame(best_color_list, columns=['name', 'color'])

Is there a faster way to do this if I have millions of records?

sacuL · Accepted Answer

Convert the column color to an ordered categorical so it will sort in your desired order, then sort the values by color, and finally groupby and take the first value for each name:

best_color_order = ['White', 'Green', 'Blue', 'Black']

df['color'] = pd.Categorical(df['color'], categories = best_color_order, ordered=True)

df.sort_values('color').groupby('name').first()

       color
name        
Jerry   Blue
John   White
Tom    Green

[EDIT]: A faster way might be to do the same, but instead of groupby, just drop the duplicate names and keep the first (which is the default for the function drop_duplicates):

best_color_order = ['White', 'Green', 'Blue', 'Black']

df['color'] = pd.Categorical(df['color'], categories = best_color_order, ordered=True)

df.sort_values('color').drop_duplicates('name')

     name  color
0    John  White
7     Tom  Green
10  Jerry   Blue

Faster way to perform sort by index on pandas group

Answers (1)

Related Questions