Reputation: 193
I have a dataframe with name(person_name), color(shirt_color) as columns Each person wears a shirt with a certain color on a particular day (number of days can be arbitrary)
eg input:
name color
----------------
John White
John White
John Blue
John Blue
John White
Tom Blue
Tom Blue
Tom Green
Tom Black
Jerry Black
Jerry Blue
Jerry Black
I need to find the best colored shirt worn by each person, according to best_color_order eg result:
name color
-------------
Jerry Blue
John White
Tom Green
I am doing the following :
best_color_order = ['White', 'Green', 'Blue', 'Black']
best_color_list = [[name, list(group['color']).sort(key=best_color_order.index)[0]]
for name, group in df.groupby(by='name', sort=False, as_index=False)]
best_color_df = pd.DataFrame(best_color_list, columns=['name', 'color'])
Is there a faster way to do this if I have millions of records?
Upvotes: 3
Views: 1000
Reputation: 51395
Convert the column color
to an ordered categorical so it will sort in your desired order, then sort the values by color
, and finally groupby
and take the first
value for each name:
best_color_order = ['White', 'Green', 'Blue', 'Black']
df['color'] = pd.Categorical(df['color'], categories = best_color_order, ordered=True)
df.sort_values('color').groupby('name').first()
color
name
Jerry Blue
John White
Tom Green
[EDIT]: A faster way might be to do the same, but instead of groupby, just drop the duplicate name
s and keep the first (which is the default for the function drop_duplicates
):
best_color_order = ['White', 'Green', 'Blue', 'Black']
df['color'] = pd.Categorical(df['color'], categories = best_color_order, ordered=True)
df.sort_values('color').drop_duplicates('name')
name color
0 John White
7 Tom Green
10 Jerry Blue
Upvotes: 3