trey hannam

Reputation: 263

Faster Way to Categorize Unique Row Combinations based on a Condition in Python Pandas?

I have a dataset that has the same layout as the one below. I want to categorize where each person went in their car for a given month.

Date    Visitor  Car Name  Location
6-2022  Michael  Chrysler  Work
6-2022  Jim      Subaru    Home
6-2022  Jim      Subaru    Work
6-2022  Michael  Chrysler  Karate
7-2022  Michael  Chrysler  Work
7-2022  Jim      Subaru    Home
7-2022  Jim      Subaru    Work

The solution would look something like this:

Date    Visitor  Car Name  Location
6-2022  Michael  Chrysler  Work & Karate
6-2022  Jim      Subaru    Work & Home
7-2022  Michael  Chrysler  Work
7-2022  Jim      Subaru    Work & Home

I have a working solution, but its runtime grows quickly as the number of rows increases. The fastest I have gotten it to run on ~32,000 rows is ~15 minutes.

Upvotes: 0

Views: 97

Answers (2)

Scott Boston

Reputation: 153460

Try this:

df.groupby(['Date', 'Visitor', 'Car Name'], as_index=False)['Location'].agg(' & '.join)

Output:

     Date  Visitor  Car Name       Location
0  6-2022      Jim    Subaru    Home & Work
1  6-2022  Michael  Chrysler  Work & Karate
2  7-2022      Jim    Subaru    Home & Work
3  7-2022  Michael  Chrysler           Work
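For reference, a runnable sketch that builds the question's sample data by hand and applies the one-liner above. A single vectorized groupby makes one pass over the frame instead of looping row by row, which is why it scales far better than a per-row approach:

```python
import pandas as pd

# Sample data, reconstructed from the question's table.
df = pd.DataFrame({
    'Date':     ['6-2022', '6-2022', '6-2022', '6-2022', '7-2022', '7-2022', '7-2022'],
    'Visitor':  ['Michael', 'Jim', 'Jim', 'Michael', 'Michael', 'Jim', 'Jim'],
    'Car Name': ['Chrysler', 'Subaru', 'Subaru', 'Chrysler', 'Chrysler', 'Subaru', 'Subaru'],
    'Location': ['Work', 'Home', 'Work', 'Karate', 'Work', 'Home', 'Work'],
})

# One groupby over the key columns; ' & '.join concatenates each group's locations.
res = df.groupby(['Date', 'Visitor', 'Car Name'], as_index=False)['Location'].agg(' & '.join)
print(res)
```

Note that groupby sorts the keys by default, which is why Jim's rows appear before Michael's in the output.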

Upvotes: 1

ouroboros1

Reputation: 14109

You could try as follows:

  • Use df.groupby on Date, Visitor, Car Name, and apply join to the values of Location.
  • To change the last , delimiter to an ampersand, you could use Series.replace with a regex pattern.
res = df.groupby(['Date','Visitor','Car Name'], 
                 as_index=False, sort=False)['Location'].agg(', '.join)

# changing each last `,` delimiter into an `&`
res['Location'] = res['Location'].replace(r',\s(?=[^,]*$)',' & ', regex=True)

print(res)

     Date  Visitor  Car Name       Location
0  6-2022  Michael  Chrysler  Work & Karate
1  6-2022      Jim    Subaru    Home & Work
2  7-2022  Michael  Chrysler           Work
3  7-2022      Jim    Subaru    Home & Work

If you expect duplicates per group (and want to get rid of them), you could use .agg(lambda x: ', '.join(set(x))) instead of .agg(', '.join). Mind you, set does not preserve order; not sure if that is a problem for you. If it is, there are of course also ways to remove duplicates from a list while maintaining order.
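One order-preserving option is dict.fromkeys, which keeps the first occurrence of each value (insertion order is guaranteed on Python 3.7+). A minimal sketch, using a tiny frame with a deliberately duplicated visit:

```python
import pandas as pd

# Tiny frame with a duplicate 'Home' visit in the same group.
df = pd.DataFrame({
    'Date':     ['6-2022', '6-2022', '6-2022'],
    'Visitor':  ['Jim', 'Jim', 'Jim'],
    'Car Name': ['Subaru', 'Subaru', 'Subaru'],
    'Location': ['Home', 'Work', 'Home'],
})

# dict.fromkeys drops duplicates while preserving first-seen order,
# unlike set(), which may reorder the values.
res = df.groupby(['Date', 'Visitor', 'Car Name'], as_index=False)['Location'] \
        .agg(lambda x: ', '.join(dict.fromkeys(x)))
print(res['Location'].iloc[0])  # Home, Work
```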

Upvotes: 1
