trey hannam

Reputation: 263

Faster Way to Categorize Unique Row Combinations based on a Condition in Python Pandas?

I have a dataset that has the same layout as the one below. I want to categorize where each person went in their car for a given month.

Date    Visitor  Car Name  Location
6-2022  Michael  Chrysler  Work
6-2022  Jim      Subaru    Home
6-2022  Jim      Subaru    Work
6-2022  Michael  Chrysler  Karate
7-2022  Michael  Chrysler  Work
7-2022  Jim      Subaru    Home
7-2022  Jim      Subaru    Work

The solution would look something like this:

Date    Visitor  Car Name  Location
6-2022  Michael  Chrysler  Work & Karate
6-2022  Jim      Subaru    Work & Home
7-2022  Michael  Chrysler  Work
7-2022  Jim      Subaru    Work & Home

I have a working solution, but its runtime grows quickly as the number of rows increases. The fastest I have gotten it to run on ~32,000 rows is ~15 minutes.

Upvotes: 0

Views: 97

Answers (2)

Scott Boston

Reputation: 153460

Try this:

df.groupby(['Date', 'Visitor', 'Car Name'], as_index=False)['Location'].agg(' & '.join)

Output:

     Date  Visitor  Car Name       Location
0  6-2022      Jim    Subaru    Home & Work
1  6-2022  Michael  Chrysler  Work & Karate
2  7-2022      Jim    Subaru    Home & Work
3  7-2022  Michael  Chrysler           Work
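For reference, a runnable sketch that builds the question's sample data by hand and applies the one-liner above. A single vectorized groupby makes one pass over the frame instead of looping row by row, which is why it scales far better than a per-row approach:

```python
import pandas as pd

# Sample data, reconstructed from the question's table.
df = pd.DataFrame({
    'Date':     ['6-2022', '6-2022', '6-2022', '6-2022', '7-2022', '7-2022', '7-2022'],
    'Visitor':  ['Michael', 'Jim', 'Jim', 'Michael', 'Michael', 'Jim', 'Jim'],
    'Car Name': ['Chrysler', 'Subaru', 'Subaru', 'Chrysler', 'Chrysler', 'Subaru', 'Subaru'],
    'Location': ['Work', 'Home', 'Work', 'Karate', 'Work', 'Home', 'Work'],
})

# One groupby over the key columns; ' & '.join concatenates each group's locations.
res = df.groupby(['Date', 'Visitor', 'Car Name'], as_index=False)['Location'].agg(' & '.join)
print(res)
```

Note that groupby sorts the keys by default, which is why Jim's rows appear before Michael's in the output.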

Upvotes: 1

ouroboros1

Reputation: 14109

You could try as follows:

  • Use df.groupby on Date, Visitor, Car Name, and apply join to the values of Location.
  • To change the last , delimiter to an ampersand, you could use Series.replace with a regex pattern.
res = df.groupby(['Date','Visitor','Car Name'], 
                 as_index=False, sort=False)['Location'].agg(', '.join)

# changing each last `,` delimiter into an `&`
res['Location'] = res['Location'].replace(r',\s(?=[^,]*$)',' & ', regex=True)

print(res)

     Date  Visitor  Car Name       Location
0  6-2022  Michael  Chrysler  Work & Karate
1  6-2022      Jim    Subaru    Home & Work
2  7-2022  Michael  Chrysler           Work
3  7-2022      Jim    Subaru    Home & Work

If you expect duplicates per group (and want to get rid of them), you could use .agg(lambda x: ', '.join(set(x))) instead of .agg(', '.join). Mind you, set does not preserve order; not sure if that is a problem for you. If it is, there are of course also ways to remove duplicates from a list while maintaining order.
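One order-preserving option is dict.fromkeys, which keeps the first occurrence of each value (insertion order is guaranteed on Python 3.7+). A minimal sketch, using a tiny frame with a deliberately duplicated visit:

```python
import pandas as pd

# Tiny frame with a duplicate 'Home' visit in the same group.
df = pd.DataFrame({
    'Date':     ['6-2022', '6-2022', '6-2022'],
    'Visitor':  ['Jim', 'Jim', 'Jim'],
    'Car Name': ['Subaru', 'Subaru', 'Subaru'],
    'Location': ['Home', 'Work', 'Home'],
})

# dict.fromkeys drops duplicates while preserving first-seen order,
# unlike set(), which may reorder the values.
res = df.groupby(['Date', 'Visitor', 'Car Name'], as_index=False)['Location'] \
        .agg(lambda x: ', '.join(dict.fromkeys(x)))
print(res['Location'].iloc[0])  # Home, Work
```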

Upvotes: 1
