min2bro
min2bro

Reputation: 4638

Pandas Group by and find count of common strings

My Dataframe:

pd.DataFrame({'company':['Chipotle','Branchburg Chipotle','Chipotle NJ','Chipotle 8853','The Home Depot','Home Depot','28211 Home Depot','Wendys','BJs','Buffalo wings'],
'address':['123 Main Street Branchburg NJ 08853'
,'123 Main Street Branchburg NJ 08853'
,'123 Main Street Branchburg NJ 08853'
,'123 Main Street Branchburg NJ 08853'
,'1220 N Wendover Rd Charlotte NC 28211'
,'1220 N Wendover Rd Charlotte NC 28211'
,'1220 N Wendover Rd Charlotte NC 28211'
,'2805 Whitson St Selma CA 93662'
,'2805 Whitson St Selma CA 93662'
,'2805 Whitson St Selma CA 93662']})

    company                    address
0   Chipotle            123 Main Street Branchburg NJ 08853
1   Branchburg Chipotle 123 Main Street Branchburg NJ 08853
2   Chipotle NJ         123 Main Street Branchburg NJ 08853
3   Chipotle 8853       123 Main Street Branchburg NJ 08853
4   The Home Depot      1220 N Wendover Rd Charlotte NC 28211
5   Home Depot          1220 N Wendover Rd Charlotte NC 28211
6   28211 Home Depot    1220 N Wendover Rd Charlotte NC 28211
7   Wendy's             2805 Whitson St Selma CA 93662
8   BJ's                2805 Whitson St Selma CA 93662
9   Buffalo wings       2805 Whitson St Selma CA 93662

I have to groupby Address and find the common word in the company column and write that to a new column 'count'. so for first address the common word is chipotle so count is 1 and for second address common word is home depot so count 2 and for third address no common words so count 0

Expected Ouput

     company        address                               count
0   Chipotle        123 Main Street Branchburg NJ 08853     1
1   The Home Depot  1220 N Wendover Rd Charlotte NC 28211   2
2   Wendy's         2805 Whitson St Selma CA 93662          0

I can think of iterating through the dataframe and using set intersection but that's too slow process. Is there any pandas way to achieve this?

Upvotes: 2

Views: 919

Answers (1)

rafaelc
rafaelc

Reputation: 59274

from functools import reduce
import operator
def log(x):
    inters = reduce(operator.and_, [set(r) for r in x.str.split()])
    if inters: return (' '.join(inters), len(inters))
    return (x.iloc[0], 0)
df.groupby('address').agg(log).company.apply(pd.Series).rename({0: 'company', 1: 'count'}, axis=1)

                                        company     count
address     
1220 N Wendover Rd Charlotte NC 28211   Home Depot  2
123 Main Street Branchburg NJ 08853     Chipotle    1
2805 Whitson St Selma CA 93662          Wendys      0

If pandas 0.20

.rename(columns={0: 'company', 1: 'count'})

Upvotes: 4

Related Questions