Reputation: 297
I'm trying to get the count of substrings in column b which match column a in any order.
example:
[col a] [col b] [frequency]
big red car elon musk drives a big red car 1
elon musk car elon musk drives a big red car 1
red big car elon musk drives a big red car 1
The maximum amount of matches needs to be fixed at 1. e.g. big red car would only match once, instead of matching for every combination.
I'd need to return exact match on the words if possible. car does not match for card and so on.
what I've tried:
df["frequency"] = df.apply(lambda x: x['col b'].count(x['col a']), axis=1)
this finds exact matches only, but I need them to be matched in any order.
Any help appreciated.
Upvotes: 0
Views: 47
Reputation: 24304
Try via str.contains()
:
words='|'.join(df['[col a]'].unique())
#Finally:
df['[frequency]']=df['[col b]'].str.contains(words).astype(int)
#OR
df['[frequency]']=df['[col b]'].str.contains(words).view('i1')
output of df
:
[col a] [col b] [frequency]
big red car elon musk drives a big red car 1
elon musk car elon musk drives a big red car 1
red big car elon musk drives a big red car 1
Upvotes: 1
Reputation: 260300
Assuming you want to check that all the words in "[col A]" are in "[col B]":
def ismatch(s):
A = set(s['[col a]'].split())
B = set(s['[col b]'].split())
return A.intersection(B) == A
df.apply(ismatch, axis=1)
input:
[col a] [col b] [frequency]
0 big red car elon musk drives a big red car 1
1 elon musk car elon musk drives a big red car 1
2 red big car elon musk drives a big red car 1
3 red big card elon musk drives a big red car 1
output:
0 True
1 True
2 True
3 False
Upvotes: 1