Get the count of strings found in column b from column a in any order and return counts in a new column

Question

I'm trying to get the count of substrings in column b which match column a in any order.

example:

[col a]                   [col b]                             [frequency]
big red car            elon musk drives a big red car              1
elon musk car          elon musk drives a big red car              1
red big car            elon musk drives a big red car              1

The maximum amount of matches needs to be fixed at 1. e.g. big red car would only match once, instead of matching for every combination.

I'd need to return exact match on the words if possible. car does not match for card and so on.

what I've tried:

df["frequency"] = df.apply(lambda x: x['col b'].count(x['col a']), axis=1)

this finds exact matches only, but I need them to be matched in any order.

Any help appreciated.

mozway · Accepted Answer

Assuming you want to check that all the words in "[col A]" are in "[col B]":

def ismatch(s):
    A = set(s['[col a]'].split())
    B = set(s['[col b]'].split())
    return A.intersection(B) == A
df.apply(ismatch, axis=1)

input:

         [col a]                         [col b]  [frequency]
0    big red car  elon musk drives a big red car            1
1  elon musk car  elon musk drives a big red car            1
2    red big car  elon musk drives a big red car            1
3   red big card  elon musk drives a big red car            1

output:

0    True
1    True
2    True
3   False

Get the count of strings found in column b from column a in any order and return counts in a new column

Answers (2)

Related Questions