searching a word in the column pandas dataframe python

Question

I have two text columns and I would like to find whether a word from one column is present in another. I wrote the below code, which works very well, but it detects if a word is present anywhere in the string. For example, it will find "ha" in "ham". I want to use regex expression instead, but I am stuck. I came across this post and looked at the second answer, but I haven't been able to modify it for my purpose. I would like to do something similar.

I would appreciate help and/or any pointers

d = {'emp': ['abc d. efg', 'za', 'sdfadsf '], 'vendor': ['ABCD enterprise', 'za industries', '' ]}
df = pd.DataFrame(data=d)
df['clean_empy_name']=df["emp"].str.lower().str.replace('\W', ' ')

def check_subset(vendor, employee):
    s = []
    for n in employee.split():
      # n=" " + n +"[^a-zA-Z\d:]"
      if ((str(n) in vendor.lower()) & (len(str(n))>1)):
        s.append(n)
    return s

check_subset("ABC-xy 54", "54 xy")

df['emp_name_find_in_vendor'] = df.apply(lambda row: check_subset(row['vendor'],row['clean_empy_name']), axis=1)
df

#########update 2

i updated my dataframe as below

d = {'emp': ['abc d. efg', 'za', 'sdfadsf ','abc','yuma'], 'vendor': ['ABCD enterprise', 'za industries', '','Person Vue\Cisco','U OF M CONTLEARNING' ]}
df = pd.DataFrame(data=d)
df['clean_empy_name']=df["emp"].str.lower().str.replace('\W', ' ')

I used code provided by first answer and it fails

in case of 'Person Vue\Cisco' it throws the error error: bad escape \c. If i remove \ in 'Person Vue\Cisco', code runs fine
in case of 'U OF M CONTLEARNING' it return u and m when clearly they are not a match

WillMonge · Accepted Answer

Yes, you can! It is going to be a little bit messy so let me construct in a few steps:

First, let's just create a regular expression for the single case of check_subset("ABC-xy 54", "54 xy"):

We will use re.findall(pattern, string) to find all the occurrences of pattern in string
The regex pattern will basically say "any of the words":
- for the "any" we use the | (or) operator
- for constructing words we need to use the parenthesis to group together... However, parenthesis (word) create a group that keeps track, so we could later call reuse these groups, since we are not interested we can create a non-capturing group by adding ?: as follows: (?:word)

import re

re.findall('(?:54)|(?:xy)', 'ABC-xy 54')
# -> ['xy', '54']

Now, we have to construct the pattern each time:

Split into words
Wrap each word inside a non-capturing group (?:)
Join all of these groups by |

re.findall('|'.join(['(?:'+x+')' for x in '54 xy'.split()]), 'ABC-xy 54')

One minor thing, since the last row's vendor is empty and you seem to want no matches (technically, the empty string matches with everything) we have to add a minor check. So we can rewrite your function to be:

def check_subset_regex(vendor, employee):
    if vendor == '':
        return []

    pattern = '|'.join(['(?:'+x+')' for x in vendor.lower().split(' ')])
    return re.findall(pattern, employee)

And then we can apply the same way:

df['emp_name_find_in_vendor_regex'] = df.apply(lambda row: check_subset_regex(row['vendor'],row['clean_empy_name']), axis=1)

One final comment is that your solution matches partial words, so employee Tom Sawyer would match "Tom" to the vendor "Atomic S.A.". The regex function I provided here will not give this as a match, should you want to do this the regex would become a little more complicated.

EDIT: Removing punctuation marks from vendors

You could either add a new column as you did with clean_employee, or simply add the removal to the function, as so (you will need to import string to get the string.punctuation, or just add in there a string with all the symbols you want to substitute):

def check_subset_regex(vendor, employee):
    if vendor == '':
        return []

    clean_vnd = re.sub('[' + string.punctuation + ']', '', vendor)

    pattern = '|'.join(['(?:'+x+')' for x in clean_vnd.lower().split(' ')])
    return re.findall(pattern, employee)

In the spirit of teaching to fish :), in regex the [] denote any of these characters... So [abc] would be the same to a|b|c.

So the re.sub line will substitute any occurrence of the string.punctuation (which evaluates to !"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~) characters by a '' (removing them).

EDIT2: Adding the possibility of a single non-alphanumeric character at the end of each searchword:

def check_subset_regex(vendor, employee):
    if vendor == '':
        return []

    clean_vnd = re.sub('[' + string.punctuation + ']', '', vendor)

    pattern = '|'.join(['(?:'+x+'[^a-zA-Z0-9]?)' for x in clean_vnd.lower().split(' ')])
    return re.findall(pattern, employee)

In this case we are using:
- ^ as the first character inside a [] (called character class), denotes any character except for those specified in the character class, e.g. [^abc] would match anything that is not a or b or c (so d, or a white space, or @) - and the ?, which means the previous symbol is optional...

So, [^a-zA-Z0-9]? means an optional single non-alphanumeric character.

searching a word in the column pandas dataframe python

Answers (1)

Related Questions