Hari

Reputation: 193

How do I perform an exact string match in Python?

I have a set of words

words = {'thanks giving', 'cat', 'instead of',etc...}

I need to search for exactly these words in the 'description' column of a table:

ID  | Description                 |
----|-----------------------------|
1   | having fun   thanks giving  |
2   | cat eats all the food       |
3   | instead you can come        |

def matched_words(x, words):
    match_words = []
    for word in words:
        # plain substring test: also matches inside longer words
        if word in x:
            match_words.append(word)
    return match_words

df['new_col'] = df['description'].apply(lambda x:matched_words(x,words))

Desired output:

ID  | Description                 | matched words     |
----|-----------------------------|-------------------|
1   | having fun   thanks giving  | ['thanks giving'] |
2   | cat eats all the food       | ['cat']           |
3   | instead you can come        | []                |

I'm only getting matches for single tokens like ['cat'].

Upvotes: 0

Views: 7112

Answers (2)

Chris Larson

Reputation: 1714

The following code should give you the results you're looking for:

import re

words = {'thanks', 'cat', 'instead of'}
phrases = [
    [1,"having fun at thanksgiving"],
    [2,"cater the food"],
    [3, "instead you can come"],
    [4, "instead of pizza"],
    [5, "thanks for all the fish"]
]

matched_words = []
matched_pairs = []
for word in words:
    for phrase in phrases:
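        # \b on both sides matches only at word boundaries, including the end of the string
        result = re.search(r'\b' + re.escape(word) + r'\b', phrase[1])
        if result:
            matched_words.append(result.group(0))
            matched_pairs.append([result.group(0), phrase])

print(matched_words)
print(matched_pairs)

The relevant part is the regex call re.search(r'\b' + re.escape(word) + r'\b', phrase[1]). It looks for the search string beginning and ending at a word boundary \b, a zero-width assertion that matches between a word character and a non-word character, or between a word character and the start or end of the string. This ensures that only whole-word matches are found, so 'thanks' does not match inside 'thanksgiving' and 'cat' does not match inside 'cater'. re.escape keeps any regex metacharacters in the search words literal. There is no need to preprocess the text you want to search.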

Of course, you can rename words, phrases, matched_words and matched_pairs to anything you like.
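
To tie this back to the question, here is a minimal sketch of the same word-boundary idea wrapped in a function that returns a list per description. It puts \b on both sides of the escaped phrase. The list name descriptions, the results variable, and the extra 'cater the food' row are assumptions added for illustration; sorted() is used only so the output order is deterministic.

```python
import re

def matched_words(text, words):
    """Return the entries of `words` that occur in `text` as whole words or phrases."""
    found = []
    for word in sorted(words):  # sorted only to make the output order deterministic
        # re.escape keeps any regex metacharacters in `word` literal
        if re.search(r'\b' + re.escape(word) + r'\b', text):
            found.append(word)
    return found

words = {'thanks giving', 'cat', 'instead of'}

# Hypothetical stand-in for the 'description' column in the question
descriptions = [
    'having fun   thanks giving',
    ' cat eats all the food',
    ' instead you can come',
    'cater the food',
]

results = [matched_words(d, words) for d in descriptions]
for description, result in zip(descriptions, results):
    print(repr(description), '->', result)
```

With a DataFrame like the one in the question, the same function should slot into the existing call: df['new_col'] = df['description'].apply(lambda x: matched_words(x, words)).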

Hope this helps!

Upvotes: 1

Stephen Rauch

Reputation: 49784

import re
words = {'thanks', 'cat', 'instead of'}

samples = [
    (1, 'having fun at thanksgiving'),
    (2, 'cater the food'),
    (3, 'instead you can come'),
    (4, 'instead of you can come'),
]

for sample_id, description in samples:
    for word in words:
        # \b on both sides restricts the match to whole words or phrases
        if re.search(r'\b' + re.escape(word) + r'\b', description):
            print("'%s' in '%s'" % (word, description))
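
If the word set is large, one possible variation is to compile a single alternation pattern once instead of running one search per word. This is only a sketch under that assumption, not part of the answer above: the names alternatives, pattern and found, and sample 5, are made up for illustration.

```python
import re

words = {'thanks', 'cat', 'instead of'}

# Longest alternatives first, so a longer phrase wins over a shorter one sharing its prefix
alternatives = sorted(words, key=len, reverse=True)
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(w) for w in alternatives) + r')\b'
)

samples = [
    (1, 'having fun at thanksgiving'),
    (2, 'cater the food'),
    (3, 'instead you can come'),
    (4, 'instead of you can come'),
    (5, 'thanks to the cat'),  # hypothetical extra row with two matches
]

# findall returns every non-overlapping whole-word match, in order of appearance
found = {sample_id: pattern.findall(description) for sample_id, description in samples}
for sample_id, hits in found.items():
    print(sample_id, hits)
```

Note that findall reports non-overlapping matches only, so if two search phrases overlap in the text, only the first one found is returned; the per-word loop does not have that limitation.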

Upvotes: 0
