codedancer

Reputation: 1634

Count a word, but ignore it when the preceding word is capitalized

I am trying to determine whether the word "McDonald" is in the cell. However, I want to ignore cases where the word before "McDonald" starts with a capital letter, as in 'Kevin McDonald'. Any suggestions on how to get this right with a regex in a pandas DataFrame?

import pandas as pd

data = {'text':["Kevin McDonald has bought a burger.", 
                "The best burger in McDonald is cheeze buger."]}

df = pd.DataFrame(data)
long_list = ['McDonald', 'Five Guys']

# matching any of the words; (?:...) makes \b apply to every alternative
pattern = r'\b(?:{})\b'.format('|'.join(long_list))

df['count'] = df.text.str.count(pattern)

                                           text
0           Kevin McDonald has bought a burger.
1  The best burger in McDonald is cheeze buger.

Expected output:

                                           text  count
0           Kevin McDonald has bought a burger.      0
1  The best burger in McDonald is cheeze buger.      1

Upvotes: 0

Views: 77

Answers (2)

mozway

Reputation: 261015

IIUC, the goal is to skip a match when the preceding word is capitalized. Requiring a non-capitalized word before the name would instead exclude many legitimate cases, such as the name at the start of the string.

Here is a regex that also handles a few more cases (start of the string, or a non-word character before the name):

regex = '|'.join(fr'(?:\b[^A-Z]\S*\s+|[^\w\s] ?|^){i}' for i in long_list)
df['count'] = df['text'].str.count(regex)

Example:

                                           text  count
0           Kevin McDonald has bought a burger.      0
1  The best burger in McDonald is cheeze buger.      1
2                       McDonald's restaurants.      1
3                 Blah. McDonald's restaurants.      1
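
The three alternatives in the prefix can be checked one row at a time. The sketch below is only an illustration: it uses the standard `re` module in place of `Series.str.count` (which should give the same counts on these rows), and the `re.escape` call and the `prefix`, `texts` and `counts` names are additions that are not part of the answer above.

```python
import re

long_list = ['McDonald', 'Five Guys']

# Contexts allowed before each name:
#   \b[^A-Z]\S*\s+   a preceding token that does not start with A-Z
#   [^\w\s] ?        punctuation, optionally followed by one space
#   ^                start of the string
prefix = r'(?:\b[^A-Z]\S*\s+|[^\w\s] ?|^)'

# re.escape is an assumption of this sketch; it guards names that
# contain regex metacharacters.
regex = '|'.join(prefix + re.escape(name) for name in long_list)

texts = [
    "Kevin McDonald has bought a burger.",
    "The best burger in McDonald is cheeze buger.",
    "McDonald's restaurants.",
    "Blah. McDonald's restaurants.",
]

# Count non-overlapping matches per row.
counts = [len(re.findall(regex, t)) for t in texts]
print(counts)  # [0, 1, 1, 1]
```

Note that the first alternative depends on exactly how the words are separated, so it is worth testing on your own data before relying on it.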

You can test and understand the regex here

Upvotes: 1

darth baba

Reputation: 1398

You can try this pattern:

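# require a lowercase-initial word directly before the name;
# (?:...) keeps the prefix attached to every alternative
pattern = r'\b[a-z]\w* (?:{})'.format('|'.join(long_list))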

df['count'] = df.text.str.count(pattern)
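
A quick check of the lowercase-word idea, written here with the alternatives grouped in `(?:...)` and with `\w*` so that only the word directly before the name is inspected. This is a sketch under assumptions: `re` stands in for `Series.str.count`, `re.escape` is an addition, and the last two rows are hypothetical test strings rather than data from the question.

```python
import re

long_list = ['McDonald', 'Five Guys']

# Lowercase-initial word, one space, then any of the names.
pattern = r'\b[a-z]\w* (?:{})'.format('|'.join(map(re.escape, long_list)))

texts = [
    "Kevin McDonald has bought a burger.",
    "The best burger in McDonald is cheeze buger.",
    "McDonald's restaurants.",    # name at the start of the string
    "the great Kevin McDonald",   # lowercase words earlier in the sentence
]

# Count non-overlapping matches per row.
counts = [len(re.findall(pattern, t)) for t in texts]
print(counts)  # [0, 1, 0, 0]
```

Because a lowercase word is required before the name, a row that starts with the name is not counted. Whether that is acceptable depends on the data.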

Upvotes: 2
