user3590149

Reputation: 1605

Finding word pattern between key characters

I'm trying to write a regular expression that finds a keyword inside one long string, but only when the keyword is not directly adjacent to a letter. A keyword next to a dash or an underscore should still count, as long as no letter touches it. A single occurrence is enough to count as a match, and I only ever search one string at a time. Currently, I can't get it to return True when the keyword has a '_' next to it. Any ideas for a better expression?

Edit: I found another case that needs to be True and that I had not included in the example.

import re

key_words = ['go', 'at', 'why', 'stop']

false_match = ['going_get_that', 'that_is_wstop', 'whysper', 'stoping_tat']

positive_match = ['go-around', 'go_at_going', 'stop-by_the_store', 'stop', 'something-stop', 'something_stop']
pattern = r"\b(%s)\b" % '|'.join(key_words)

for word in false_match + positive_match:
    if re.match(pattern, word):
        print(True, word)
    else:
        print(False, word)

Current Output:

False going_get_that
False that_is_wstop
False whysper
False stoping_tat
True go-around
False go_at_going
True stop-by_the_store
True stop

Edit: these two need to be True:

  False something-stop
  False something_stop

Desired output:

    False going_get_that
    False that_is_wstop
    False whysper
    False stoping_tat
    True go-around
    True go_at_going
    True stop-by_the_store
    True stop
    True something-stop
    True something_stop

Upvotes: 2

Views: 60

Answers (3)

Kasravnd

Reputation: 107307

Your pattern is close. You can use the following pattern:

pattern = r"\b([a-zA-Z]+[-_])?(%s)([-_][a-zA-Z_-]+)?\b" % '|'.join(key_words)
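As a quick check, the sketch below runs this pattern over the word lists from the question. It is written for Python 3 and assumes `re.match`, as in the question's loop, since the answer does not say which function to call; the `results` dictionary is only an illustrative helper.

```python
import re

key_words = ['go', 'at', 'why', 'stop']
pattern = r"\b([a-zA-Z]+[-_])?(%s)([-_][a-zA-Z_-]+)?\b" % '|'.join(key_words)

false_match = ['going_get_that', 'that_is_wstop', 'whysper', 'stoping_tat']
positive_match = ['go-around', 'go_at_going', 'stop-by_the_store', 'stop',
                  'something-stop', 'something_stop']

# Assumption: re.match (anchored at position 0), as in the question's loop.
results = {word: bool(re.match(pattern, word))
           for word in false_match + positive_match}

for word, matched in results.items():
    print(matched, word)
```

The optional first group allows only one letters-plus-separator segment before the keyword, so inputs with a different shape than the examples above are worth testing separately.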


Upvotes: 0

L3viathan

Reputation: 27283

Use negative lookbehind and lookahead assertions, and search with re.search instead of re.match:

import re

key_words = ['go', 'at', 'why', 'stop']

false_match = ['going_get_that', 'that_is_wstop', 'whysper', 'stoping_tat']

positive_match = ['go-around', 'go_at_going', 'stop-by_the_store', 'stop', 'something-stop', 'something_stop']
# The lookarounds reject a keyword with an ASCII letter directly before or after it.
pattern = r"(?<![a-zA-Z])(%s)(?![a-zA-Z])" % '|'.join(key_words)

for word in false_match + positive_match:
    # re.search scans the whole string; re.match only tries position 0.
    if re.search(pattern, word):
        print(True, word)
    else:
        print(False, word)
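If the keyword list is reused, the same lookaround idea can be wrapped in a small helper. This is only a sketch for Python 3: the function names `build_keyword_pattern` and `contains_keyword` are hypothetical, and `re.escape` is added as a precaution in case a keyword ever contains regex metacharacters, which the question's keywords do not.

```python
import re

def build_keyword_pattern(key_words):
    """Compile a pattern matching any keyword that touches no ASCII letter."""
    # re.escape guards against keywords containing regex metacharacters.
    alternation = '|'.join(re.escape(k) for k in key_words)
    return re.compile(r"(?<![a-zA-Z])(?:%s)(?![a-zA-Z])" % alternation)

def contains_keyword(pattern, text):
    """Return True if at least one keyword occurs anywhere in text."""
    # search() scans the whole string; match() would only try position 0.
    return pattern.search(text) is not None

pattern = build_keyword_pattern(['go', 'at', 'why', 'stop'])
print(contains_keyword(pattern, 'something_stop'))  # True
print(contains_keyword(pattern, 'whysper'))         # False
```

Compiling once avoids rebuilding the pattern string on every call, and returning a boolean fits the requirement that a single occurrence is enough.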

Upvotes: 1

vks

Reputation: 67968

The problem with \b is that it treats _ as part of \w, so there is no word boundary between a letter and an underscore. To get around that, define the boundary yourself with your own character class.

(?:^|(?<=[^a-zA-Z0-9]))(go|at|why|stop)(?=[^a-zA-Z0-9]|$)

Try this. See the demo:

https://regex101.com/r/oC3qA3/8

import re

key_words = ['go', 'at', 'why', 'stop']

false_match = ['going_get_that', 'that_is_wstop', 'whysper', 'stoping_tat']

positive_match = ['go-around', 'go_at_going', 'stop-by_the_store', 'stop']
# A keyword must sit at the start/end of the string or next to a non-alphanumeric character.
pattern = r"(?:^|(?<=[^a-zA-Z0-9]))(%s)(?=[^a-zA-Z0-9]|$)" % '|'.join(key_words)

for word in false_match + positive_match:
    # findall returns an empty (falsy) list when nothing matches.
    if re.findall(pattern, word):
        print(True, word)
    else:
        print(False, word)

Output:

False going_get_that
False that_is_wstop
False whysper
False stoping_tat
True go-around
True go_at_going
True stop-by_the_store
True stop
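The point about \b at the top of this answer can be seen in isolation with a minimal Python 3 sketch. The variable names `with_b` and `custom` are illustrative only, and the keyword is hard-coded to 'go' to keep the comparison short.

```python
import re

text = 'go_at_going'

# '_' belongs to \w, so there is no \b between 'o' and '_': no match.
with_b = re.search(r"\bgo\b", text)

# A hand-written boundary that does not count '_' as a word character
# finds 'go' at the start of the string.
custom = re.search(r"(?:^|(?<=[^a-zA-Z0-9]))go(?=[^a-zA-Z0-9]|$)", text)

print(with_b)           # None
print(custom.group(0))  # go
```

Note that this character class also treats digits as word characters, so a keyword directly followed by a digit, such as 'stop2', is not matched by this pattern.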

Upvotes: 0
