OLGJ

Reputation: 432

Regex to match word but only if it doesn't start with a non-alphanumerical character

I have sentences in which I want to identify words, but not when the word is immediately preceded by a non-alphanumerical character. It's fine if it is followed by one, though.

An example of what I've done:

import re

words = ["THIS", "THAT"]
sentences = ["I want to identify THIS word.", "And THAT!", "But I do not want to identify !THIS word", "Or [THIS] word"]

for sentence in sentences:
    for word in words:
        # Search for the word delimited by word boundaries on both sides
        word_re = re.search(r"\b(%s)\b" % word, sentence)
        if word_re:
            print("It's a match!")

The code above reports a match in every one of the sentences. I would like a pattern that matches only in the first two sentences. Is it possible to do what I want with regex?

Thanks!

Upvotes: 2

Views: 868

Answers (1)

Wiktor Stribiżew

Reputation: 626926

You can use a regex like

(?<!\S)(?:THIS|THAT)\b

Details:

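  • (?<!\S) - a left-hand whitespace boundary: a negative lookbehind that fails if the character immediately to the left is a non-whitespace character, so the match must begin at the start of the string or right after whitespace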
  • (?:THIS|THAT) - a non-capturing group matching either THIS or THAT
  • \b - a word boundary.
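
To illustrate why the left-hand side needs (?<!\S) rather than \b, here is a minimal sketch comparing the two. The helper name `matches` and the short sample strings are made up for illustration only.

```python
import re

def matches(pattern, text):
    """Return True if pattern is found anywhere in text (illustrative helper)."""
    return re.search(pattern, text) is not None

# \b only needs a word/non-word transition, so a "!" before the word is accepted.
print(matches(r"\bTHIS\b", "!THIS word"))            # True

# (?<!\S) needs the start of the string or whitespace right before the word.
print(matches(r"(?<!\S)THIS\b", "!THIS word"))       # False
print(matches(r"(?<!\S)THIS\b", "THIS word."))       # True
print(matches(r"(?<!\S)THIS\b", "identify THIS."))   # True
```

The trailing \b is kept on the right so that punctuation after the word, as in "And THAT!", still counts as a match.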

Python demo:

import re
words = ["THIS", "THAT"]
sentences = ["I want to identify THIS word.", "And THAT!", "But I do not want to identify !THIS word", "Or [THIS] word"] 

pattern = fr"(?<!\S)(?:{'|'.join(words)})\b"
for sentence in sentences:
    word_re = re.search(pattern, sentence) 
    if word_re:
        print(f"'{sentence}' is a match!")

# => 'I want to identify THIS word.' is a match!
#    'And THAT!' is a match!

If THIS or THAT can contain special chars, replace pattern = fr"(?<!\S)(?:{'|'.join(words)})\b" with pattern = fr"(?<!\S)(?:{'|'.join(map(re.escape, words))})\b".
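
As a rough illustration of why escaping matters, the sketch below uses a hypothetical word "TH.S" (not from the question) whose dot is a regex metacharacter, and compares the pattern built with and without re.escape.

```python
import re

words = ["TH.S"]  # hypothetical word containing a regex metacharacter

unescaped = fr"(?<!\S)(?:{'|'.join(words)})\b"
escaped = fr"(?<!\S)(?:{'|'.join(map(re.escape, words))})\b"

# Unescaped, "." matches any character, so "THIS" is matched by accident.
print(bool(re.search(unescaped, "find THIS word")))  # True

# Escaped, only the literal text "TH.S" matches.
print(bool(re.search(escaped, "find THIS word")))    # False
print(bool(re.search(escaped, "find TH.S word")))    # True
```

Note that the trailing \b still expects the word to end in a word character. A word ending in a symbol, such as "C++", would not match before whitespace with this pattern, so the right-hand boundary may need adjusting in that case.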

Upvotes: 2
