784347

Reputation: 83

Regex that matches punctuation at the word boundary including underscore

I am looking for a Python regex for a variable phrase with the following properties. (For the sake of example, assume the phrase takes the value and. Note, however, that whatever plays the role of and must be passed in as a variable, which I'll call phrase.)

Should match: this_and, this.and, (and), [and], and^, ;And, etc.

Should not match: land, andy

This is what I tried so far (where phrase is playing the role of and):

pattern = r"\b" + re.escape(phrase.lower()) + r"\b"

This seems to work for all my requirements except that it does not match words with underscores, e.g. _hello, hello_, hello_world.
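
The underscore behaviour follows from how \b is defined: it only matches between a word character (\w) and a non-word character, and _ counts as a word character. A minimal sketch of the attempt above, assuming the example phrase and a few sample strings that are illustrative only:

```python
import re

phrase = "and"
# \b fires only between a \w character and a non-\w character (or a string edge).
pattern = re.compile(r"\b" + re.escape(phrase.lower()) + r"\b")

# "_" belongs to \w, so there is no boundary between "_" and "and".
samples = ["this.and", "this_and", "and_", "land"]
results = {s: bool(pattern.search(s.lower())) for s in samples}
print(results)
```

Only this.and matches here; the underscore cases fail for the same reason land does, because the neighbouring character is treated as part of the word.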

Edit: Ideally I would like to use the standard library re module rather than any external packages.

Upvotes: 5

Views: 2713

Answers (2)

Wiktor Stribiżew

Reputation: 627607

You may use

r'(?<![^\W_])and(?![^\W_])'

See the regex demo. Compile with the re.I flag to enable case insensitive matching.

Details

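  • (?<![^\W_]) - a negative lookbehind: the character immediately before must not be a letter or digit (an underscore, punctuation, whitespace, or the start of the string are all allowed)
  • and - the keyword
  • (?![^\W_]) - a negative lookahead: the character immediately after must not be a letter or digit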
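
The class [^\W_] reads as "not a non-word character and not an underscore", which leaves letters and digits. A small sketch to check that reading, with sample characters chosen only for illustration:

```python
import re

# [^\W_] = word characters (\w) minus the underscore, i.e. letters and digits.
alnum = re.compile(r"[^\W_]")

samples = ["a", "Z", "7", "\u00e9", "_", ".", " ", "^"]
classified = {ch: bool(alnum.fullmatch(ch)) for ch in samples}
print(classified)
```

Letters (including non-ASCII ones, for str patterns) and digits come back True, while the underscore and punctuation come back False, so the two lookarounds treat _ like any other separator.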

Python demo:

import re
strs = ['this_and', 'this.and', '(and)', '[and]', 'and^', ';And', 'land', 'andy']
phrase = "and"
rx = re.compile(r'(?<![^\W_]){}(?![^\W_])'.format(re.escape(phrase)), re.I)
for s in strs:
    print("{}: {}".format(s, bool(rx.search(s))))

Output:

this_and: True
this.and: True
(and): True
[and]: True
and^: True
;And: True
land: False
andy: False

Upvotes: 6

user101

Reputation: 506

Here is a regex that might solve it:

Regex

(?<=[\W_]+|^)and(?=[\W_]+|$)

Example

import regex  # third-party module (pip install regex), not the standard library re

pattern = r'(?<=[\W_]+|^)and(?=[\W_]+|$)'

string = 'this_And'
test = regex.search(pattern, string.lower())
print(test.group(0))
# prints: and

# No match
string = 'Andy'
test = regex.search(pattern, string.lower())
print(test)
# prints: None

strings = ["this_and", "this.and", "(and)", "[and]", "and^", ";And"]
matches = [regex.search(pattern, s.lower()) for s in strings]
print([m.group(0) for m in matches if m])
# prints: ['and', 'and', 'and', 'and', 'and', 'and']

Explanation

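[\W_]+ matches one or more characters that are either non-word characters or the underscore (the underscore is itself a word character, so it has to be listed explicitly). Placed in the lookbehind (?<=...) and the lookahead (?=...), it requires such characters immediately before and after and. The |^ and |$ alternatives allow the match to sit at the start or end of the string.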

Edit

As mentioned in my comment, the regex module does not raise an error on variable-length lookbehinds (unlike re).

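# This works fine with the third-party regex module
import regex
word = 'and'
# escape the phrase so regex metacharacters in it are matched literally
pattern = r'(?<=[\W_]+|^){}(?=[\W_]+|$)'.format(regex.escape(word.lower()))
string = 'this_And'
print(regex.search(pattern, string.lower()))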
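
For comparison, here is a short sketch of what happens when the same pattern is handed to the standard library re module. The variable names are illustrative, and the exact wording of the error message may differ between Python versions:

```python
import re

# Same pattern as above; the lookbehind contains [\W_]+, which has no fixed width.
variable_lookbehind = r"(?<=[\W_]+|^)and(?=[\W_]+|$)"

try:
    re.compile(variable_lookbehind)
    error_message = None
except re.error as exc:
    # re rejects lookbehinds whose width is not fixed.
    error_message = str(exc)

print(error_message)
```

The lookahead part is not the problem: re accepts variable-width lookaheads, so only the lookbehind needs restructuring.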

However, if you insist on using re, then off the top of my head I'd suggest splitting the lookbehind in two: (?<=[\W_])and(?=[\W_]+|$)|^and(?=[\W_]+|$). That way, cases where the string starts with and are captured as well.

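# This also works fine with the standard library
import re
word = 'and'
escaped = re.escape(word.lower())  # match regex metacharacters in the phrase literally
pattern = r'(?<=[\W_]){0}(?=[\W_]+|$)|^{0}(?=[\W_]+|$)'.format(escaped)
string = 'this_And'
print(re.search(pattern, string.lower()))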
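
As a rough check, the split pattern can be run against the examples from the question. This is only a sketch: build_pattern is a hypothetical helper name, and passing re.I instead of lowercasing the input is an assumption made here for brevity.

```python
import re

def build_pattern(phrase):
    """Hypothetical helper: compile the split-lookbehind pattern for a phrase."""
    escaped = re.escape(phrase)
    # Branch 1: preceded by a non-word character or "_"; branch 2: at string start.
    template = r"(?<=[\W_]){0}(?=[\W_]+|$)|^{0}(?=[\W_]+|$)"
    return re.compile(template.format(escaped), re.I)

rx = build_pattern("and")
should_match = ["this_and", "this.and", "(and)", "[and]", "and^", ";And"]
should_not_match = ["land", "andy"]

matched = [s for s in should_match + should_not_match if rx.search(s)]
print(matched)
```

Only the six strings from the "should match" list end up in matched; land and andy are rejected because a letter sits directly next to the phrase.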

Upvotes: 2
