How to write if-then-else regex?

Question

The code below builds a regex for finding substrings in a DataFrame column.

How can I modify the regex as follows:

if x[0] is an English letter, that is, [a-zA-Z], then keep the first `\b`, else remove it
AND
if x[-1] is an English letter, that is, [a-zA-Z], then keep the last `\b`, else remove it

for k, v in keyword.items():
    pat = '|'.join(r"\b{}\b".format(x) for x in v)
    df[str(k)] = df['string'].str.contains(pat).astype(int)

String = 'BEAUTY Company is good, 歡迎~~YOU, SALE'
BEA: not match
Com: not match
歡迎: match
SALE: match

Thank you.

Wiktor Stribiżew · Accepted Answer

You may use

pat = r'(?!(?<=[A-Za-z])[A-Za-z])(?:{})(?<![A-Za-z](?=[A-Za-z]))'.format('|'.join(v))

See the online regex demo.

The main thing here is the two lookarounds, (?!(?<=[A-Za-z])[A-Za-z]) and (?<![A-Za-z](?=[A-Za-z])).

The (?!(?<=[A-Za-z])[A-Za-z]) is a negative lookahead that fails the match if the char immediately to the right of the current location (i.e. the first char of the keyword) is an ASCII letter that is preceded by another ASCII letter (checked with the positive lookbehind (?<=[A-Za-z])).

The (?<![A-Za-z](?=[A-Za-z])) is a negative lookbehind that fails the match if the char immediately to the left of the current location (i.e. the last char of the keyword) is an ASCII letter that is followed by another ASCII letter (checked with the positive lookahead (?=[A-Za-z])).
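To see each lookaround in isolation, here is a small sketch. It assumes the trailing lookaround is (?<![A-Za-z](?=[A-Za-z])), as the description above indicates. The names LEAD, TRAIL and starts are made up for this illustration, and the test strings are not from the question.

```python
import re

# The two guards described above (names are illustrative only).
LEAD = r'(?!(?<=[A-Za-z])[A-Za-z])'
TRAIL = r'(?<![A-Za-z](?=[A-Za-z]))'

def starts(pattern, text):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

# Keyword starts with a letter: a preceding letter blocks the match.
lead_letter = starts(LEAD + 'SALE', 'xSALE SALE')

# Keyword starts with a non-letter: a preceding letter is ignored.
lead_other = starts(LEAD + '歡迎', 'x歡迎')

# Keyword ends with a letter: a following letter blocks the match.
trail_letter = starts('SALE' + TRAIL, 'SALES SALE')

# Keyword ends with a non-letter: a following letter is ignored.
trail_other = starts('歡迎' + TRAIL, '歡迎x')

print(lead_letter, lead_other, trail_letter, trail_other)
# [6] [1] [6] [0]
```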


Note that you do not have to add these lookarounds to each alternative in the regex. Just use them to enclose a (?:...|...)-style alternation group, which you may build dynamically as shown above.
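For comparison, the literal "if-then-else" from the question can also be done in Python while building the pattern, one keyword at a time. The sketch below is an alternative, not part of the answer's pattern. It uses a hypothetical helper named guarded and plain (?<![A-Za-z]) / (?![A-Za-z]) guards rather than \b, because in Python 3 \b also treats digits, underscores and non-ASCII word chars as word chars. On the sample string both forms should agree.

```python
import re

def guarded(keyword):
    """Add an ASCII-letter guard only on the sides where the keyword has a letter."""
    left = r'(?<![A-Za-z])' if re.match(r'[A-Za-z]', keyword) else ''
    right = r'(?![A-Za-z])' if re.search(r'[A-Za-z]\Z', keyword) else ''
    return left + re.escape(keyword) + right

s = 'BEAUTY Company is good, 歡迎~~YOU, SALE'
v = ['BEA', 'Com', '歡迎', 'SALE']

# Form 1: guards decided per keyword in Python.
per_keyword = '|'.join(guarded(x) for x in v)

# Form 2: one pair of lookarounds enclosing the whole alternation group.
enclosed = r'(?!(?<=[A-Za-z])[A-Za-z])(?:{})(?<![A-Za-z](?=[A-Za-z]))'.format(
    '|'.join(re.escape(x) for x in v))

print(re.findall(per_keyword, s))  # ['歡迎', 'SALE']
print(re.findall(enclosed, s))     # ['歡迎', 'SALE']
```

The enclosed form keeps the pattern shorter when the keyword list is long, since the guards are written once.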

Also, [re.escape(x) for x in v] is handy if any of the keywords can contain special regex chars that should be treated as literal chars.
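A short sketch of why escaping matters, assuming the same enclosing lookarounds. The keywords 'a.b' and '$5' and the sample text are invented for illustration. Unescaped, the dot matches any char and the dollar sign acts as an end-of-string anchor.

```python
import re

TEMPLATE = r'(?!(?<=[A-Za-z])[A-Za-z])(?:{})(?<![A-Za-z](?=[A-Za-z]))'

text = 'axb costs $5, a.b costs more'  # made-up sample text
v = ['a.b', '$5']                      # keywords containing regex metacharacters

# Without escaping: '.' matches 'x' too, and '$5' can never match.
raw = re.findall(TEMPLATE.format('|'.join(v)), text)

# With escaping: every keyword is matched literally.
escaped = re.findall(TEMPLATE.format('|'.join(re.escape(x) for x in v)), text)

print(raw)      # ['axb', 'a.b']
print(escaped)  # ['$5', 'a.b']
```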

Python demo:

import re
s = 'BEAUTY Company is good, 歡迎~~YOU, SALE'
v = ['BEA','Com','歡迎','SALE']
pat = r'(?!(?<=[A-Za-z])[A-Za-z])(?:{})(?<![A-Za-z](?=[A-Za-z]))'.format('|'.join(v))
print(re.findall(pat, s))
# => ['歡迎', 'SALE']
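To connect this back to the loop in the question, here is a pandas-free sketch that produces the same kind of 0/1 flags per keyword group. The helper build_pattern, the group names in keyword, and the sample rows are all assumptions made for illustration. The lookbehind is the reconstructed form (?<![A-Za-z](?=[A-Za-z])).

```python
import re

def build_pattern(words):
    """Compile one pattern for a keyword list, escaping each keyword."""
    alternation = '|'.join(re.escape(w) for w in words)
    return re.compile(
        r'(?!(?<=[A-Za-z])[A-Za-z])(?:{})(?<![A-Za-z](?=[A-Za-z]))'.format(alternation))

# Hypothetical keyword groups and rows standing in for df['string'].
keyword = {'brand': ['BEA', 'Com'], 'promo': ['歡迎', 'SALE']}
rows = ['BEAUTY Company is good, 歡迎~~YOU, SALE', 'Com is here', 'nothing']

# One list of 0/1 flags per group, mirroring .str.contains(pat).astype(int).
flags = {
    k: [int(bool(build_pattern(v).search(row))) for row in rows]
    for k, v in keyword.items()
}

print(flags)
# {'brand': [0, 1, 0], 'promo': [1, 0, 0]}
```

With pandas, the string form of the same pattern (build_pattern(v).pattern) could presumably be passed to df['string'].str.contains(...) inside the original loop in place of the \b-based pat.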
