Vitto

Reputation: 399

Regex match word not immediately preceded by another word but possibly preceded by that word before

I need to match all strings that contain one word of a list, but only if that word is not immediately preceded by another specific word. I have this regex:

.*(?<!forbidden)\b(word1|word2|word3)\b.*

This still matches a sentence like hello forbidden word1, because forbidden is matched by .*. But if I remove the .*, I no longer match strings like hello word1, which I do want to match.

Note that I want to match a string like forbidden hello word1.

Could you suggest how to fix this problem?
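The behaviour described above can be reproduced with a short sketch using Python's built-in re module (the helper name original_matches is only for illustration). As far as I can tell, the lookbehind inspects just the nine characters immediately before word1, which in hello forbidden word1 are "orbidden " (ending in a space), so it never sees the literal forbidden.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r".*(?<!forbidden)\b(word1|word2|word3)\b.*")


def original_matches(text: str) -> bool:
    """Return True if the question's pattern matches the whole string."""
    return ORIGINAL.fullmatch(text) is not None


for sample in ["hello word1", "forbidden hello word1", "hello forbidden word1"]:
    # All three print True; the last one is the unwanted match.
    print(f"{sample!r}: {original_matches(sample)}")
```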

Upvotes: 3

Views: 1289

Answers (3)

bobble bubble

Reputation: 18490

Have a look at word boundaries: \bword can never touch a word character on its left, so a (?<!forbidden) placed directly in front of \b is always satisfied.

To disallow (word1|word2|word3) if preceded by forbidden and

  • one \W (non word character)

    ^.*?\b(?<!forbidden\W)(word1|word2|word3)\b.*
    

    See this demo at regex101

  • multiple \W

    Lookbehinds need to be of fixed length in Python's re module. One way around this is to put \W* outside the lookbehind and precede it with (?<!\W), which pins the position from which to look behind.

    ^.*?(?<!forbidden)(?<!\W)\W*\b(word1|word2|word3)\b.*
    

    Regex101 demo (in the multiline demo I used [^\w\n] instead of \W so as not to skip over lines)

    Certainly, a variable-width lookbehind such as (?<!forbidden\W+) would be more convenient. The PyPI regex module (import regex as re) supports lookbehind of variable length: see this demo

Note: If you do not need to capture anything, non-capturing groups (?: can be used as well.
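The difference between the two patterns above can be checked with a minimal sketch in Python's standard re module. The names ONE_SEP, MANY_SEP and allowed are hypothetical and used only for illustration. The interesting case is the last sample, where two spaces separate forbidden from word1.

```python
import re

# Variant 1: rejects "forbidden" followed by exactly one non-word character.
ONE_SEP = re.compile(r"^.*?\b(?<!forbidden\W)(word1|word2|word3)\b.*")

# Variant 2: rejects "forbidden" followed by any run of non-word characters.
MANY_SEP = re.compile(r"^.*?(?<!forbidden)(?<!\W)\W*\b(word1|word2|word3)\b.*")


def allowed(pattern: re.Pattern, text: str) -> bool:
    """Return True if the pattern matches from the start of the text."""
    return pattern.match(text) is not None


samples = [
    "hello word1",             # allowed by both
    "forbidden hello word1",   # allowed by both
    "hello forbidden word1",   # rejected by both
    "hello forbidden  word1",  # two spaces: only MANY_SEP rejects it
]
for sample in samples:
    print(f"{sample!r}: one={allowed(ONE_SEP, sample)} many={allowed(MANY_SEP, sample)}")
```

With the single-separator variant, the two-space sample still matches because the lookbehind only checks one \W before the word.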

Upvotes: 3

Chicky

Reputation: 328

If what you want is to match the entire string, try this:

Regex test

^(.(?<!forbidden (word1|word2|word3)\b))*((?<!forbidden )\b(word1|word2|word3)\b)+(.(?<!forbidden (word1|word2|word3)\b))*$

The idea comes from this thread: Regular expression to match a line that doesn't contain a word

I've just reversed the direction of the look-around.

^(.(?<!forbidden (word1|word2|word3)\b))* discards any string that contains the pattern forbidden (word1|word2|word3)

((?<!forbidden )\b(word1|word2|word3)\b) is what you defined
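A small sketch to check what this pattern accepts, using Python's standard re module (the names WHOLE_LINE and is_allowed are made up for this example). One thing it shows: the pattern rejects any string that contains forbidden followed by a listed word, even when another listed word appears elsewhere in the string.

```python
import re

# The full-string pattern from this answer, unchanged.
WHOLE_LINE = re.compile(
    r"^(.(?<!forbidden (word1|word2|word3)\b))*"
    r"((?<!forbidden )\b(word1|word2|word3)\b)+"
    r"(.(?<!forbidden (word1|word2|word3)\b))*$"
)


def is_allowed(text: str) -> bool:
    """Return True if the whole string is accepted by the pattern."""
    return WHOLE_LINE.match(text) is not None


print(is_allowed("hello word1"))            # True
print(is_allowed("forbidden hello word1"))  # True
print(is_allowed("hello forbidden word1"))  # False
print(is_allowed("forbidden word1 word2"))  # False: contains "forbidden word1"
```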

But I can't understand why you need this requirement.

Upvotes: 0

razian

Reputation: 9

This one seems to work well:

^.*\b(?!(?:forbidden|word[1-3])\b)\w+ (word[1-3]).*$

\b(?!(?:forbidden|word[1-3])\b)\w+ matches the word immediately before the target and requires that it is neither forbidden nor one of word[1-3].

So it matches hi forbidden hello word1 test but not hi hello forbidden word2 test.
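The two cases above can be verified with a brief sketch in Python's standard re module (PRECEDED and accepts are illustrative names, not part of the answer). It also shows a side effect worth knowing: because the pattern needs some word followed by a space in front of the target, a string consisting of word1 alone is not matched.

```python
import re

# The pattern from this answer, unchanged.
PRECEDED = re.compile(r"^.*\b(?!(?:forbidden|word[1-3])\b)\w+ (word[1-3]).*$")


def accepts(text: str) -> bool:
    """Return True if the pattern matches the string."""
    return PRECEDED.match(text) is not None


print(accepts("hi forbidden hello word1 test"))  # True
print(accepts("hi hello forbidden word2 test"))  # False
print(accepts("hello word1"))                    # True
print(accepts("word1"))                          # False: no preceding word
```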

Upvotes: 0
