mevers303

Reputation: 462

Regex that matches a word but only if another word doesn't appear?

I'm usually pretty good with Regex but I'm struggling with this one. I need a regular expression that matches the term cbd but not if the phrase central business district appears anywhere else in the search string. Or if that is too difficult, at least matches cbd if the phrase central business district doesn't appear anywhere before the term cbd. Only the cbd part should be returned as the result, so I'm using lookaheads/lookbehinds, but I have not been able to meet the requirements...

Input examples:
GOOD   Any products containing CBD are to be regulated.
BAD    Properties located within the Central Business District (CBD) are to be regulated

I have tried:

This is in Python 3.6+ using the re module.

I know this would be easy to accomplish with a couple of lines of code, but we keep a list of regex strings in a database and search a corpus for documents that match any one of them. I want to avoid hard-coding keywords into the scripts, because our other developers would not be able to tell where the matches come from if the rule isn't visible in the database.

Upvotes: 3

Views: 1285

Answers (1)

Ryszard Czech

Reputation: 18611

Use the PyPI regex module, which, unlike the standard re module, supports variable-length lookbehind:

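import regex  # third-party module: pip install regex

strings = [' I need a regular expression that matches the term cbd but not if the phrase central business district appears anywhere else in the search string.', 'I need cbd here.']
for s in strings:
    # Match 'cbd' only if the phrase appears neither before nor after it.
    # regex.S lets '.' match newlines too, so multi-line strings are covered.
    x = regex.search(r'(?<!central business district.*)cbd(?!.*central business district)', s, regex.S)
    if x:
        print(s, x.group(), sep=" => ")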

Results: I need cbd here. => cbd (only the second string matches; the first contains the phrase after cbd).

Explanation

--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    central business         'central business district'
    district
--------------------------------------------------------------------------------
    .*                       any character, including \n because
                             regex.S is set (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  cbd                      'cbd'
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    .*                       any character, including \n because
                             regex.S is set (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
    central business         'central business district'
    district
--------------------------------------------------------------------------------
  )                        end of look-ahead
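
If installing the third-party module is not an option, the same check can be approximated with the standard re module. This is only a sketch under a few assumptions, not a drop-in replacement: the phrase test moves into a lookahead anchored at the start of the string (so it covers text both before and after cbd), the term is returned in capture group 1 rather than as the whole match, and only the first cbd is found. The re.I flag and the \b word boundaries are my additions so that the question's mixed-case examples work, and the names PATTERN and find_cbd are hypothetical.

```python
import re

# Hypothetical pattern: \A pins the negative lookahead to the start of the
# string, so the phrase is rejected wherever it appears.
# re.I (assumption) makes both the phrase and the term case-insensitive;
# re.S lets '.' cross newlines.
PATTERN = re.compile(
    r"\A(?!.*central business district).*?\b(cbd)\b",
    re.I | re.S,
)


def find_cbd(text):
    """Return the first 'cbd' occurrence, or None if the term is absent
    or the phrase 'central business district' appears anywhere."""
    m = PATTERN.search(text)
    return m.group(1) if m else None


samples = [
    "Any products containing CBD are to be regulated.",
    "Properties located within the Central Business District (CBD) are to be regulated",
]
for s in samples:
    print(find_cbd(s))
```

Because the whole match (group 0) now includes the text leading up to the term, code that reads the result has to use group 1, which may or may not fit a pipeline that treats every database pattern the same way.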

Upvotes: 3
