Reputation: 462
I'm usually pretty good with Regex but I'm struggling with this one. I need a regular expression that matches the term cbd but not if the phrase central business district appears anywhere else in the search string. Or if that is too difficult, at least matches cbd if the phrase central business district doesn't appear anywhere before the term cbd. Only the cbd part should be returned as the result, so I'm using lookaheads/lookbehinds, but I have not been able to meet the requirements...
Input examples:
GOOD
Any products containing CBD are to be regulated.
BAD
Properties located within the Central Business District (CBD) are to be regulated
I have tried:
(?!central business district)cbd
(.*(?!central business district).*)cbd
This is in Python 3.6+ using the re module.
I know it would be easy to accomplish with a couple lines of code, but we have a list of regex strings in a database that we are using to search a corpus for documents that contain any one of the regex strings from the DB. It is best to avoid hard-coding any keywords into the scripts because then it would not be clear to our other developers where these matches are coming from because they can't see it in the database.
Upvotes: 3
Views: 1285
Reputation: 18611
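Use the PyPI regex module, which (unlike the built-in re module) supports variable-width lookbehinds:
    import regex

    strings = [' I need a regular expression that matches the term cbd but not if the phrase central business district appears anywhere else in the search string.', 'I need cbd here.']
    for s in strings:
        # regex.S makes . match newlines too, so the phrase is detected across lines
        x = regex.search(r'(?<!central business district.*)cbd(?!.*central business district)', s, regex.S)
        if x:
            # x.group() is only the matched 'cbd' text
            print(s, x.group(), sep=" => ")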
Results: I need cbd here. => cbd
Explanation
--------------------------------------------------------------------------------
(?<!                           look behind to see if there is not:
--------------------------------------------------------------------------------
  central business district    'central business district'
--------------------------------------------------------------------------------
  .*                           any character, including \n because of the
                               regex.S flag (0 or more times, matching the
                               most amount possible)
--------------------------------------------------------------------------------
)                              end of look-behind
--------------------------------------------------------------------------------
cbd                            'cbd'
--------------------------------------------------------------------------------
(?!                            look ahead to see if there is not:
--------------------------------------------------------------------------------
  .*                           any character, including \n because of the
                               regex.S flag (0 or more times, matching the
                               most amount possible)
--------------------------------------------------------------------------------
  central business district    'central business district'
--------------------------------------------------------------------------------
)                              end of look-ahead
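If installing the third-party regex module is not an option, a rough alternative with the built-in re module is to anchor a single negative lookahead at the start of the string and capture cbd in a group. This is a different technique from the lookbehind above, offered only as a sketch. The helper name find_cbd is hypothetical, and the case-insensitive flag is an assumption based on the question's sample inputs, which use "CBD" and "Central Business District".

```python
import re

# Assumption: matching should be case-insensitive, because the question's
# sample inputs use "CBD" and "Central Business District".
# \A anchors at the start of the string, so the negative lookahead scans the
# whole string for the phrase; re.DOTALL lets . cross newlines.
PATTERN = re.compile(
    r"\A(?!.*central business district).*?(cbd)",
    re.IGNORECASE | re.DOTALL,
)


def find_cbd(text):
    """Return the matched 'cbd' text, or None if there is no valid match.

    Hypothetical helper for illustration only.
    """
    m = PATTERN.search(text)
    # Group 1 holds only the 'cbd' part; group 0 would include the prefix.
    return m.group(1) if m else None


good = "Any products containing CBD are to be regulated."
bad = "Properties located within the Central Business District (CBD) are to be regulated"

print(find_cbd(good))  # CBD
print(find_cbd(bad))   # None
```

The trade-off is that the caller has to read group 1 rather than the whole match, so this only fits if the code that consumes the database patterns can be told to do that.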
Upvotes: 3