Reputation: 33
I am building an NLP baseline script in a Jupyter Notebook that should extract every mention of 'embolism' from reports. However, when the word 'no' or 'not' occurs in the same line/sentence, I do not want the mention included. This is easy with regex when you know where the negation will occur, but there can be many words in between.
This is the regex for excluding 'no embolism' when the two words are directly adjacent:
result = re.findall(r'(?<!no )(embolism?\w)', text)
When I extend the look-behind to allow multiple words in between, the standard re module raises: "error: look-behind requires fixed-width pattern"
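A minimal sketch that reproduces the error, assuming the look-behind is extended with `.*` to allow any number of words in between (the exact extended pattern is an assumption, chosen only for illustration):

```python
import re

# Fixed-width look-behind: "no " is always three characters, so this compiles.
fixed_width = re.compile(r"(?<!no )embolism")

# Variable-width look-behind: ".*" can match any number of characters,
# which the standard-library re module rejects at compile time.
variable_width = r"(?<!no .*)embolism"  # hypothetical extension of the pattern

try:
    re.compile(variable_width)
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)
```

The fixed-width version still lets "no sign of embolism" through, because only the three characters directly before the match are checked.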
I have searched for a way to solve it, but did not find a solution applicable to this problem. I also found that installing the regex package with pip and using it instead of re removes the error. However, I am still wondering whether there is a solution that works with the standard re module?
Best,
Upvotes: 2
Views: 435
Reputation: 163237
You can exclude the last two examples by matching them without capturing, and capture the first example, the one you want to keep, in group 1.
^(?:.*\bnot?\b.*\bembolism\b.*|.*\bembolism\b.*\bnot?\b.*)|(.*\bembolism\b.*)$
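A minimal sketch of how this pattern might be applied in Python. It assumes one sentence per line and uses `re.MULTILINE` so that `^` and `$` work per line; the helper name `lines_with_embolism` and the sample report are made up for illustration.

```python
import re

# Pattern from the answer: lines containing no/not together with embolism
# are matched but not captured; only negation-free lines fill group 1.
PATTERN = re.compile(
    r"^(?:.*\bnot?\b.*\bembolism\b.*|.*\bembolism\b.*\bnot?\b.*)|(.*\bembolism\b.*)$",
    re.MULTILINE,  # assumption: each sentence sits on its own line
)


def lines_with_embolism(text):
    """Return lines that mention 'embolism' without 'no'/'not' on the same line."""
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1) is not None]


# Hypothetical sample report, invented for this example.
sample = "\n".join([
    "Pulmonary embolism in the right lower lobe.",
    "There is no sign of embolism.",
    "An embolism was not seen.",
    "Lungs are clear.",
])

print(lines_with_embolism(sample))
# → ['Pulmonary embolism in the right lower lobe.']
```

Negated lines still produce a match, but group 1 is empty for them, so filtering on `m.group(1)` keeps only the wanted lines. Note that `\bembolism\b` does not match the plural "embolisms"; writing `embolisms?` in its place would be one way to cover both forms.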
Explanation
^  Start of string
(?:  Non capture group
.*\bnot?\b.*\bembolism\b.*  Match first no or not followed by embolism
|  Or
.*\bembolism\b.*\bnot?\b.*  Match it the other way around
)  Close non capture group
|  Or
(.*\bembolism\b.*)  Capture group 1 (what you want to keep) containing embolism
$  End of string
Upvotes: 1