ThomasThiebaud

Reputation: 11989

How to match strings in a whitelist but not in a blacklist when the two partially overlap

I would like to get a match when a string is in the whitelist and not in the blacklist. My problem is that entries in the two lists can overlap. So far I have the whitelist part working using

import re

whitelist = ["but"]
blacklist = ["but now"]

# Correct, I get 'this is a test but\n not really'
re.sub(r"\b(" + r"|".join(whitelist) + r")\b", "\\1\n", "this is a test but not really")

Is there an efficient way to build a regex from the whitelist and the blacklist so that I get this kind of result?

# f is the function I am looking for
efficient_regex = f(whitelist, blacklist)
re.sub(efficient_regex, "\\1\n", "this is a test but now it does not matter")
# Wanted: 'this is a test but now it does not matter' (unchanged)
# And not 'this is a test but\n now it does not matter'

I'm trying to get my head around regular expressions, but I have not been able to make this work so far.

Upvotes: 0

Views: 310

Answers (2)

ThomasThiebaud

Reputation: 11989

I finally found a solution using a single regex. It relies on negative lookahead and negative lookbehind assertions.

import re

whitelist = ["but", "however", "and yet"]
blacklist = ["but now", "anything but", "but it", "but they", "however it", "however they"]

# Both lists can be combined into a single regex:
# a lookbehind excludes blacklisted prefixes, a lookahead excludes blacklisted suffixes
regex = re.compile(r"((?<!anything )but(?! now| it| they)|however(?! it| they)|and yet)")

You can then do the replacements with that one regex:

>>> regex.sub("****", "this is a test but not really")
'this is a test **** not really'

>>> regex.sub("****", "this is a test but now it does not matter")
'this is a test but now it does not matter'

It should be possible to generate that regex from a whitelist and a blacklist too, but I have not tried that yet.
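As a rough sketch of how that generation might look (not something the answer above tested), the helper below builds one guarded alternative per whitelist term. `build_regex` is a hypothetical name, and the sketch assumes every term and phrase starts and ends with a word character, so that `\b` behaves as in the question. For each blacklist phrase containing a term, it emits a negative lookahead placed at the start of the term. When the phrase has text before the term, a lookbehind is nested inside that lookahead, which keeps each lookbehind fixed-width as Python's `re` requires.

```python
import re

def build_regex(whitelist, blacklist):
    """Hypothetical helper: match whitelist terms unless they sit inside a blacklist phrase."""
    alternatives = []
    # Try longer terms first so a short term cannot shadow a longer overlapping one.
    for term in sorted(whitelist, key=len, reverse=True):
        term_re = r"\b" + re.escape(term) + r"\b"
        guards = []
        for phrase in blacklist:
            # One guard per occurrence of the term inside the blacklist phrase.
            for hit in re.finditer(term_re, phrase):
                prefix = phrase[:hit.start()]
                rest = phrase[hit.start():]  # the term plus whatever follows it
                guard = re.escape(rest) + r"\b"
                if prefix:
                    # Fixed-width lookbehind for the text that precedes the term.
                    guard = r"(?<=\b" + re.escape(prefix) + ")" + guard
                guards.append("(?!" + guard + ")")
        alternatives.append("".join(guards) + term_re)
    # A single capturing group, so "\\1" keeps working in replacements.
    return re.compile("(" + "|".join(alternatives) + ")")

whitelist = ["but", "however", "and yet"]
blacklist = ["but now", "anything but", "but it", "but they", "however it", "however they"]
regex = build_regex(whitelist, blacklist)

print(regex.sub("****", "this is a test but not really"))
print(regex.sub("****", "this is a test but now it does not matter"))
```

Unlike the handwritten pattern, this version adds word boundaries, so "but nowhere" still counts as a whitelist hit even though "but now" is blacklisted. Whether that is desirable depends on the data.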

Upvotes: 0

Zeeshan

Reputation: 1166

You could try something like this:

import re

str_list = ['this is a test but not really',
            'this is a test but now it does not matter',
            'now but', 'but but but', 'but now but now']

blacklist_words = ['but now']
whitelist_words = ['but']

# Build one alternation pattern per list, escaping any regex metacharacters
blacklist = re.compile('|'.join(re.escape(word) for word in blacklist_words))
whitelist = re.compile('|'.join(re.escape(word) for word in whitelist_words))

# Keep a string only if it contains a whitelist word and no blacklist phrase
whitelisted_strs = [text for text in str_list
                    if not blacklist.search(text) and whitelist.search(text)]

print(whitelisted_strs)
# ['this is a test but not really', 'now but', 'but but but']
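The filter above keeps or drops whole strings. If the goal is the per-match substitution from the question, one possible variation (a sketch, not part of this answer) is to put the blacklist phrases first in a single alternation and rewrite a match only when the whitelist group took part in it. `make_substituter` and its `suffix` parameter are hypothetical names introduced for illustration.

```python
import re

def make_substituter(whitelist, blacklist):
    """Hypothetical helper: append a suffix to whitelist hits, skip blacklist phrases."""
    parts = []
    if blacklist:
        # Blacklist phrases come first, so they are consumed before a whitelist
        # term gets the chance to match inside them.
        black = "|".join(re.escape(p) for p in sorted(blacklist, key=len, reverse=True))
        parts.append("(?:" + black + ")")
    white = "|".join(re.escape(t) for t in sorted(whitelist, key=len, reverse=True))
    parts.append("(" + white + ")")  # group 1 is set only for whitelist hits
    pattern = re.compile(r"\b(?:" + "|".join(parts) + r")\b")

    def substitute(text, suffix="\n"):
        # Blacklist hits are put back unchanged; whitelist hits get the suffix.
        return pattern.sub(
            lambda m: m.group(1) + suffix if m.group(1) else m.group(0), text
        )

    return substitute

substitute = make_substituter(["but"], ["but now"])
print(repr(substitute("this is a test but not really")))
print(repr(substitute("this is a test but now it does not matter")))
```

One caveat of this design: text consumed by a blacklist phrase is skipped entirely, so a whitelist term that only appears inside such a phrase is never rewritten.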

Upvotes: 1
