Reputation: 11989
I would like to get a match if a string is in whitelist and not in blacklist. My problem is that the two lists can overlap.
So far I have the whitelist working using:
import re
whitelist = ["but"]
blacklist = ["but now"]
# Correct, I get 'this is a test but\n not really'
re.sub(r"\b(" + r"|".join(whitelist) + r")\b", "\\1\n", "this is a test but not really")
Is there an efficient way to build a regex from whitelist and blacklist so that I get this kind of result?
efficient_regex = f(whitelist, blacklist)
re.sub(efficient_regex, "\\1\n", "this is a test but now it does not matter")
# And not 'this is a test but\n now it does not matter'
I'm trying to get my head around regular expressions, but I can't make it work so far.
Upvotes: 0
Views: 310
Reputation: 11989
I finally found a solution using a single regex. It uses a negative lookahead assertion and a negative lookbehind assertion.
whitelist = ["but", "however", "and yet"]
blacklist = ["but now", "anything but", "but it", "but they", "however it", "however they"]
# Can be combined into a single regex
import re
regex = re.compile(r"((?<!anything )but(?! now| it| they)|however(?! it| they)|and yet)")
You can then use a single regex to do the replacements:
>>> regex.sub("****", "this is a test but not really")
'this is a test **** not really'
>>> regex.sub("****", "this is a test but now it does not matter")
'this is a test but now it does not matter'
It should be possible to generate that regex from whitelist and blacklist too, but I have not tried that yet.
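One way that generation might look is sketched below. It is only a sketch under assumptions: `build_regex` is a hypothetical helper name, every entry is assumed to start and end with a word character, and each blacklist phrase is assumed to contain the whitelist word as a whole word. For each such phrase it emits a negative lookahead placed before the word, with a lookbehind nested inside when the phrase has text before the word. This is equivalent to the hand-written lookarounds above, with word boundaries added as in the question.

```python
import re

def build_regex(whitelist, blacklist):
    """Hypothetical helper: one alternative per whitelist entry, guarded by
    lookarounds derived from the blacklist phrases that contain that entry."""
    alternatives = []
    # Longer entries first, so a longer phrase is tried before a shorter one.
    for word in sorted(whitelist, key=len, reverse=True):
        word_re = re.escape(word)
        guards = []
        for phrase in blacklist:
            # Locate the whitelist word inside the blacklist phrase.
            for m in re.finditer(r"\b" + word_re + r"\b", phrase):
                prefix = re.escape(phrase[:m.start()])
                suffix = re.escape(phrase[m.end():])
                behind = "(?<=" + prefix + ")" if prefix else ""
                # "Not (preceded by prefix and followed by word + suffix)".
                guards.append("(?!" + behind + word_re + suffix + r"\b)")
        alternatives.append("".join(guards) + word_re)
    return re.compile(r"\b(" + "|".join(alternatives) + r")\b")

whitelist = ["but", "however", "and yet"]
blacklist = ["but now", "anything but", "but it", "but they",
             "however it", "however they"]
regex = build_regex(whitelist, blacklist)

print(regex.sub("****", "this is a test but not really"))
print(regex.sub("****", "this is a test but now it does not matter"))
```

Because the whole alternation sits in group 1, the replacement string from the question ("\\1\n") should work with the generated regex as well.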
Upvotes: 0
Reputation: 1166
You could try something like this:
import re
str_list = ['this is a test but not really',
            'this is a test but now it does not matter',
            'now but', 'but but but', 'but now but now']
blacklist_words = ['but now']
whitelist_words = ['but']
# building regex pattern
blacklist = re.compile('|'.join([re.escape(word) for word in blacklist_words]))
whitelist = re.compile('|'.join([re.escape(word) for word in whitelist_words]))
# keep only the strings that contain a whitelisted word and no blacklisted phrase
whitelisted_strs = [word for word in str_list
                    if not blacklist.search(word) and whitelist.search(word)]
print(whitelisted_strs)
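The filter above accepts or rejects whole strings. If the goal is substitution, as in the question, the same two escaped lists could be combined into one alternation in which blacklisted phrases are tried first and put back unchanged. This is a sketch under assumptions rather than a definitive implementation: `sub_whitelisted` and the group name `hit` are made-up names, entries are assumed to start and end with a word character, and the replacement is treated as a literal string (backreferences are not expanded).

```python
import re

def sub_whitelisted(text, whitelist, blacklist, replacement):
    """Hypothetical helper: replace whitelisted words unless they are part
    of a blacklisted phrase."""
    def alternation(words):
        # Longest first, so "but now" is tried before "but".
        return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))

    parts = []
    if blacklist:
        # Blacklisted phrases are matched first and consumed as a whole.
        parts.append("(?:" + alternation(blacklist) + ")")
    parts.append("(?P<hit>" + alternation(whitelist) + ")")
    pattern = re.compile(r"\b(?:" + "|".join(parts) + r")\b")

    def keep_or_replace(m):
        # Only a match in the "hit" group is a real whitelist match.
        return replacement if m.group("hit") is not None else m.group(0)

    return pattern.sub(keep_or_replace, text)

for s in ['this is a test but not really',
          'this is a test but now it does not matter',
          'but but but', 'but now but now']:
    print(sub_whitelisted(s, ['but'], ['but now'], '****'))
```

Since the regex engine scans left to right, a blacklisted phrase such as "anything but" swallows the "but" it contains before the whitelist alternative ever sees it, so no lookbehind is needed.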
Upvotes: 1