Reputation: 73
I'm looking for a nice, Pythonic way of filtering one list by a second stop-list, where the items of the second list are matched as substrings of the items in the first.
To be specific: I have list1 of URLs and list2 like:
['microsoft.com', 'ibm.com', 'cnn', '.ru'] etc
The first list of URLs is huge (thousands of items); the second is smaller, around 500-1000 items. A simple match using "in" or sets is not enough, because the items of the second list must be matched as substrings. All I could think of is two "for" loops, but they don't seem to be Pythonic :)
PS: The purpose is to remove the matched items from the first list.
Upvotes: 0
Views: 1434
Reputation: 363717
You can build a single, disjunctive regular expression from the strings to be matched, then use the search method of the RE object to do the matching. Be sure to re.escape the strings before pasting them into the RE, so that characters such as "." are matched literally.
>>> import re
>>> substrings = ['microsoft.com', 'ibm.com', 'cnn', '.ru']
>>> pattern = "(?:%s)" % "|".join(map(re.escape, substrings))
>>> print(pattern)
(?:microsoft\.com|ibm\.com|cnn|\.ru)
>>> pattern = re.compile(pattern)
>>> [x for x in ["www.microsoft.com", "example.com", "foo.ru"]
... if not pattern.search(x)]
['example.com']
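One caveat worth sketching, assuming the stop-list can be empty: joining an empty list yields the pattern (?:), which matches every string, so every URL would be dropped. The helper name filter_urls below is hypothetical and only wraps the steps shown above with a guard for that case.

```python
import re

def filter_urls(urls, substrings):
    """Return the URLs that contain none of the given substrings."""
    # An empty stop-list would produce an empty pattern, which matches
    # every string, so return a copy of the input unchanged in that case.
    if not substrings:
        return list(urls)
    # Escape each substring so "." and similar characters match literally.
    pattern = re.compile("|".join(map(re.escape, substrings)))
    return [url for url in urls if not pattern.search(url)]

urls = ["www.microsoft.com", "example.com", "foo.ru"]
print(filter_urls(urls, ['microsoft.com', 'ibm.com', 'cnn', '.ru']))
print(filter_urls(urls, []))
```

The pattern is compiled once per call, so each URL is scanned with a single search rather than once per stop-list entry.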
Upvotes: 3
Reputation: 3037
Is this what you expected?
one = ['microsoft.com', 'ibm.com', 'cnn', '.ru']
two = ['.com']
# Keep the items of `one` that contain a string from `two`.
filtered = [o for o in one for t in two if t in o]
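Note that the comprehension above keeps the items that contain a substring, and it can repeat an item when several substrings match it. A minimal sketch of the inverse, which removes matching items as the question asks, could use any() with the in operator. The name remove_matching and the sample URLs are hypothetical.

```python
def remove_matching(urls, stop_list):
    """Return the URLs that contain no entry of stop_list as a substring."""
    # any() stops at the first stop-list entry found inside the URL,
    # so each surviving URL appears exactly once in the result.
    return [url for url in urls
            if not any(stop in url for stop in stop_list)]

urls = ["www.microsoft.com", "example.com", "foo.ru", "edition.cnn.com"]
stop_list = ['microsoft.com', 'ibm.com', 'cnn', '.ru']
print(remove_matching(urls, stop_list))
```

This is still two nested loops under the hood, just expressed without explicit "for" statements.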
Upvotes: 0