alex29

Reputation: 73

Filter a list in Python by another list of stop-words (substrings)

I'm looking for a nice Pythonic way to filter one list by a stop-list, where the entries of the stop-list should match as substrings of the items in the first list.

To be specific: I have list1 of URLs and list2 like:

['microsoft.com', 'ibm.com', 'cnn', '.ru'] etc

The first list of URLs is huge (thousands of items); the second list is smaller, around 500-1000 items. A simple match using "in" or sets is not enough, because the items of the second list have to be used for a substring search. All I could think of is two "for" loops, but they don't seem to be Pythonic :)

PS: The purpose is to remove the matched items from the first list.

Upvotes: 0

Views: 1434

Answers (2)

Fred Foo

Reputation: 363717

You can build a single, disjunctive regular expression from the strings to be matched, then use the search method of the RE object to do the matching. Be sure to re.escape the strings before pasting them in the RE.

>>> import re
>>> substrings = ['microsoft.com', 'ibm.com', 'cnn', '.ru']
>>> pattern = "(?:%s)" % "|".join(map(re.escape, substrings))
>>> print(pattern)
(?:microsoft\.com|ibm\.com|cnn|\.ru)
>>> pattern = re.compile(pattern)
>>> [x for x in ["www.microsoft.com", "example.com", "foo.ru"]
...    if not pattern.search(x)]
['example.com']
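
For reuse, the same idea can be wrapped in a small function. This is only a sketch under a few assumptions: the name remove_matching and the sample URLs are made up for illustration, and matching is case-sensitive. It also guards against an empty stop-list, because an empty alternation matches every string and would filter out everything.

```python
import re

def remove_matching(urls, stop_words):
    """Return the URLs that contain none of the stop-word substrings."""
    # Hypothetical helper name, not part of the answer above.
    if not stop_words:
        # An empty alternation would match every string,
        # so treat an empty stop-list as "remove nothing".
        return list(urls)
    # Escape each substring so characters like "." are matched literally.
    pattern = re.compile("|".join(map(re.escape, stop_words)))
    return [u for u in urls if not pattern.search(u)]

# Sample data assumed for illustration only.
urls = ["www.microsoft.com", "example.com", "foo.ru", "edition.cnn.com"]
stop_words = ["microsoft.com", "ibm.com", "cnn", ".ru"]
print(remove_matching(urls, stop_words))  # ['example.com']
```

Compiling the pattern once means each URL is scanned a single time, rather than once per stop-word.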

Upvotes: 3

tuxuday

Reputation: 3037

Is this what you expected?

one = ['microsoft.com', 'ibm.com', 'cnn', '.ru']
two = ['.com']
# Keep only the items of `one` that contain none of the substrings in `two`
filtered = [o for o in one if not any(t in o for t in two)]

Upvotes: 0
