Reputation: 1036
What is the most efficient way to compare two lists and keep only the elements of list A that don't match anything in list B, for very large datasets?
Example:
words = ['shoe brand', 'car brand', 'smoothies for everyone', ...]
filters = ['brand', ...]
# Matching function
results = ['smoothies for everyone']
There have been somewhat similar questions, but I'm currently dealing with 1M+ words and filters, which overloads regular-expression approaches. I used to do a simple 'filters[i] in words[j]' test with while-loops, but that seems awfully inefficient.
Upvotes: 0
Views: 1401
Reputation: 16625
I tried a slightly modified version of @gnibbler's answer: it uses the set intersection operation instead of testing each word's membership one at a time. I believe this version is a bit faster.
>>> words = ['shoe brand', 'car brand', 'smoothies for everyone']
>>> filters = {'brand'}
>>> [w for w in words if not set(w.split()).intersection(filters)]
['smoothies for everyone']
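As a side note (not part of the answer above), set.isdisjoint should give the same result without materializing the intermediate set from w.split(), which may matter at the 1M+ scale mentioned in the question; this is a sketch, not a benchmark:

```python
# Sketch: filters.isdisjoint(...) returns True when the intersection is
# empty, without building a temporary set for each word's tokens.
words = ['shoe brand', 'car brand', 'smoothies for everyone']
filters = {'brand'}

# Keep only the phrases that share no whole word with the filter set.
results = [w for w in words if filters.isdisjoint(w.split())]
print(results)  # ['smoothies for everyone']
```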
Upvotes: 2
Reputation: 304215
You can make filters a set
>>> words = ['shoe brand', 'car brand', 'smoothies for everyone']
>>> filters = {'brand'}
>>> [w for w in words if all(i not in filters for i in w.split())]
['smoothies for everyone']
This works better than your 'filters[i] in words[j]' test
because it won't filter out "smoothies" when "smooth" is in the filter list
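To make that difference concrete, here is a small hypothetical comparison (the word lists are made up) of substring matching versus whole-word set membership:

```python
# Hypothetical data: 'smooth' is a substring of 'smoothies' but not a
# whole word in it.
words = ['shoe brand', 'smoothies for everyone']
filters = {'smooth'}

# Substring test (the filters[i] in words[j] approach): wrongly drops
# 'smoothies for everyone' because 'smooth' occurs inside 'smoothies'.
substring_kept = [w for w in words if all(f not in w for f in filters)]

# Whole-word set membership: keeps it, since no token equals 'smooth'.
word_kept = [w for w in words if all(i not in filters for i in w.split())]

print(substring_kept)  # ['shoe brand']
print(word_kept)       # ['shoe brand', 'smoothies for everyone']
```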
Upvotes: 2