Performance: Comparing two lists in Python for string matches

Question

What is the most efficient way to compare two lists and only keep the elements that are in list A but not B for very large datasets?

Example:

words = ['shoe brand', 'car brand', 'smoothies for everyone', ...]
filters = ['brand', ...]
# Matching function
results = ['smoothies for everyone']

There have been somewhat similar questions, but I'm currently dealing with 1M+ words and filters, which overloads regular expressions. I used to do a simple 'filters[i] in words[j]' test with while loops, but this seems awfully inefficient.

Jiri · Accepted Answer

I tried a slightly modified version of @gnibbler's answer: it uses set intersection instead of a list comprehension. I believe this version is a bit faster.

>>> words = ['shoe brand', 'car brand', 'smoothies for everyone']
>>> filters = {'brand'}
>>> [w for w in words if not set(w.split()).intersection(filters)]
['smoothies for everyone']
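
For larger inputs, the same idea can be wrapped in a small function. This is only a sketch: the names `filter_words` and `blocked` are made up for illustration, and it assumes each filter is a single whitespace-separated word. It uses `isdisjoint`, which can stop at the first shared word and does not build an intersection set for every phrase. I have not timed it against the version above.

```python
def filter_words(words, filters):
    """Keep the phrases that share no whole word with `filters`."""
    # Hypothetical helper: build the lookup set once, outside the loop.
    blocked = frozenset(filters)
    # isdisjoint accepts any iterable and returns False as soon as
    # one word of the phrase is found in `blocked`.
    return [w for w in words if blocked.isdisjoint(w.split())]


words = ['shoe brand', 'car brand', 'smoothies for everyone']
print(filter_words(words, ['brand']))  # ['smoothies for everyone']
```

Note that this matches whole words only, so it is not equivalent to the substring test 'filters[i] in words[j]' from the question: a filter 'brand' would not remove 'shoe brands'.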
