oliver13

Reputation: 1036

Performance: Comparing two lists in python for string matches

What is the most efficient way to compare two very large lists and keep only the elements of list A that do not match any entry in list B?

Example:

words = ['shoe brand', 'car brand', 'smoothies for everyone', ...]
filters = ['brand', ...]
# Matching function
results = ['smoothies for everyone']

There have been somewhat similar questions, but I'm currently dealing with 1M+ words and filters, which overloads a regular-expression approach. I used to do a simple 'filters[i] in words[j]' test with while-loops, but this seems awfully inefficient.

Upvotes: 0

Views: 1401

Answers (2)

Jiri

Reputation: 16625

I tried a slightly modified version of @gnibbler's answer: it uses a set intersection instead of the generator expression inside all(). I believe this version is a bit faster.

>>> words = ['shoe brand', 'car brand', 'smoothies for everyone']
>>> filters = {'brand'}
>>> [w for w in words if not set(w.split()).intersection(filters)]
['smoothies for everyone']
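
A possible variation on the same idea, offered only as a sketch: set.isdisjoint accepts any iterable and can stop at the first shared element, so there is no need to build a set for each entry. The function name keep_unfiltered is hypothetical, and whether this is actually faster on 1M+ entries would have to be measured on the real data.

```python
def keep_unfiltered(words, filters):
    """Return the entries of words that share no whole word with filters."""
    # Build the set once; membership tests are O(1) on average.
    filter_set = set(filters)
    # isdisjoint takes any iterable, so w.split() is passed directly.
    return [w for w in words if filter_set.isdisjoint(w.split())]


words = ['shoe brand', 'car brand', 'smoothies for everyone']
filters = ['brand']
result = keep_unfiltered(words, filters)
print(result)  # ['smoothies for everyone']
```

Like the intersection version, this compares whole whitespace-separated words only.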

Upvotes: 2

John La Rooy

Reputation: 304215

You can make filters a set:

>>> words = ['shoe brand', 'car brand', 'smoothies for everyone']
>>> filters = {'brand'}
>>> [w for w in words if all(i not in filters for i in w.split())]
['smoothies for everyone']

This works better than your 'filters[i] in words[j]' test because it won't filter out "smoothies" if "smooth" is in the filter list.
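
To make that difference concrete, here is a small side-by-side sketch. The sample entry 'smooth jazz' and the variable names by_substring and by_word are made up for illustration; the substring test is a condensed stand-in for the while-loop approach described in the question.

```python
words = ['smoothies for everyone', 'smooth jazz']
filters = {'smooth'}

# Substring test: 'smooth' occurs inside 'smoothies', so both entries are dropped.
by_substring = [w for w in words if not any(f in w for f in filters)]

# Whole-word test: only entries containing the exact word 'smooth' are dropped.
by_word = [w for w in words if all(i not in filters for i in w.split())]

print(by_substring)  # []
print(by_word)       # ['smoothies for everyone']
```

The whole-word version also does one set lookup per word rather than one substring scan per filter, which matters when the filter list is large.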

Upvotes: 2
