hyhno01

Reputation: 177

How to remove strings containing certain words from list FASTER

There is a list of sentences sentences = ['Ask the swordsmith', 'He knows everything']. The goal is to remove those sentences that contain a word from a wordlist lexicon = ['word', 'every', 'thing']. This can be achieved using the following list comprehension:

newlist = [sentence for sentence in sentences if not any(word in sentence.split(' ') for word in lexicon)]

Note that if not word in sentence is not a sufficient condition, as it would also remove sentences that contain words in which a word from the lexicon is embedded: word is embedded in swordsmith, and every and thing are embedded in everything. The short check below illustrates the difference.
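A minimal check with the first sample sentence makes this concrete:

sentence = 'Ask the swordsmith'
print('word' in sentence)             # True: 'word' is a substring of 'swordsmith'
print('word' in sentence.split(' '))  # False: no whole word equals 'word'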

However, my list of sentences consists of 1,000,000 sentences and my lexicon of 200,000 words. Applying the list comprehension above takes hours! Because of that, I'm looking for a faster method to remove strings from a list that contain words from another list. Any suggestions? Maybe using regex?

Upvotes: 2

Views: 1658

Answers (2)

mathfux

Reputation: 5949

You can optimize three things here:

1. Convert lexicon to a set in order to make the in operation cheap (O(1) on average instead of O(n) for a list).

lexicon = set(lexicon)

2. Check the intersection of each sentence with the lexicon in the most efficient way, using set operations. The performance of set intersection has been discussed here.

[x for x in sentences if set(x.split(' ')).isdisjoint(lexicon)]

3. Use filter instead of a list comprehension.

list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))

Final code:

lexicon = set(lexicon)
list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))
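For reference, a minimal usage sketch with the sample data from the question (the expected output is my own annotation):

sentences = ['Ask the swordsmith', 'He knows everything']
lexicon = set(['word', 'every', 'thing'])
print(list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences)))
# ['Ask the swordsmith', 'He knows everything'] (both kept: lexicon words appear only as substrings)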

Results

import re

def removal_0(sentences, lexicon):
    # regex tokenization (the approach from the other answer below)
    lexicon = set(lexicon)
    pattern = re.compile(r'\w+')
    return [s for s in sentences if not any(m.group() in lexicon for m in pattern.finditer(s))]

def removal_1(sentences, lexicon):
    # list comprehension with set.isdisjoint
    lexicon = set(lexicon)
    return [x for x in sentences if set(x.split(' ')).isdisjoint(lexicon)]

def removal_2(sentences, lexicon):
    # filter with set.isdisjoint
    lexicon = set(lexicon)
    return list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))

%timeit removal_0(sentences, lexicon)
%timeit removal_1(sentences, lexicon)
%timeit removal_2(sentences, lexicon)

9.88 µs ± 219 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
2.19 µs ± 55.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
2.76 µs ± 53.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Note: it seems filter is a little bit slower, but I don't know the reason yet.

Upvotes: 1

Mad Physicist

Reputation: 114270

Do your lookup in a set. This makes it fast, and alleviates the containment issue because you only look for whole words in the lexicon.

lexicon = set(lexicon)
newlist = [s for s in sentences if not any(w in lexicon for w in s.split())]

This is pretty efficient because w in lexicon is an O(1) operation, and any short-circuits. The main issue is splitting your sentence into words properly. A regular expression is inevitably going to be slower than a customized solution, but may be the best choice, depending on how robust you want to be against punctuation and the like. For example:

import re

lexicon = set(lexicon)
pattern = re.compile(r'\w+')
newlist = [s for s in sentences if not any(m.group() in lexicon for m in pattern.finditer(s))]
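As a quick illustration of that robustness (the trailing period is my own addition for the example):

import re
s = 'He knows everything.'
print(s.split())              # ['He', 'knows', 'everything.']: the period stays glued to the word
print(re.findall(r'\w+', s))  # ['He', 'knows', 'everything']: punctuation is stripped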

Upvotes: 3
