hyhno01

Reputation: 177

How to remove strings containing certain words from list FASTER

There is a list of sentences sentences = ['Ask the swordsmith', 'He knows everything']. The goal is to remove those sentences that contain a word from a wordlist lexicon = ['word', 'every', 'thing']. This can be achieved using the following list comprehension:

newlist = [sentence for sentence in sentences if not any(word in sentence.split(' ') for word in lexicon)]

Note that if not word in sentence is not a sufficient condition, as it would also remove sentences that contain words in which a word from the lexicon is embedded: word is embedded in swordsmith, and every and thing are embedded in everything. The short check below illustrates the difference.
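A minimal check with the first sample sentence makes this concrete:

sentence = 'Ask the swordsmith'
print('word' in sentence)             # True: 'word' is a substring of 'swordsmith'
print('word' in sentence.split(' '))  # False: no whole word equals 'word'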

However, my list of sentences consists of 1,000,000 sentences and my lexicon of 200,000 words. Applying the list comprehension above takes hours! Because of that, I'm looking for a faster method to remove strings from a list that contain words from another list. Any suggestions? Maybe using regex?

Upvotes: 2

Views: 1658

Answers (2)

mathfux

Reputation: 5949

You can optimize three things here:

1. Convert lexicon to a set in order to make the in operation cheap (O(1) on average instead of O(n) for a list).

lexicon = set(lexicon)

2. Check the intersection of each sentence with the lexicon in the most efficient way, using set operations. The performance of set intersection has been discussed here.

[x for x in sentences if set(x.split(' ')).isdisjoint(lexicon)]

3. Use filter instead of a list comprehension.

list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))

Final code:

lexicon = set(lexicon)
list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))
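For reference, a minimal usage sketch with the sample data from the question (the expected output is my own annotation):

sentences = ['Ask the swordsmith', 'He knows everything']
lexicon = set(['word', 'every', 'thing'])
print(list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences)))
# ['Ask the swordsmith', 'He knows everything'] (both kept: lexicon words appear only as substrings)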

Results

import re

def removal_0(sentences, lexicon):
    # regex tokenization (the approach from the other answer below)
    lexicon = set(lexicon)
    pattern = re.compile(r'\w+')
    return [s for s in sentences if not any(m.group() in lexicon for m in pattern.finditer(s))]

def removal_1(sentences, lexicon):
    # list comprehension with set.isdisjoint
    lexicon = set(lexicon)
    return [x for x in sentences if set(x.split(' ')).isdisjoint(lexicon)]

def removal_2(sentences, lexicon):
    # filter with set.isdisjoint
    lexicon = set(lexicon)
    return list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))

%timeit removal_0(sentences, lexicon)
%timeit removal_1(sentences, lexicon)
%timeit removal_2(sentences, lexicon)

9.88 µs ± 219 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
2.19 µs ± 55.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
2.76 µs ± 53.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Note: it seems filter is a little bit slower, but I don't know the reason yet.

Upvotes: 1

Mad Physicist

Reputation: 114270

Do your lookup in a set. This makes it fast, and alleviates the containment issue because you only look for whole words in the lexicon.

lexicon = set(lexicon)
newlist = [s for s in sentences if not any(w in lexicon for w in s.split())]

This is pretty efficient because w in lexicon is an O(1) operation, and any short-circuits. The main issue is splitting your sentence into words properly. A regular expression is inevitably going to be slower than a customized solution, but may be the best choice, depending on how robust you want to be against punctuation and the like. For example:

import re

lexicon = set(lexicon)
pattern = re.compile(r'\w+')
newlist = [s for s in sentences if not any(m.group() in lexicon for m in pattern.finditer(s))]
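As a quick illustration of that robustness (the trailing period is my own addition for the example):

import re
s = 'He knows everything.'
print(s.split())              # ['He', 'knows', 'everything.']: the period stays glued to the word
print(re.findall(r'\w+', s))  # ['He', 'knows', 'everything']: punctuation is stripped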

Upvotes: 3
