Reputation: 177
There is a list of sentences sentences = ['Ask the swordsmith', 'He knows everything']. The goal is to remove those sentences that contain a word from a wordlist lexicon = ['word', 'every', 'thing']. This can be achieved using the following list comprehension:
newlist = [sentence for sentence in sentences if not any(word in sentence.split(' ') for word in lexicon)]
Note that a check like if not word in sentence is not sufficient, as it would also remove sentences that contain words in which a lexicon word is embedded, e.g. word is embedded in swordsmith, and every and thing are embedded in everything.
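To make the distinction concrete, here is a quick check using only the question's own sample data:
sentences = ['Ask the swordsmith', 'He knows everything']
lexicon = ['word', 'every', 'thing']

# Substring test: 'word' is found inside 'swordsmith', so the sentence
# would be wrongly removed.
print('word' in 'Ask the swordsmith')             # True
# Whole-word test after splitting: 'word' is not one of the tokens.
print('word' in 'Ask the swordsmith'.split(' '))  # False

# The list comprehension above therefore keeps both sentences, since no
# lexicon word appears as a standalone token.
newlist = [sentence for sentence in sentences if not any(word in sentence.split(' ') for word in lexicon)]
print(newlist)  # ['Ask the swordsmith', 'He knows everything']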
However, my list consists of 1,000,000 sentences and my lexicon of 200,000 words, and applying the list comprehension above takes hours! Because of that, I'm looking for a faster method to remove strings from a list that contain words from another list. Any suggestions? Maybe using regex?
Upvotes: 2
Views: 1658
Reputation: 5949
You can optimize three things here:
1. Convert lexicon to a set so that the in membership test is cheap (average O(1)):
lexicon = set(lexicon)
2. Check the intersection of each sentence with lexicon in the most efficient way, using set operations. The performance of set intersection was discussed here.
[x for x in sentences if set(x.split(' ')).isdisjoint(lexicon)]
3. Use filter instead of a list comprehension.
list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))
Final code:
lexicon = set(lexicon)
list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))
Results
import re

def removal_0(sentences, lexicon):
    lexicon = set(lexicon)
    pattern = re.compile(r'\w+')
    return [s for s in sentences if not any(m.group() in lexicon for m in pattern.finditer(s))]

def removal_1(sentences, lexicon):
    lexicon = set(lexicon)
    return [x for x in sentences if set(x.split(' ')).isdisjoint(lexicon)]

def removal_2(sentences, lexicon):
    lexicon = set(lexicon)
    return list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))
%timeit removal_0(sentences, lexicon)
%timeit removal_1(sentences, lexicon)
%timeit removal_2(sentences, lexicon)
9.88 µs ± 219 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
2.19 µs ± 55.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
2.76 µs ± 53.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Note: it seems filter is a little bit slower here, but I don't know the reason yet.
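For reference, the exact sentences and lexicon behind these timings aren't shown above; a minimal driver assuming the question's small example lists (which would be consistent with the microsecond scale) might look like this:
import timeit

# Assumed sample data: the question's two sentences and three lexicon words.
# The actual data used for the timings above is not shown.
sentences = ['Ask the swordsmith', 'He knows everything']
lexicon = ['word', 'every', 'thing']

for fn in (removal_0, removal_1, removal_2):
    elapsed = timeit.timeit(lambda: fn(sentences, lexicon), number=100_000)
    print(fn.__name__, elapsed / 100_000, 'seconds per call')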
Upvotes: 1
Reputation: 114270
Do your lookup in a set. This makes it fast, and alleviates the containment issue because you only look for whole words in the lexicon.
lexicon = set(lexicon)
newlist = [s for s in sentences if not any(w in lexicon for w in s.split())]
This is pretty efficient because w in lexicon is an O(1) operation, and any short-circuits. The main issue is splitting your sentence into words properly. A regular expression is inevitably going to be slower than a customized solution, but may be the best choice, depending on how robust you want to be against punctuation and the like. For example:
import re

lexicon = set(lexicon)
pattern = re.compile(r'\w+')
newlist = [s for s in sentences if not any(m.group() in lexicon for m in pattern.finditer(s))]
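As a quick illustration of the punctuation point, using a made-up sentence (not from the question) where a lexicon word is followed by a comma:
import re

lexicon = {'word', 'every', 'thing'}
pattern = re.compile(r'\w+')

# Hypothetical example: split(' ') produces the token 'word,', which does
# not match the lexicon entry 'word', while the regex tokenizer does.
s = 'Pick your word, carefully'
print(any(w in lexicon for w in s.split(' ')))                 # False -- misses 'word,'
print(any(m.group() in lexicon for m in pattern.finditer(s)))  # True  -- finds 'word'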
Upvotes: 3