Thor Correia
Thor Correia

Reputation: 1588

If I have a list of words, how can I check if string does not contain any of the words in the list, and efficiently?

As title says, I have a list of words, Like stopWords = ["the", "and", "with", etc...] and I'm receiving text like "Kill the fox and dog". I want the output like "Kill fox dog" very efficiently and fast. How can I do this (I know I can iterate using a for loop, but thats not very efficient)

Upvotes: 6

Views: 7397

Answers (6)

jboggan
jboggan

Reputation: 63

With Python the fastest operation will be making "stopwords" a set instead of a list and checking directly for membership with "x in stopwords". This structure is designed to be fast for this sort of operation.

See the set documentation

Upvotes: 3

mlt
mlt

Reputation: 1659

As an alternative you can assemble your list in a regex and replace stop words along with surrounding spaces by a single space.

import re
stopWords = ["the", "and", "with"]
input = "Kill the fox and dog"
pattern = "\\s{:s}\\s".format("\\s|\\s".join(stopWords))
print(pattern)
print(re.sub(pattern, " ", input))

will output

\sthe\s|\sand\s|\swith\s
Kill fox dog

Upvotes: 0

Jim Dennis
Jim Dennis

Reputation: 17510

Have your stopWords in a set() (as others have suggested), accumulate your other words into a working set then simply take the set difference using working = working - stopWords ... to have a working set with all of the stopWords filtered out of it. Or just to check of the existence of such words use a conditional. For example:

#!python
stopWords = set('the a an and'.split())
working   = set('this is a test of the one working set dude'.split())
if working == working - stopWords:
    print "The working set contains no stop words"
else:
    print "Actually, it does"

There are actually more efficient data structures, such as a trie which could be used for large, relatively dense, set of stop words. You can find trie modules for Python, though I didn't see any written as binary (C) extensions and I wonder where the cross-over point would be between a trie implemented in pure Python vs. use of Python's set() support. (Might also be a good case for Cython, though).

In fact I see that someone has tackled that question separately here SO: How do I create a fixed length mutable array of python objects in cython.

Ultimately, of course, you should create the simple set-based version, test and profile it, then, if necessary, try trie and Cython-trie variants as possible improvements.

Upvotes: 0

user845279
user845279

Reputation: 2804

  1. Put your original list of words in a dictionary.
  2. Iterate through the characters in the given string, using space as a delimiter for a word. Look up each word in the dictionary.

Upvotes: 0

John La Rooy
John La Rooy

Reputation: 304355

The most imporant improvement is to make stopWords a set. This means the lookups will be very fast

stopWords = set(["the", "and", "with", etc...])
" ".join(word for word in msg.split() if word not in stopWords)

If you just want to know if any of the stopWords are in the text

if any(word in stopWords for word in msg.split()):
    ...

Upvotes: 10

Levon
Levon

Reputation: 143102

Using list comprehension:

stopWords = ["the", "and", "with"]
msg = "kill the fox and the dog"

' '.join([w for w in msg.split() if w not in stopWords])

gives:

'kill fox dog'

Upvotes: 1

Related Questions