Reputation: 1588
As title says, I have a list of words, Like stopWords = ["the", "and", "with", etc...]
and I'm receiving text like "Kill the fox and dog". I want the output like "Kill fox dog" very efficiently and fast. How can I do this (I know I can iterate using a for loop, but thats not very efficient)
Upvotes: 6
Views: 7397
Reputation: 63
With Python the fastest operation will be making "stopwords" a set instead of a list and checking directly for membership with "x in stopwords". This structure is designed to be fast for this sort of operation.
Upvotes: 3
Reputation: 1659
As an alternative you can assemble your list in a regex and replace stop words along with surrounding spaces by a single space.
import re
stopWords = ["the", "and", "with"]
input = "Kill the fox and dog"
pattern = "\\s{:s}\\s".format("\\s|\\s".join(stopWords))
print(pattern)
print(re.sub(pattern, " ", input))
will output
\sthe\s|\sand\s|\swith\s
Kill fox dog
Upvotes: 0
Reputation: 17510
Have your stopWords in a set()
(as others have suggested), accumulate your other words into a working set then simply take the set difference using working = working - stopWords
... to have a working set with all of the stopWords filtered out of it. Or just to check of the existence of such words use a conditional. For example:
#!python
stopWords = set('the a an and'.split())
working = set('this is a test of the one working set dude'.split())
if working == working - stopWords:
print "The working set contains no stop words"
else:
print "Actually, it does"
There are actually more efficient data structures, such as a trie which could be used for large, relatively dense, set of stop words. You can find trie modules for Python, though I didn't see any written as binary (C) extensions and I wonder where the cross-over point would be between a trie implemented in pure Python vs. use of Python's set()
support. (Might also be a good case for Cython, though).
In fact I see that someone has tackled that question separately here SO: How do I create a fixed length mutable array of python objects in cython.
Ultimately, of course, you should create the simple set-based version, test and profile it, then, if necessary, try trie and Cython-trie variants as possible improvements.
Upvotes: 0
Reputation: 2804
Upvotes: 0
Reputation: 304355
The most imporant improvement is to make stopWords a set
. This means the lookups will be very fast
stopWords = set(["the", "and", "with", etc...])
" ".join(word for word in msg.split() if word not in stopWords)
If you just want to know if any of the stopWords are in the text
if any(word in stopWords for word in msg.split()):
...
Upvotes: 10
Reputation: 143102
Using list comprehension:
stopWords = ["the", "and", "with"]
msg = "kill the fox and the dog"
' '.join([w for w in msg.split() if w not in stopWords])
gives:
'kill fox dog'
Upvotes: 1