Reputation: 25357
I am using nltk
to generate n-grams from sentences by first removing given stop words. However, nltk.pos_tag()
is extremely slow taking up to 0.6 sec on my CPU (Intel i7).
The output:
['The first time I went, and was completely taken by the live jazz band and atmosphere, I ordered the Lobster Cobb Salad.']
0.620481014252
["It's simply the best meal in NYC."]
0.640982151031
['You cannot go wrong at the Red Eye Grill.']
0.644664049149
The code:
for sentence in source:
nltk_ngrams = None
if stop_words is not None:
start = time.time()
sentence_pos = nltk.pos_tag(word_tokenize(sentence))
print time.time() - start
filtered_words = [word for (word, pos) in sentence_pos if pos not in stop_words]
else:
filtered_words = ngrams(sentence.split(), n)
Is this really that slow or am I doing something wrong here?
Upvotes: 8
Views: 4369
Reputation: 1
if you use nested list you should flatten and use single list this will hep you in incresing the speed of StanfordPOSTagger
Upvotes: 0
Reputation: 470
If you are looking for another POS tagger with fast performances in Python, you might want to try RDRPOSTagger. For example, on English POS tagging, the tagging speed is 8K words/second for a single threaded implementation in Python, using a computer of Core 2Duo 2.4GHz. You can get faster tagging speed by simply using the multi-threaded mode. RDRPOSTagger obtains very competitive accuracies in comparison to state-of-the-art taggers and now supports pre-trained models for 40 languages. See experimental results in this paper.
Upvotes: 0
Reputation: 101
nltk pos_tag is defined as:
from nltk.tag.perceptron import PerceptronTagger
def pos_tag(tokens, tagset=None):
tagger = PerceptronTagger()
return _pos_tag(tokens, tagset, tagger)
so each call to pos_tag instantiates the perceptrontagger module which takes much of the computation time.You can save this time by directly calling tagger.tag yourself as:
from nltk.tag.perceptron import PerceptronTagger
tagger=PerceptronTagger()
sentence_pos = tagger.tag(word_tokenize(sentence))
Upvotes: 8
Reputation: 121992
Use pos_tag_sents
for tagging multiple sentences:
>>> import time
>>> from nltk.corpus import brown
>>> from nltk import pos_tag
>>> from nltk import pos_tag_sents
>>> sents = brown.sents()[:10]
>>> start = time.time(); pos_tag(sents[0]); print time.time() - start
0.934092998505
>>> start = time.time(); [pos_tag(s) for s in sents]; print time.time() - start
9.5061340332
>>> start = time.time(); pos_tag_sents(sents); print time.time() - start
0.939551115036
Upvotes: 10