Jesvin Jose

Reputation: 23078

Slow performance of POS tagging. Can I do some kind of pre-warming?

I am using NLTK to POS-tag hundreds of tweets in a web request. As you know, Django instantiates a request handler for each request.

I noticed this: for a request (~200 tweets), the first tweet needs ~18 seconds to tag, while all subsequent tweets need ~120 milliseconds to tag. What can I do to speed up the process?

Can I do a "pre-warming request" so that the module data is already loaded for each request?

import nltk

class MyRequestHandler(BaseHandler):
    def read(self, request):  # this runs for a GET request
        # ...
        for tweet in tweets:  # `tweets` comes from the elided code above
            tokens = nltk.word_tokenize(tweet)
            tagged = nltk.pos_tag(tokens)

Upvotes: 14

Views: 6263

Answers (3)

alexmloveless

Reputation: 246

As stated previously, NLTK unpickles the tagger every time you use the standard pos_tag method. For NLTK 3.1, assuming you're happy with NLTK's default tagger (PerceptronTagger), the following method works for me:

First load the tagger:

from nltk.tag.perceptron import PerceptronTagger
tagger = PerceptronTagger()

Then, every time you need to tag a bit of text:

import nltk

tagset = None
tokens = nltk.word_tokenize('the mat sat on the cat')
tags = nltk.tag._pos_tag(tokens, tagset, tagger)

This basically bypasses the main method. It sped things up hundreds of times for me. I assume the same method works for any of the taggers.
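
A minimal sketch of reusing one tagger across a whole batch of tweets. `tag_tweets` is a hypothetical helper, not part of NLTK. It only assumes the tagger object has a `tag(tokens)` method (which `PerceptronTagger` does) and takes the tokenizer as an argument, so the example runs without NLTK's data files. `DummyTagger` is a stand-in that exists only to make the sketch runnable.

```python
def tag_tweets(tweets, tagger, tokenize=str.split):
    """Tag every tweet with the same, already-loaded tagger.

    tagger   -- any object with a tag(tokens) method, e.g. a PerceptronTagger
    tokenize -- callable turning a string into tokens; pass nltk.word_tokenize
                in real use (str.split is only a dependency-free default)
    """
    return [tagger.tag(tokenize(tweet)) for tweet in tweets]


class DummyTagger:
    """Stand-in tagger for illustration only: labels every token 'NN'."""

    def tag(self, tokens):
        return [(token, "NN") for token in tokens]


if __name__ == "__main__":
    print(tag_tweets(["the mat sat on the cat"], DummyTagger()))
```

With NLTK installed, the call would presumably look like `tag_tweets(tweets, tagger, nltk.word_tokenize)`, where `tagger` is the `PerceptronTagger()` created once at startup.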

Upvotes: 19

thanos

Reputation: 732

nltk's POS tagger is really slow:

For me, 13739 tweets take about 243 seconds in total. Time per stage, in seconds:

  1. sent_tokenize 1.06190705299
  2. word_tokenize 4.86865639687
  3. pos_tag 233.487122536
  4. chunker 3.05982065201
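
Per-stage numbers like these could be collected with something along the following lines. This is only a sketch, not necessarily how the figures above were measured: `time_stages` is a hypothetical helper, and the two stages in the usage example are stand-ins for `sent_tokenize`, `word_tokenize`, `pos_tag` and the chunker.

```python
import time


def time_stages(items, stages):
    """Run items through each (name, func) stage in order.

    Each stage is applied to every element produced by the previous stage.
    Returns (timings, data): seconds spent per stage, and the final results.
    """
    timings = {}
    data = list(items)
    for name, func in stages:
        start = time.perf_counter()
        data = [func(item) for item in data]
        timings[name] = time.perf_counter() - start
    return timings, data


if __name__ == "__main__":
    # Stand-in stages; swap in the real NLTK functions to profile them.
    stages = [("split", str.split), ("count", len)]
    timings, results = time_stages(["the mat sat", "on the cat"], stages)
    for name, seconds in timings.items():
        print(name, seconds)
```

Timing each stage separately is what makes it obvious that tagging, not tokenizing, dominates the total.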

See http://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/ for details, but to summarize:

Tagger     | Accuracy | Time (130k words)
-----------+----------+------------------
CyGreedyAP |    97.1% | 4s
NLTK       |    94.0% | 3m56s
Pattern    |    93.5% | 26s
PyGreedyAP |    96.8% | 12s

Upvotes: 5

Jacob

Reputation: 4182

Those first 18 seconds are the POS tagger being unpickled from disk into RAM. If you want to get around this, load the tagger yourself outside of a request function.

import nltk.data, nltk.tag
tagger = nltk.data.load(nltk.tag._POS_TAGGER)

And then replace nltk.pos_tag with tagger.tag. The tradeoff is that app startup will now take about 18 seconds longer.
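
A rough sketch of the load-once pattern, assuming a cached accessor is acceptable in place of a bare module-level variable. `get_tagger`, `handle_request`, `_StandInTagger` and `load_count` are hypothetical names introduced for illustration. The stand-in class replaces the real, slow load so the example runs without NLTK.

```python
import functools

load_count = 0  # only here to show how often the expensive step runs


class _StandInTagger:
    """Illustration only: mimics a tagger's tag(tokens) method."""

    def tag(self, tokens):
        return [(token, "NN") for token in tokens]


@functools.lru_cache(maxsize=None)
def get_tagger():
    """Return the shared tagger, building it on the first call only."""
    global load_count
    load_count += 1
    # Stand-in for the slow step; in the real app this would be something
    # like: return nltk.data.load(nltk.tag._POS_TAGGER)
    return _StandInTagger()


def handle_request(tweets):
    """Hypothetical request handler that reuses the cached tagger."""
    tagger = get_tagger()
    return [tagger.tag(tweet.split()) for tweet in tweets]


# Pre-warm at import time so the first real request does not pay the cost.
get_tagger()
```

Note that the cache lives in the process: if the app runs several worker processes, each one pays the load cost once.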

Upvotes: 22
