Prabhjot Singh Rai

Reputation: 2545

How to change number of iterations in maxent classifier for POS Tagging in NLTK?

I am trying to perform POS tagging using ClassifierBasedPOSTagger with classifier_builder=MaxentClassifier.train. Here is the piece of code:

from nltk.tag.sequential import ClassifierBasedPOSTagger
from nltk.classify import MaxentClassifier
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)

train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train)
print(me_tagger.evaluate(test_sents))

After an hour, the script is still inside the ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train) call. So far the output shows:

  ==> Training (100 iterations)

  Iteration    Log Likelihood    Accuracy
  ---------------------------------------
         1          -5.35659        0.007
         2          -0.85922        0.953
         3          -0.56125        0.986

I think it is going to run all 100 iterations before the classifier is ready to tag any input, which I suppose would take a whole day. Why is it taking so long? Would reducing the number of iterations make this code practical (that is, cut the training time while keeping the tagger useful), and if so, how do I reduce it?

EDIT

After 1.5 hours, I get the following output:

  ==> Training (100 iterations)

  Iteration    Log Likelihood    Accuracy
  ---------------------------------------
         1          -5.35659        0.007
         2          -0.85922        0.953
         3          -0.56125        0.986
E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\classify\maxent.py:1310: RuntimeWarning: overflow encountered in power
  exp_nf_delta = 2 ** nf_delta
E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\classify\maxent.py:1312: RuntimeWarning: invalid value encountered in multiply
  sum1 = numpy.sum(exp_nf_delta * A, axis=0)
E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\classify\maxent.py:1313: RuntimeWarning: invalid value encountered in multiply
  sum2 = numpy.sum(nf_exp_nf_delta * A, axis=0)
         Final               nan        0.991
0.892155885577594

Was the algorithm supposed to run all 100 iterations, as stated in the first line of the output, and did it stop early because of the error? And is there any way to reduce the training time?

Upvotes: 1

Views: 1256

Answers (1)

RAVI

Reputation: 3153

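You can pass max_iter to MaxentClassifier.train to cap the number of iterations. Because classifier_builder is called with only the training features, wrap the call in a lambda that supplies max_iter.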

Code:

from nltk.tag.sequential import ClassifierBasedPOSTagger
from nltk.classify import MaxentClassifier
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
# Only 5% of the sentences are used for training here so the run finishes quickly;
# change the fraction based on your requirement
size = int(len(brown_tagged_sents) * 0.05)
print("size:",size)

train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

#me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train)
me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=lambda train_feats: MaxentClassifier.train(train_feats, max_iter=15))
print(me_tagger.evaluate(test_sents))

Output:

('size:', 231)
  ==> Training (15 iterations)

  Iteration    Log Likelihood    Accuracy
  ---------------------------------------
         1          -4.67283        0.013
         2          -0.89282        0.964
         3          -0.56137        0.998
         4          -0.40573        0.999
         5          -0.31761        0.999
         6          -0.26107        0.999
         7          -0.22175        0.999
         8          -0.19284        0.999
         9          -0.17067        0.999
        10          -0.15315        0.999
        11          -0.13894        0.999
        12          -0.12719        0.999
        13          -0.11730        0.999
        14          -0.10887        0.999
     Final          -0.10159        0.999
0.787489765499
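
The lambda above can also be written with functools.partial. The sketch below is only an illustration of that pattern: fake_train is a hypothetical stand-in (not part of NLTK) that records its arguments, so the snippet runs without NLTK or the Brown corpus. With NLTK installed, MaxentClassifier.train would take its place.

```python
from functools import partial


def fake_train(train_toks, trace=3, **cutoffs):
    """Hypothetical stand-in for MaxentClassifier.train; it only records its arguments."""
    return {"n_toks": len(train_toks), "trace": trace, "cutoffs": cutoffs}


# Bind the keyword arguments once. The result still accepts a single positional
# argument, which is how the tagger calls classifier_builder.
builder = partial(fake_train, max_iter=15, trace=0)

# Toy (featureset, label) pairs, made up for this example.
toy_feats = [({"word": "the"}, "AT"), ({"word": "dog"}, "NN")]
result = builder(toy_feats)
print(result)
```

MaxentClassifier.train also documents the min_ll and min_lldelta cutoffs, which can be bound the same way if you would rather stop when the log likelihood stops improving than after a fixed number of iterations.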

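Regarding your edit:

Those messages are RuntimeWarnings, not errors, so the script kept running and still printed the evaluation score.

Once the log likelihood evaluated to nan (the row after iteration 3 in your output), training stopped, and that row was reported as Final instead of continuing to 100 iterations.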
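
A minimal sketch of the two stop conditions involved, assuming a simplified loop. This is not NLTK's code: run_training and its list of per-iteration values are hypothetical and only stand in for a real optimiser. The first lines show one plausible way an overflow turns into nan: an overflowed value (inf) multiplied by zero is nan.

```python
import math

# An overflowed value multiplied by zero gives nan under IEEE floating point.
overflowed = float("inf")
print(math.isnan(overflowed * 0.0))


def run_training(log_likelihoods, max_iter=100):
    """Simplified illustration of an iteration loop with two stop conditions.

    log_likelihoods is a hypothetical list of per-iteration values; returns
    (number of completed iterations, reason for stopping).
    """
    completed = 0
    for ll in log_likelihoods:
        if completed >= max_iter:
            return completed, "max_iter reached"
        if math.isnan(ll):
            return completed, "log likelihood is nan"
        completed += 1
    return completed, "ran out of values"


# Values taken from the question's output, followed by the nan.
print(run_training([-5.35659, -0.85922, -0.56125, float("nan")]))
# The same loop stops on the iteration cap when max_iter is small.
print(run_training([-5.0, -1.0, -0.5], max_iter=2))
```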

Upvotes: 5
