Reputation: 2545
I am trying to perform POS tagging using ClassifierBasedPOSTagger with classifier_builder=MaxentClassifier.train. Here is the piece of code:
from nltk.tag.sequential import ClassifierBasedPOSTagger
from nltk.classify import MaxentClassifier
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train)
print(me_tagger.evaluate(test_sents))
But after an hour of running the code, I see that it is still initialising ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train). In the output, I can see the following training log:
==> Training (100 iterations)
Iteration Log Likelihood Accuracy
---------------------------------------
1 -5.35659 0.007
2 -0.85922 0.953
3 -0.56125 0.986
I think it is going to run all 100 iterations before the classifier is ready to tag parts of speech for any input, and that would take a whole day, I suppose. Why is it taking so much time? And will decreasing the number of iterations make this code a bit more practical (i.e. reduce the time while still being useful enough), and if yes, how do I decrease those iterations?
EDIT
After 1.5 hours, I get the following output:
==> Training (100 iterations)
Iteration Log Likelihood Accuracy
---------------------------------------
1 -5.35659 0.007
2 -0.85922 0.953
3 -0.56125 0.986
E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\classify\maxent.py:1310: RuntimeWarning: overflow encountered in power
exp_nf_delta = 2 ** nf_delta
E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\classify\maxent.py:1312: RuntimeWarning: invalid value encountered in multiply
sum1 = numpy.sum(exp_nf_delta * A, axis=0)
E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\classify\maxent.py:1313: RuntimeWarning: invalid value encountered in multiply
sum2 = numpy.sum(nf_exp_nf_delta * A, axis=0)
Final nan 0.991
0.892155885577594
Was the algorithm supposed to run for 100 iterations, as specified in the first line of the output, and did it stop early because of the error? And is there any possible way of reducing the time the training took?
Upvotes: 1
Views: 1256
Reputation: 3153
You can set the max_iter parameter to the desired number of iterations.
Code:
from nltk.tag.sequential import ClassifierBasedPOSTagger
from nltk.classify import MaxentClassifier
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
# Change size based on your requirement
size = int(len(brown_tagged_sents) * 0.05)
print("size:",size)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
#me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train)
# Wrap MaxentClassifier.train in a lambda so max_iter can be passed through,
# capping training at 15 iterations instead of the 100 shown in your log
me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=lambda train_feats: MaxentClassifier.train(train_feats, max_iter=15))
print(me_tagger.evaluate(test_sents))
Output:
('size:', 231)
==> Training (15 iterations)
Iteration Log Likelihood Accuracy
---------------------------------------
1 -4.67283 0.013
2 -0.89282 0.964
3 -0.56137 0.998
4 -0.40573 0.999
5 -0.31761 0.999
6 -0.26107 0.999
7 -0.22175 0.999
8 -0.19284 0.999
9 -0.17067 0.999
10 -0.15315 0.999
11 -0.13894 0.999
12 -0.12719 0.999
13 -0.11730 0.999
14 -0.10887 0.999
Final -0.10159 0.999
0.787489765499
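Once training finishes, the tagger can be applied to any tokenized sentence. A minimal usage sketch (the example sentence is arbitrary; I split on whitespace to avoid needing extra tokenizer data):
# Tag an arbitrary sentence with the trained tagger; the result is a list of (word, tag) pairs
print(me_tagger.tag("The board approved the merger on Friday".split()))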
For the edit:
Those messages are RuntimeWarnings, not errors. After the 4th iteration the Log Likelihood became nan, so training stopped processing further iterations and that one became the final iteration.
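Also, since the slow part is the training itself, a common workaround (plain Python pickling, nothing specific to MaxentClassifier) is to train once, save the tagger to disk, and reload it in later runs instead of retraining. A minimal sketch, assuming the me_tagger from the code above and a hypothetical file name me_tagger.pickle:
import pickle

# Save the trained tagger once ...
with open("me_tagger.pickle", "wb") as f:
    pickle.dump(me_tagger, f)

# ... and reload it in later sessions without retraining
with open("me_tagger.pickle", "rb") as f:
    me_tagger = pickle.load(f)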
Upvotes: 5