ionshards

Reputation: 21

NLTK Maximum Entropy Classifier Raw Score

This is my first question on Stack Overflow, so please bear with me.

I'm doing some corpus building, specifically trying to compose a Khmer/English parallel sentence corpus. I'm using some manually paired sentences to train a maximum entropy classifier, which will choose more parallel sentence pairs from my parallel document corpus.

My problem is that I have very little human-annotated training data with which to train the classifier, so it is not a very good classifier. My teacher therefore proposed that I look at the MaxEnt classifier's raw scores to see whether there is some score threshold above which human judges agree that the sentence pairs classified as translations really are translations of each other.

However, I am using NLTK's MaxEnt classifier, and I cannot find a function that will give me the raw score the classifier used to decide yes or no.

Does NLTK's MaxEnt classifier have this functionality, or is there no way to get at the classifier's raw score? Is there a package I should be using instead, with a better MaxEnt classifier that exposes the raw score?

Thanks in advance for the help and suggestions!!

Upvotes: 2

Views: 6260

Answers (2)

matt

Reputation: 823

You may be interested in reading a recent blog post of mine:

http://mattshomepage.com/#/blog/feb2013/liftingthehood

It's about understanding how the nltk.ne_chunk function works. But here's some code I wrote that you can quickly copy and paste, which you may find helpful:

import nltk

# Load the serialized NEChunkParser object
chunker = nltk.data.load('chunkers/maxent_ne_chunker/english_ace_multiclass.pickle')

# The MaxEnt classifier inside the chunker's tagger
maxEnt = chunker._tagger.classifier()

def ne_report(sentence, report_all=False):

    # Tokenize the sentence and attach POS tags
    tokens = nltk.word_tokenize(sentence)
    tokens = nltk.pos_tag(tokens)

    tags = []
    for i in range(len(tokens)):
        tag = chunker._tagger.choose_tag(tokens, i, tags)
        if tag != 'O' or report_all:
            print("\nExplanation of why the word '" + tokens[i][0] + "' was tagged:")
            # Build the featureset for this token and explain the decision
            featureset = chunker._tagger.feature_detector(tokens, i, tags)
            maxEnt.explain(featureset)
        tags.append(tag)

The report_all flag will let you see how every word was tagged, but you are probably only interested in how the named entities were picked, so it is set to False by default.

Just pass in any sentence you like, such as "I love Apple products.", and it will report back an explanation of why each named entity was picked by the MaxEnt classifier. It will also report some of the probabilities of the other tags that could have been picked.
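For example, a quick usage sketch (the sentence is arbitrary):

ne_report("I love Apple products.")

# Report the reasoning for every token, not just named entities
ne_report("I love Apple products.", report_all=True)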

The developers of NLTK provided an .explain() method, and that is exactly what this function uses.

Upvotes: 0

Fred Foo

Reputation: 363737

prob_classify gives probability scores.
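For example, here is a minimal, self-contained sketch of thresholding on prob_classify output; the feature names and the tiny training set are invented for illustration, not taken from the question:

from nltk.classify import MaxentClassifier

# Toy featuresets: invented indicators of sentence-pair parallelism
train = [
    ({'len_ratio_close': True,  'shared_numbers': True},  'parallel'),
    ({'len_ratio_close': True,  'shared_numbers': False}, 'parallel'),
    ({'len_ratio_close': False, 'shared_numbers': True},  'not_parallel'),
    ({'len_ratio_close': False, 'shared_numbers': False}, 'not_parallel'),
]
classifier = MaxentClassifier.train(train, trace=0)

# prob_classify returns a probability distribution over the labels
dist = classifier.prob_classify({'len_ratio_close': True, 'shared_numbers': True})
for label in dist.samples():
    print(label, dist.prob(label))

# Threshold on the probability instead of just taking the most likely label
if dist.prob('parallel') > 0.9:
    print('confident enough to keep this sentence pair')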

If you're looking for an alternative MaxEnt classifier, then scikit-learn has two implementations of it (one based on liblinear, one using SGD training), both of which can be wrapped in an NLTK SklearnClassifier. scikit-learn calls MaxEnt logistic regression, which is the more common term outside of the NLP community.
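To illustrate, a sketch of that route, using the same invented featuresets as above (LogisticRegression is the liblinear-based implementation; prob_classify works here because the underlying estimator supports predict_proba):

from nltk.classify import SklearnClassifier
from sklearn.linear_model import LogisticRegression

train = [
    ({'len_ratio_close': True,  'shared_numbers': True},  'parallel'),
    ({'len_ratio_close': True,  'shared_numbers': False}, 'parallel'),
    ({'len_ratio_close': False, 'shared_numbers': True},  'not_parallel'),
    ({'len_ratio_close': False, 'shared_numbers': False}, 'not_parallel'),
]
classifier = SklearnClassifier(LogisticRegression()).train(train)

# Same ProbDist interface as NLTK's own classifiers
dist = classifier.prob_classify({'len_ratio_close': True, 'shared_numbers': True})
print(dist.prob('parallel'))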

(I may be biased because I'm a scikit-learn contributor and I wrote SklearnClassifier, but the SciPy folks are now also recommending scikit-learn instead of their own deprecated scipy.maxentropy package, on which MaxentClassifier is based.)

Upvotes: 4
