Train corpus for NER with NLTK ieer or conll2000 corpus

Question

I have been trying to train a Named Entity Recognition model for a specific domain, with new entity types. There does not seem to be a complete, suitable pipeline for this, so different packages have to be combined.

I would like to give NLTK a chance. My question is: how can I train the NLTK NER to classify and match new entities using the ieer corpus?

I will of course provide training data in IOB format, like:

We PRP B-NP
saw VBD O
the DT B-NP
yellow JJ I-NP
dog NN I-NP

I guess I will have to tag the tokens by myself.

Once I have a text file in this format, what are the steps to train on my data with the ieer corpus, or with a better one such as conll2000?

I know there is some documentation out there, but it is not clear to me what to do once you have a tagged training corpus.

I want to go with NLTK because I then want to use its relation extraction (the nltk.sem.relextract module).

Any advice would be appreciated.

Thanks

alexis · Accepted Answer

NLTK provides everything you need. Read chapter 6 of the NLTK book, Learning to Classify Text, which gives you a worked example of classification. Then study sections 2 and 3 of chapter 7, which show you how to work with IOB text and write a chunking classifier. The example application there is not named entity recognition, but the code examples should need almost no changes to work (though of course you'll need a custom feature function to get decent performance).
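To make that concrete, here is a rough sketch of how the chapter 7 classifier-based chunker might be adapted to entity labels. It is not the book's code verbatim: I have swapped in NaiveBayesClassifier so that no external trainer is needed, and read_iob, ne_features, the class names, the B-ORG/B-PER labels and the toy sentences are all made up for illustration. Treat the feature function as a placeholder; a real one needs far more work.

```python
import nltk
from nltk.chunk import conlltags2tree


def read_iob(text):
    """Hypothetical helper: split 'word POS IOB' lines into sentences.
    A blank line ends a sentence."""
    sents, current = [], []
    for line in text.splitlines():
        if line.strip():
            word, pos, iob = line.split()
            current.append((word, pos, iob))
        elif current:
            sents.append(current)
            current = []
    if current:
        sents.append(current)
    return sents


def ne_features(sentence, i, history):
    """Placeholder feature function; sentence is a list of (word, pos)."""
    word, pos = sentence[i]
    return {
        "word": word.lower(),
        "pos": pos,
        "is_title": word.istitle(),
        "prev_pos": sentence[i - 1][1] if i > 0 else "<START>",
        "prev_iob": history[i - 1] if i > 0 else "<START>",
    }


class ConsecutiveNETagger(nltk.TaggerI):
    """Assigns one IOB label per token, left to right."""

    def __init__(self, train_sents):
        # train_sents: list of [((word, pos), iob), ...]
        train_set = []
        for tagged_sent in train_sents:
            untagged = nltk.tag.untag(tagged_sent)
            history = []
            for i, (_, iob) in enumerate(tagged_sent):
                train_set.append((ne_features(untagged, i, history), iob))
                history.append(iob)
        # Assumption: Naive Bayes instead of the book's maxent classifier.
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i in range(len(sentence)):
            features = ne_features(sentence, i, history)
            history.append(self.classifier.classify(features))
        return list(zip(sentence, history))


class NEChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # train_sents: list of [(word, pos, iob), ...]
        tagged = [[((w, p), c) for (w, p, c) in s] for s in train_sents]
        self.tagger = ConsecutiveNETagger(tagged)

    def parse(self, sentence):
        # sentence: list of (word, pos); returns an nltk.Tree
        tagged = self.tagger.tag(sentence)
        return conlltags2tree([(w, p, c) for ((w, p), c) in tagged])


# Toy data with made-up entity labels, only to show the expected shape.
TRAIN_TEXT = """\
Acme NNP B-ORG
Corp NNP I-ORG
hired VBD O
Alice NNP B-PER
. . O

Bob NNP B-PER
joined VBD O
Initech NNP B-ORG
. . O
"""

train_sents = read_iob(TRAIN_TEXT)
chunker = NEChunker(train_sents)
tree = chunker.parse(
    [("Alice", "NNP"), ("joined", "VBD"), ("Acme", "NNP"), (".", ".")]
)
print(tree)
```

With only two training sentences the predicted labels mean nothing; the point is the data flow from an IOB file to a trained chunker that returns a tree.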

You can also use NLTK's tagger (or another tagger) to add POS tags to your corpus, or you could take your chances and train a classifier on data without part-of-speech tags (just the IOB named entity categories). My guess is that POS tagging will improve performance, and you are much better off if the same POS tagger is used on the training data as for evaluation (and eventually production use).
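The first option, adding POS tags to a corpus that has only words and IOB labels, might look roughly like this. add_pos_tags is a made-up helper name, and the sample sentence and its B-PER/B-ORG labels are invented. The default tagger is nltk.pos_tag, which needs its model fetched once with nltk.download (the exact resource name depends on your NLTK version), so the usage lines below pass a trivial stand-in tagger instead.

```python
import nltk


def add_pos_tags(iob_sents, tagger=nltk.pos_tag):
    """Hypothetical helper.

    iob_sents: list of sentences, each a list of (word, iob) pairs.
    tagger:    callable taking a list of words and returning
               (word, pos) pairs; pass the same one you will use
               at evaluation and production time.
    Returns a list of sentences of (word, pos, iob) triples.
    """
    result = []
    for sent in iob_sents:
        words = [word for word, _ in sent]
        iobs = [iob for _, iob in sent]
        tagged = tagger(words)
        result.append([(w, p, c) for (w, p), c in zip(tagged, iobs)])
    return result


# Stand-in tagger so the example runs without downloading a model.
def dummy_tagger(words):
    return [(w, "NN") for w in words]


corpus = [[("Alice", "B-PER"), ("joined", "O"), ("Acme", "B-ORG")]]
with_pos = add_pos_tags(corpus, tagger=dummy_tagger)
print(with_pos)
```

Keeping the tagger as a parameter makes it easy to guarantee that training, evaluation and production all go through the same one.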
