Zelong

Reputation: 2556

Named entity recognition with NLTK or Stanford NER using custom corpus

I am trying to train an NER model for an Indian language, using a custom NE (named entity) dictionary for chunking. I have looked at NLTK and Stanford NER respectively:

  1. NLTK

I found that nltk.chunk.named_entity.NEChunkParser can be trained on a custom corpus. However, the format of the training corpus is not specified in the documentation or in the source code comments.

Where can I find a guide to the custom corpus format for NER in NLTK?

  2. Stanford NER

According to the question, the Stanford NER FAQ gives directions on how to train a custom NER model.

One major concern is that Stanford NER does not support Indian languages by default. So is it viable to feed an Indian-language NER corpus to the model?

Upvotes: 0

Views: 1222

Answers (1)

Rohan Amrute

Reputation: 774

Your training corpus needs to be a .tsv file.

The file should look something like this:

John PER
works O
at O
Intel ORG

This is just to illustrate the data layout, as I do not know which Indian language you are targeting. But your data must always be tab-separated values: the first column is the token and the second column is its associated label.
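As a rough illustration of that layout, here is a minimal Python sketch that writes token/label pairs as tab-separated lines. The function names, the `train.tsv` file name, and the blank line between sentences are my own assumptions for the example, not something Stanford NER mandates here; the file is written as UTF-8 so that non-Latin scripts survive.

```python
# Hypothetical sample data: each sentence is a list of (token, label) pairs.
SAMPLE_SENTENCES = [
    [("John", "PER"), ("works", "O"), ("at", "O"), ("Intel", "ORG")],
]


def to_tsv(sentences):
    """Return the corpus as text with one token<TAB>label pair per line."""
    lines = []
    for sentence in sentences:
        for token, label in sentence:
            lines.append(f"{token}\t{label}")
        # Assumption: a blank line marks the end of each sentence.
        lines.append("")
    return "\n".join(lines)


def write_tsv(path, sentences):
    """Write the corpus to `path` as UTF-8 so Indian-language tokens are preserved."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(to_tsv(sentences))


if __name__ == "__main__":
    write_tsv("train.tsv", SAMPLE_SENTENCES)
```

Replace the sample pairs with tokens and labels from your own annotated corpus; the label names are whatever you choose to use consistently.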

I have tried NER with my own custom data (in English, though) and built a model from it.

So I guess it is quite possible for Indian languages as well.
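For the training step, the Stanford NER FAQ mentioned in the question describes a properties file that points the trainer at the .tsv file. The sketch below only builds that file's text and the command line in Python. The property names and the `CRFClassifier` class are given as I recall them from the FAQ, so please verify them there; the file names and the jar path are placeholders I made up.

```python
def build_properties(train_file, model_file):
    """Return the text of a minimal Stanford NER training properties file."""
    props = {
        "trainFile": train_file,      # the tab-separated corpus
        "serializeTo": model_file,    # where the trained model is saved
        "map": "word=0,answer=1",     # column 0 is the token, column 1 the label
        # Feature flags as recalled from the FAQ example; tune for your language.
        "useClassFeature": "true",
        "useWord": "true",
        "useNGrams": "true",
        "noMidNGrams": "true",
        "maxNGramLeng": "6",
        "usePrev": "true",
        "useNext": "true",
        "useSequences": "true",
        "usePrevSequences": "true",
        "maxLeft": "1",
    }
    return "\n".join(f"{key} = {value}" for key, value in props.items()) + "\n"


def training_command(prop_file, jar="stanford-ner.jar"):
    """Return the java invocation as an argument list (jar path is a placeholder)."""
    return ["java", "-cp", jar, "edu.stanford.nlp.ie.crf.CRFClassifier",
            "-prop", prop_file]


if __name__ == "__main__":
    with open("ner.prop", "w", encoding="utf-8") as handle:
        handle.write(build_properties("train.tsv", "custom-ner-model.ser.gz"))
    print(" ".join(training_command("ner.prop")))
```

Nothing in this setup is specific to English, which is why I would expect it to work with an Indian-language corpus as long as the tokens are stored as UTF-8.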

Upvotes: 1
