Reputation: 2556
I am trying to train an NER model for an Indian language with a custom NE (named entity) dictionary for chunking. I looked at NLTK and Stanford NER respectively:
I found that nltk.chunk.named_entity.NEChunkParser
is able to train on a custom corpus. However, the format of the training corpus is not specified in the documentation or in the comments of the source code.
Where can I find a guide to building a custom corpus for NER in NLTK?
As for Stanford NER, its FAQ gives directions on how to train a custom NER model.
One of my major concerns is that the default Stanford NER does not support Indian languages. So is it viable to feed an Indian-language NER corpus to the model?
Upvotes: 0
Views: 1222
Reputation: 774
Your training corpus needs to be in a file with a .tsv
extension.
The file should look somewhat like this:
John PER
works O
at O
Intel ORG
This is just a representation of the data, as I do not know which Indian language you are targeting. But your data must always be tab-separated values: the first column is the token and the second is its associated label.
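To make the format concrete, here is a minimal Python sketch that writes token/label pairs to such a .tsv file. The sentence and labels are the illustrative ones from above; the filename train.tsv is a placeholder, and you would substitute your own tokenised Indian-language data.

```python
# Illustrative (token, label) pairs; replace with your own corpus.
tagged_tokens = [
    ("John", "PER"),
    ("works", "O"),
    ("at", "O"),
    ("Intel", "ORG"),
]

# One token per line, token and label separated by a tab.
with open("train.tsv", "w", encoding="utf-8") as f:
    for token, label in tagged_tokens:
        f.write(f"{token}\t{label}\n")
```

A sentence boundary is usually marked by a blank line between the token blocks.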
I have tried NER by building my own custom data (in English, though) and have built a model from it.
So I guess it is pretty much possible for Indian languages as well.
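Following the Stanford NER FAQ, training on such a .tsv is driven by a properties file. A minimal sketch is below; the filenames are placeholders, and the feature flags are the commonly cited defaults from the FAQ, which you may want to tune for your language:

```
trainFile = train.tsv
serializeTo = my-indian-ner-model.ser.gz
map = word=0,answer=1

useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 1
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
```

You would then train with something like `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop my.prop`, which serializes the trained model to the file named in serializeTo.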
Upvotes: 1