user4894151


Annotated Training data for NER corpus

The OpenNLP documentation mentions that we have to train our model on about 15,000 lines for good performance. Now, I have to extract several different entities from my documents, which means I have to add different tags to many tokens across those 15,000 lines of training data, and that will take a lot of time. Is there any other way to do this that would reduce the time, or any other method I could use instead?

Thanks.

Upvotes: 4

Views: 3377

Answers (4)

user439521

Reputation: 670

I am sorry, but there is really no good workaround here. We had to do this multiple times for our past projects; sometimes we were fortunate enough to have a team of labelers work for us to build the manually annotated dataset, and the rest of the time we did it ourselves.

Also, I am not sure you really require 15k data items. I would suggest starting with as few as 1-2k and testing the performance; depending on your particular case, you might be surprised by the results.
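To make that test concrete, here is a minimal sketch (assuming OpenNLP 1.8+ on the classpath and hypothetical train.txt / test.txt files in the OpenNLP name-finder training format) that trains on a small annotated set and prints precision, recall and F-measure on held-out data:

    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import opennlp.tools.namefind.*;
    import opennlp.tools.util.*;

    // Sketch only: train on a small annotated file, then score the model on a
    // held-out test file to see whether 1-2k sentences already give acceptable quality.
    public class QuickNerCheck {
        public static void main(String[] args) throws Exception {
            ObjectStream<NameSample> trainSamples = new NameSampleDataStream(
                    new PlainTextByLineStream(
                            new MarkableFileInputStreamFactory(new File("train.txt")),
                            StandardCharsets.UTF_8));

            // null type = keep whatever entity types appear in the training data
            TokenNameFinderModel model = NameFinderME.train(
                    "en", null, trainSamples,
                    TrainingParameters.defaultParams(), new TokenNameFinderFactory());

            ObjectStream<NameSample> testSamples = new NameSampleDataStream(
                    new PlainTextByLineStream(
                            new MarkableFileInputStreamFactory(new File("test.txt")),
                            StandardCharsets.UTF_8));

            TokenNameFinderEvaluator evaluator =
                    new TokenNameFinderEvaluator(new NameFinderME(model));
            evaluator.evaluate(testSamples);
            System.out.println(evaluator.getFMeasure());  // precision, recall, F1
        }
    }

If the scores on a small set are already close to what you need, you can stop annotating well before the 15k guideline.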

Now, to build your dataset: initially we used plain old Excel sheets, and it quickly turned into a nightmare. Excel is not designed for such tasks, and looking at thousands of lines of text and hand-annotating them in Excel is super painful.

Here are some of the tools I would recommend:

Dataturks: https://dataturks.com: A very easy-to-use online tool with an intuitive UI; you can have a team working on the dataset simultaneously. The output is fully compatible with OpenNLP, CoreNLP, etc.

GATE: http://gate.ac.uk/: A good old tool. Runs on your local machine and works well, though it is a little painful to set up.

BRAT: http://brat.nlplab.org/: An open-source, downloadable tool that does a good job of tagging.

Hope this helps, happy tagging :)

Upvotes: 1

demongolem

Reputation: 9708

Annotation takes time and requires someone familiar with the domain of the entities. There is no way around this problem.

At the end of the day, the annotations have to be in a format recognizable by OpenNLP. The basic format, taken from the OpenNLP documentation, is as follows:

The data can be converted to the OpenNLP name finder training format, which is one sentence per line. Some other formats are available as well. The sentence must be tokenized and contain spans which mark the entities. Documents are separated by empty lines, which trigger the reset of the adaptive feature generators. A training file can contain multiple types; if it does, the created model will also be able to detect these multiple types. For now it is recommended to only train single-type models, since multi-type support is still experimental.
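For illustration, annotated sentences in that format look like this (example adapted from the OpenNLP manual): tokens are separated by spaces and entity spans are wrapped in <START:type> ... <END> markers.

    <START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
    Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .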

So if you use one of the tools mentioned in the other answers, you need to make sure that OpenNLP can read its output format, or convert that output into something OpenNLP can recognize.
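If a tool exports something else, a small converter is usually all that is needed. Below is a minimal sketch under an assumed, hypothetical export format: one token and its entity type per line, tab-separated, plain type labels rather than BIO prefixes, "O" for non-entity tokens, and a blank line between sentences. The file names tokens.tsv and train.txt are made up.

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    // Sketch only: rewrite "token<TAB>type" lines into the OpenNLP
    // <START:type> ... <END> span format, one sentence per output line.
    // Note: adjacent entities of the same type merge into one span,
    // a limitation of plain type labels without BIO prefixes.
    public class ToOpenNlpFormat {
        public static void main(String[] args) throws IOException {
            List<String> lines = Files.readAllLines(Paths.get("tokens.tsv"), StandardCharsets.UTF_8);
            try (PrintWriter out = new PrintWriter("train.txt", "UTF-8")) {
                StringBuilder sentence = new StringBuilder();
                String openType = null;                      // type of the span currently open
                for (String line : lines) {
                    if (line.trim().isEmpty()) {             // blank line = sentence boundary
                        if (openType != null) { sentence.append("<END> "); openType = null; }
                        if (sentence.length() > 0) out.println(sentence.toString().trim());
                        sentence.setLength(0);
                        continue;
                    }
                    String[] cols = line.split("\t");
                    String token = cols[0];
                    String type = cols.length > 1 ? cols[1] : "O";
                    if (!type.equals(openType)) {            // type changed: close and/or open a span
                        if (openType != null) sentence.append("<END> ");
                        if (!type.equals("O")) sentence.append("<START:").append(type).append("> ");
                        openType = type.equals("O") ? null : type;
                    }
                    sentence.append(token).append(' ');
                }
                if (openType != null) sentence.append("<END> ");
                if (sentence.length() > 0) out.println(sentence.toString().trim());
            }
        }
    }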

Upvotes: 1

David Batista

Reputation: 3134

This one is also worth trying:

brat rapid annotation tool

I've used it myself and recommend it.

Upvotes: 2
