StepTNT

Reputation: 3967

Creating and training a model for OpenNlp using BRAT?

I may need to create a custom training set for OpenNLP, and this will require me to manually annotate a lot of entries.

To make things easier, a GUI solution may be the best idea (manually writing annotation tags isn't much fun), and I've just discovered BRAT, which looks like what I need.

BRAT can export an annotated file (.ann), but I can't find any reference to this file type in OpenNLP's manual, and I'm not sure it will work.

What I'd like to do is export this annotated file from BRAT and use it to train an OpenNLP model; I don't really care whether this is done through code or the CLI.

Can someone point me in the right direction?

Upvotes: 3

Views: 1156

Answers (1)

Joern

Reputation: 186

OpenNLP has native support for the BRAT format for training and evaluation of the Name Finder. Other components are not supported currently. Adding support for other components would probably not be difficult; if you are interested, ask for it on the opennlp-dev list.

The CLI can be used to train a model from BRAT data; running the following command will show you the usage:

  • bin/opennlp TokenNameFinderTrainer.brat

The following arguments are mandatory to train a model (a full example invocation follows the list):

  • bratDataDir: should point to a folder containing your .ann and .txt files
  • annotationConfig: has to point to the config file BRAT uses for your annotation project
  • lang: the language of your text documents (e.g. en)
  • model: the name of the model file to create
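Putting those together, a complete training invocation might look like the following sketch; the data folder, config file path, and model name are placeholders you would replace with your own:

  • bin/opennlp TokenNameFinderTrainer.brat -bratDataDir brat-project -annotationConfig brat-project/annotation.conf -lang en -model en-custom-ner.bin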

The Name Finder needs its input split into sentences and tokens. By default it assumes one sentence per line and applies whitespace tokenization. This behavior can be adjusted with the ruleBasedTokenizer or tokenizerModel arguments. Additionally, it is possible to use a custom sentence detector model via the sentenceDetectorModel argument.
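As a sketch, if your text is not already one sentence per line, you could pass pretrained sentence detector and tokenizer models to the trainer; the model file names below (en-sent.bin, en-token.bin) are assumptions based on the standard OpenNLP pretrained models:

  • bin/opennlp TokenNameFinderTrainer.brat -bratDataDir brat-project -annotationConfig brat-project/annotation.conf -lang en -sentenceDetectorModel en-sent.bin -tokenizerModel en-token.bin -model en-custom-ner.bin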

To evaluate your model, the cross-validation and evaluation tools can be used in a similar way by appending .brat to their names.

  • bin/opennlp TokenNameFinderCrossValidator.brat
  • bin/opennlp TokenNameFinderEvaluator.brat
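For instance, assuming the evaluator accepts the same BRAT data arguments as the trainer plus the trained model file, an evaluation run could look like this (paths and file names are placeholders):

  • bin/opennlp TokenNameFinderEvaluator.brat -model en-custom-ner.bin -bratDataDir brat-eval-data -annotationConfig brat-project/annotation.conf -lang en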

To speed up your annotation project you can use the opennlp-brat-annotator. It can load a Name Finder model and integrates with BRAT to automatically pre-annotate your documents. You can find that component in the opennlp sandbox.

Upvotes: 3
