I am new to OpenNLP and need help customizing the POS tagger. I have used the OpenNLP POS tagger with the pre-trained model en-pos-maxtent.bin to tag new raw English sentences with the corresponding parts of speech. Now I would like to customize the tags.

Example sentence: Dog jumped over the wall.

After POS tagging with en-pos-maxtent.bin, the result is:

Dog - NNP
jumped - VBD
over - IN
the - DT
wall - NN

But I want to train my own model and tag the words with my custom tags, like:

Dog - PERP
jumped - ACT
over - OTH
the - OTH
wall - OBJ

where PERP, ACT, OTH and OBJ are the tags that suit my needs. Is this possible?

I checked their documentation, which gives code to train a model and use it later on. The code goes like this:

try {
  dataIn = new FileInputStream("en-pos.train");
  ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
  ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);

  model = POSTaggerME.train("en", sampleStream, TrainingParameters.defaultParams(), null, null);
}
catch (IOException e) {
  // Failed to read or parse training data, training failed
  e.printStackTrace();
}

I am not able to understand what this "en-pos.train" is. What is the format of this file? Can we specify the custom tags here, or what exactly is this file?

Any help would be appreciated. Thanks.

How to create our own training data for the OpenNLP parser

Reputation: 670

Here is a detailed tutorial with full code:

https://dataturks.com/blog/opennlp-pos-tagger-training-java-example.php

Depending on your domain, you can build a dataset either automatically or manually. Building such a dataset by hand can be really painful; annotation tools for POS tagging can make the process much easier.

Training data format

Training data is passed as a text file in which each line is one data item. Each word in the line is labeled in the form "word_LABEL", with the word and the label name separated by an underscore '_'.

anki_Brand overdrive_Brand
just_ModelName dance_ModelName 2018_ModelName
aoc_Brand 27"_ScreenSize monitor_Category
horizon_ModelName zero_ModelName dawn_ModelName
cm_Unknown 700_Unknown modem_Category
computer_Category
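
For the custom tags in the question, a file in this format could be generated from tokens that have already been labeled. The following is a minimal sketch using only the Java standard library. The class `TrainingLineWriter` and its method `toLine` are hypothetical names, not part of OpenNLP, and rejecting underscores in tags is a conservative assumption made here to keep the word/tag separator unambiguous.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

public class TrainingLineWriter {

    // Builds one training line such as "Dog_PERP jumped_ACT" from parallel lists.
    public static String toLine(List<String> words, List<String> tags) {
        if (words.size() != tags.size()) {
            throw new IllegalArgumentException("Each word needs exactly one tag");
        }
        StringJoiner line = new StringJoiner(" ");
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            String tag = tags.get(i);
            // Whitespace inside a word or tag would break the one-token-per-item layout.
            if (word.isEmpty() || hasWhitespace(word)) {
                throw new IllegalArgumentException("Invalid word: '" + word + "'");
            }
            // Assumption: tags never contain '_' so the separator stays unambiguous.
            if (tag.isEmpty() || hasWhitespace(tag) || tag.contains("_")) {
                throw new IllegalArgumentException("Invalid tag: '" + tag + "'");
            }
            line.add(word + "_" + tag);
        }
        return line.toString();
    }

    private static boolean hasWhitespace(String s) {
        return s.chars().anyMatch(Character::isWhitespace);
    }

    public static void main(String[] args) {
        // The example sentence and custom tags from the question.
        System.out.println(toLine(
                Arrays.asList("Dog", "jumped", "over", "the", "wall"),
                Arrays.asList("PERP", "ACT", "OTH", "OTH", "OBJ")));
    }
}
```

Writing one such line per sentence to a UTF-8 text file gives a file that plays the role of "en-pos.train" in the question.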

Train model

The important class here is POSModel, which holds the actual model. We use the class POSTaggerME to build the model. Below is the code to build a model from a training data file:

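import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerFactory;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.postag.WordTagSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public POSModel train(String filepath) {
  TrainingParameters parameters = TrainingParameters.defaultParams();
  parameters.put(TrainingParameters.ITERATIONS_PARAM, "100");

  // Open a fresh stream on every call so the data can be re-read if the stream is reset.
  InputStreamFactory dataIn = () -> new FileInputStream(filepath);

  // try-with-resources closes the sample stream (and the underlying file) when training ends.
  try (ObjectStream<POSSample> sampleStream = new WordTagSampleStream(
      new PlainTextByLineStream(dataIn, StandardCharsets.UTF_8))) {
    return POSTaggerME.train("en", sampleStream, parameters, new POSTaggerFactory());
  } catch (IOException e) {
    // Failed to read or parse the training data, so no model was built.
    e.printStackTrace();
    return null;
  }
}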

Use model to do tagging

Finally, we can see how the model can be used to tag unseen queries:

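import java.util.Arrays;
import java.util.List;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.util.Sequence;

public void doTagging(POSModel model, String input) {
    // Tokenize on whitespace once and reuse the tokens for tagging and printing.
    String[] tokens = input.trim().split("\\s+");
    POSTaggerME tagger = new POSTaggerME(model);
    // topKSequences returns the best-scoring tag sequences for the sentence.
    Sequence[] sequences = tagger.topKSequences(tokens);
    for (Sequence s : sequences) {
        List<String> tags = s.getOutcomes();
        System.out.println(Arrays.asList(tokens) + " => " + tags);
    }
}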

Upvotes: 0

Daniel Naber

Reputation: 1664

It's documented at http://opennlp.apache.org/documentation/manual/opennlp.html#tools.postagger.training - one sentence per line, and the words are separated from their tags by an underscore:

About_IN 10_CD Euro_NNP ,_, I_PRP reckon_VBP ._.
That_DT sounds_VBZ good_JJ ._.
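
To see how such a line breaks down into words and tags, and that the format itself does not restrict the tag names to the Penn Treebank set, here is a small standard-library sketch. The class `TaggedLineParser` is a hypothetical name. Splitting at the last underscore is an assumption made here so that a word containing an underscore still parses; it is not a statement about OpenNLP's internals.

```java
import java.util.ArrayList;
import java.util.List;

public class TaggedLineParser {

    // Splits "word_TAG word_TAG ..." into {word, tag} pairs.
    public static List<String[]> parse(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            // Assumption: the tag is everything after the last underscore.
            int split = token.lastIndexOf('_');
            if (split <= 0 || split == token.length() - 1) {
                throw new IllegalArgumentException("Expected word_TAG but got: " + token);
            }
            pairs.add(new String[] {token.substring(0, split), token.substring(split + 1)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        // A line using the custom tags from the question instead of NNP, VBD, ...
        List<String[]> pairs = parse("Dog_PERP jumped_ACT over_OTH the_OTH wall_OBJ");
        System.out.println(pairs.size() + " tagged tokens, first tag: " + pairs.get(0)[1]);
    }
}
```

Running a check like this over a hand-built training file before training can catch tokens that are missing a tag.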

Upvotes: 4
