yash6

Reputation: 141

How to create our own training data for the OpenNLP parser

I am new to OpenNLP and need help customizing the parser.

I have used the OpenNLP parser with the pre-trained model en-pos-maxtent.bin to tag new raw English sentences with their parts of speech. Now I would like to customize the tags.

Example sentence: Dog jumped over the wall.

After POS tagging with en-pos-maxtent.bin, the result is:

Dog - NNP

jumped - VBD

over - IN

the - DT

wall - NN

But I want to train my own model and tag the words with my custom tags, like this:

Dog - PERP

jumped - ACT

over - OTH

the - OTH

wall - OBJ

where PERP, ACT, OTH, and OBJ are tags that suit my needs. Is this possible?

I checked the relevant section of the documentation. It gives code to train a model and use it later, which looks like this:

try {
  dataIn = new FileInputStream("en-pos.train");
  ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
  ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);

  model = POSTaggerME.train("en", sampleStream, TrainingParameters.defaultParams(), null, null);
}
catch (IOException e) {
  // Failed to read or parse training data, training failed
  e.printStackTrace();
}

I don't understand what this "en-pos.train" is.

What is the format of this file? Can we specify custom tags in it?

Any help would be appreciated.

Thanks

Upvotes: 4

Views: 4354

Answers (2)

user439521

Reputation: 670

Here is a detailed tutorial with full code:

https://dataturks.com/blog/opennlp-pos-tagger-training-java-example.php

Depending on your domain, you can build a dataset either automatically or manually. Building such a dataset manually can be really painful; tools such as a POS tagger can make the process much easier.

Training data format

Training data is passed as a text file in which each line is one data item. Each word in the line is labeled in the format "word_LABEL": the word and the label name are separated by an underscore '_'.

anki_Brand overdrive_Brand
just_ModelName dance_ModelName 2018_ModelName
aoc_Brand 27"_ScreenSize monitor_Category
horizon_ModelName zero_ModelName dawn_ModelName
cm_Unknown 700_Unknown modem_Category
computer_Category
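
Before training, it can help to check that every token in the file really follows the word_LABEL pattern and to see which labels are in use. The following is only a sketch using plain Java: the class TrainingDataCheck and the method collectLabels are hypothetical names, not part of OpenNLP, and splitting at the last underscore is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of OpenNLP: checks lines in the word_LABEL format.
public class TrainingDataCheck {

    // Returns the distinct labels used in the given lines, sorted alphabetically.
    // Throws IllegalArgumentException when a token is missing its word or its label.
    public static Set<String> collectLabels(List<String> lines) {
        Set<String> labels = new TreeSet<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            for (String token : trimmed.split("\\s+")) {
                // Assumption: the label starts after the last underscore.
                int split = token.lastIndexOf('_');
                if (split <= 0 || split == token.length() - 1) {
                    throw new IllegalArgumentException("Malformed token: " + token);
                }
                labels.add(token.substring(split + 1));
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("anki_Brand overdrive_Brand");
        lines.add("aoc_Brand 27\"_ScreenSize monitor_Category");
        // prints [Brand, Category, ScreenSize]
        System.out.println(collectLabels(lines));
    }
}
```

A label that appears only once or twice in the output is often a typo in the training file.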

Train model

The important class here is POSModel, which holds the actual model. We use the class POSTaggerME to build the model. Below is the code to build a model from a training data file:

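import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerFactory;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.postag.WordTagSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public POSModel train(String filepath) {
  TrainingParameters parameters = TrainingParameters.defaultParams();
  parameters.put(TrainingParameters.ITERATIONS_PARAM, "100");

  try {
    // The factory reopens the file whenever the stream is reset,
    // so the training data can safely be read more than once.
    InputStreamFactory dataIn = new MarkableFileInputStreamFactory(new File(filepath));
    try (ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, StandardCharsets.UTF_8);
         ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream)) {
      // "en" is the language code stored in the model; the labels come from the training file.
      return POSTaggerME.train("en", sampleStream, parameters, new POSTaggerFactory());
    }
  }
  catch (IOException e) {
    // Failed to read or parse the training data, so training failed
    e.printStackTrace();
  }
  return null;
}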

Use model to do tagging.

Finally, we can see how the model can be used to tag unseen queries:

public void doTagging(POSModel model, String input) {
    // Split the input into tokens once, on any run of whitespace
    String[] tokens = input.trim().split("\\s+");
    POSTaggerME tagger = new POSTaggerME(model);
    // topKSequences returns several candidate tag sequences, best first
    Sequence[] sequences = tagger.topKSequences(tokens);
    for (Sequence s : sequences) {
        List<String> tags = s.getOutcomes();
        System.out.println(Arrays.asList(tokens) + " => " + tags);
    }
}

Upvotes: 0

Daniel Naber

Reputation: 1664

It's documented at http://opennlp.apache.org/documentation/manual/opennlp.html#tools.postagger.training - one sentence per line, and the words are separated from their tags by an underscore:

About_IN 10_CD Euro_NNP ,_, I_PRP reckon_VBP ._.
That_DT sounds_VBZ good_JJ ._.
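
Applied to the custom tags from the question, the same format could be produced with a few lines of plain Java. This is only a sketch: TrainingLineBuilder and toLine are hypothetical names, not part of OpenNLP. The question does not give a tag for the final period, so it is left out here; in a real training file it would need to be its own token with a tag of your choosing.

```java
import java.util.StringJoiner;

// Hypothetical helper, not part of OpenNLP: formats one sentence as a training line.
public class TrainingLineBuilder {

    // Joins each token with its tag using an underscore; pairs are separated by single spaces.
    public static String toLine(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        StringJoiner line = new StringJoiner(" ");
        for (int i = 0; i < tokens.length; i++) {
            line.add(tokens[i] + "_" + tags[i]);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"Dog", "jumped", "over", "the", "wall"};
        String[] tags = {"PERP", "ACT", "OTH", "OTH", "OBJ"};
        // prints: Dog_PERP jumped_ACT over_OTH the_OTH wall_OBJ
        System.out.println(toLine(tokens, tags));
    }
}
```

Each such line, one sentence per line, would go into the training file that is passed to the trainer.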

Upvotes: 4
