yash6

Reputation: 141

Training a parts-of-speech tagger in OpenNLP

I am trying to train the OpenNLP POS tagger so that it tags the words in a sentence according to my own vocabulary. For example:

After normal POS tagging:

sentence: NodeManager/NNP failed/VBD to/TO start/VB the/DT server/NN

After POS tagging with my model:

sentence: NodeManager/AGENT failed/OTHER to/OTHER start/OTHER the/OTHER server/OBJECT

where AGENT, OTHER and OBJECT are the tags that I defined.

So basically I am defining my own tag dictionary, and I want the POS tagger to use my model.

When I checked the Apache documentation for how to do this, I found the code below:

POSModel model = null;

InputStream dataIn = null;
try {
  dataIn = new FileInputStream("en-pos.train");
  ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
  ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);

  model = POSTaggerME.train("en", sampleStream, TrainingParameters.defaultParams(), null, null);
}
catch(IOException e)
{
   e.printStackTrace();
}
finally {
  if (dataIn != null) {
    try {
      dataIn.close();
    }
    catch (IOException e) {
      // Not an issue, training already finished.
      // The exception should be logged and investigated
      // if part of a production system.
      e.printStackTrace();
    }
  }
}

Here they open a FileInputStream on en-pos.train. I guess en-pos.train is a .bin file like the ones used earlier in the documentation, just customized. Can someone tell me how to get the .bin file for it?

Or where is en-pos.train? What exactly is it, and how do I create it?

I extracted the .bin file that they normally use, en-pos-maxent.bin. It contains an XML file where the tag dictionary is defined, a model file, and a properties file. I have changed them according to my needs, but my problem is generating the .bin file from those contents.

Upvotes: 1

Views: 1872

Answers (2)

user439521

Reputation: 670

It's pretty simple to do:

Once you train your own model, dump it to a file (call it whatever you want):

public void writeToFile(POSModel model, String modelOutpath) {
    try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelOutpath))) {
        model.serialize(modelOut);
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}

Then load the file as shown below:

public POSModel getModel(String modelPath) {
    // try-with-resources closes the stream once the model has been read
    try (InputStream modelIn = new FileInputStream(modelPath)) {
        return new POSModel(modelIn);
    }
    catch (Exception e) {
        e.printStackTrace();
    }
    // Loading failed, so there is no model to return
    return null;
}

Now you can use the loaded model and do tagging.

public void doTagging(POSModel model, String input) {
    String[] tokens = input.trim().split(" ");
    POSTaggerME tagger = new POSTaggerME(model);
    // topKSequences returns several candidate tag sequences for the sentence
    Sequence[] sequences = tagger.topKSequences(tokens);
    for (Sequence s : sequences) {
        List<String> tags = s.getOutcomes();
        System.out.println(Arrays.asList(tokens) + " =>" + tags);
    }
}
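
To get output in the word/TAG form used in the question, the tokens and the tags of one sequence can be paired up. A small sketch in plain Java with no OpenNLP calls; the names TagFormatSketch and format are made up for illustration, and the tag list stands in for what s.getOutcomes() returns above.

```java
import java.util.Arrays;
import java.util.List;

public class TagFormatSketch {

    // Pairs each token with its tag as "token/TAG", separated by single spaces.
    public static String format(String[] tokens, List<String> tags) {
        if (tokens.length != tags.size()) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(tokens[i]).append('/').append(tags.get(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical tagger output for the sentence from the question
        String[] tokens = "NodeManager failed to start the server".split(" ");
        List<String> tags = Arrays.asList("AGENT", "OTHER", "OTHER", "OTHER", "OTHER", "OBJECT");
        System.out.println(format(tokens, tags));
    }
}
```

Inside doTagging this would be called as format(tokens, s.getOutcomes()) for each sequence.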

Here is a detailed tutorial with the full code on how to train and use your own Open NLP based POS tagger:

https://dataturks.com/blog/opennlp-pos-tagger-training-java-example.php

Upvotes: 0

andrew.butkus

Reputation: 777

http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.postagger.training.tool

Take a look here: you can create your .bin file directly with the opennlp command-line application, and the commands are given on that page.
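
For reference, en-pos.train in the question is not a .bin model but a plain-text training file. As far as I know, the format described in the linked manual section is one sentence per line, with each token written as word_tag and tokens separated by spaces. A minimal sketch that writes such a file with the custom tags from the question follows; the class name TrainingFileSketch and the helper toTrainingLine are made up for illustration, and a real training set would need many more sentences than this single line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

public class TrainingFileSketch {

    // Joins parallel word/tag arrays into one training line: "word_TAG word_TAG ...".
    public static String toTrainingLine(String[] words, String[] tags) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                line.append(' ');
            }
            line.append(words[i]).append('_').append(tags[i]);
        }
        return line.toString();
    }

    public static void main(String[] args) throws IOException {
        // The example sentence and custom tags from the question
        String[] words = {"NodeManager", "failed", "to", "start", "the", "server"};
        String[] tags = {"AGENT", "OTHER", "OTHER", "OTHER", "OTHER", "OBJECT"};

        // One sentence per line; the file name matches the one used in the question
        List<String> lines = Collections.singletonList(toTrainingLine(words, tags));
        Path out = Paths.get("en-pos.train");
        Files.write(out, lines, StandardCharsets.UTF_8);
        System.out.println(lines.get(0));
    }
}
```

The resulting file is what the training code in the question reads through WordTagSampleStream. If I remember the manual correctly, the command-line trainer takes it along the lines of opennlp POSTaggerTrainer -model my-pos.bin -lang en -data en-pos.train -encoding UTF-8, but check the linked page for the exact flags in your version.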

Upvotes: 1
