Reputation: 141
I am trying to train the OpenNLP POS tagger so that it tags the words in a sentence according to my own vocabulary. For example:
After normal POS tagging:
sentence: NodeManager/NNP failed/VBD to/TO start/VB the/DT server/NN
After POS tagging with my model:
sentence: NodeManager/AGENT failed/OTHER to/OTHER start/OTHER the/OTHER server/OBJECT
where AGENT, OTHER and OBJECT are the tags that I defined.
So basically I am defining my own tag dictionary, and I want the POS tagger to use my model.
When I checked the Apache documentation for how to do this, I found the code below:
POSModel model = null;
InputStream dataIn = null;
try {
    dataIn = new FileInputStream("en-pos.train");
    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
    ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
    model = POSTaggerME.train("en", sampleStream, TrainingParameters.defaultParams(), null, null);
}
catch (IOException e) {
    e.printStackTrace();
}
finally {
    if (dataIn != null) {
        try {
            dataIn.close();
        }
        catch (IOException e) {
            // Not an issue, training already finished.
            // The exception should be logged and investigated
            // if part of a production system.
            e.printStackTrace();
        }
    }
}
Here, when they open the FileInputStream on en-pos.train, I guess en-pos.train is a .bin file like the ones used earlier in the documentation, only customized. Can someone tell me how to get the .bin file for it?
Or where is en-pos.train? What exactly is it, and how do I create it?
I extracted the .bin file that is normally used, en-pos-maxent.bin. It contains the XML file where the tag dictionary is defined, a model file and a properties file. I have changed them according to my needs, but my problem is generating the .bin file from those contents.
Upvotes: 1
Views: 1872
Reputation: 670
It's pretty simple to do:
Once you train your own model, dump it to a file (call it whatever you want):
public void writeToFile(POSModel model, String modelOutpath) {
    // try-with-resources flushes and closes the stream after serialization
    try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelOutpath))) {
        model.serialize(modelOut);
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}
Then load the file as shown below:
public POSModel getModel(String modelPath) {
    // try-with-resources closes the stream once the model has been read
    try (InputStream modelIn = new FileInputStream(modelPath)) {
        return new POSModel(modelIn);
    }
    catch (Exception e) {
        e.printStackTrace();
    }
    // Loading failed: the caller has to check for null
    return null;
}
Now you can use the loaded model and do tagging.
public void doTagging(POSModel model, String input) {
    input = input.trim();
    POSTaggerME tagger = new POSTaggerME(model);
    // topKSequences returns several candidate tag sequences for the tokens
    Sequence[] sequences = tagger.topKSequences(input.split(" "));
    for (Sequence s : sequences) {
        List<String> tags = s.getOutcomes();
        System.out.println(Arrays.asList(input.split(" ")) + " =>" + tags);
    }
}
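If you want the output in the word/TAG form used in the question rather than two separate lists, you can pair each token with its tag yourself. The following is only a minimal sketch using plain Java; the class name TaggedSentenceFormatter and its format method are hypothetical helpers made up for illustration and are not part of OpenNLP.

```java
import java.util.List;

// Hypothetical helper (not part of OpenNLP): joins tokens and tags as word/TAG pairs.
final class TaggedSentenceFormatter {

    private TaggedSentenceFormatter() {
    }

    // Pairs tokens[i] with tags.get(i) and separates the pairs with single spaces.
    static String format(String[] tokens, List<String> tags) {
        if (tokens.length != tags.size()) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(tokens[i]).append('/').append(tags.get(i));
        }
        return sb.toString();
    }
}
```

Inside doTagging you could then call something like TaggedSentenceFormatter.format(input.split(" "), s.getOutcomes()) for each sequence instead of printing the two lists.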
Here is a detailed tutorial with the full code on how to train and use your own OpenNLP-based POS tagger:
https://dataturks.com/blog/opennlp-pos-tagger-training-java-example.php
Upvotes: 0
Reputation: 777
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.postagger.training.tool
Take a look at the link above: you can create your .bin file directly with the opennlp command-line application. The commands are given on that page.
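On what en-pos.train is: as far as the manual describes it, this is not a .bin file but a plain-text training file, with one sentence per line and each token joined to its tag by an underscore (for example NodeManager_AGENT failed_OTHER ...). That is the format WordTagSampleStream reads. Here is a rough sketch of how such a file could be produced from your own tagged sentences; the class TrainingDataWriter and its methods are hypothetical names chosen for illustration, and only the standard Java library is used.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper (not part of OpenNLP): builds lines in the word_tag training format.
final class TrainingDataWriter {

    private TrainingDataWriter() {
    }

    // Turns one sentence into a single training line such as "NodeManager_AGENT failed_OTHER".
    static String toTrainingLine(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(tokens[i]).append('_').append(tags[i]);
        }
        return sb.toString();
    }

    // Writes one sentence per line as UTF-8, matching the encoding used in the question's code.
    static void write(Path out, List<String> trainingLines) throws IOException {
        Files.write(out, trainingLines, StandardCharsets.UTF_8);
    }
}
```

A file written this way (say, en-pos.train) can then be passed either to the training code from the question or to the POSTaggerTrainer command-line tool described at the link, and the result of training is the .bin model. You will need far more than a handful of sentences for the model to be useful.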
Upvotes: 1