Reputation: 141
I am new to opennlp , need help to customize the parser
I have the used the opennlp parser with the pre-trained model en-pos-maxtent.bin to tag new raw english sentences with the corresponding parts fo speech, now i would like to customize the tags.
example sentence: Dog jumped over the wall.
after POS tagging by using en-pos-maxtent.bin , the result would be
Dog - NNP
jumped - VBD
over - IN
the - DT
wall - NN
but i want to train my own model and tag the words with my custom tags like
DOG - PERP
jumped - ACT
over - OTH
the - OTH
wall - OBJ
where PERP, ACT,OTH,OBJ are the tags that suit my necessities. is this possible ?
I checked the section of their documentation, they have given code to train a model and use it later on , the code goes like this
try {
dataIn = new FileInputStream("en-pos.train");
ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
model = POSTaggerME.train("en", sampleStream, TrainingParameters.defaultParams(), null, null);
}
catch (IOException e) {
// Failed to read or parse training data, training failed
e.printStackTrace();
}
I am not able to understand what this "en-pos.train" is ?
what is the format of this file ? can we specify the custom tags here or what exactly this file is ?
any help would be appreciated
Thanks
Upvotes: 4
Views: 4354
Reputation: 670
Here is a detailed tutorial with full code:
https://dataturks.com/blog/opennlp-pos-tagger-training-java-example.php
Depending upon your domain, you can build a dataset either automatically or manually. Building such a dataset manually can be really painful, tools like POS tagger can help make the process much easier.
Training data format
Training data is passed as a text file where each line is one data item. Each word in the line should be labeled in a format like "word_LABEL", the word and the label name is separated by an underscore '_'.
anki_Brand overdrive_Brand
just_ModelName dance_ModelName 2018_ModelName
aoc_Brand 27"_ScreenSize monitor_Category
horizon_ModelName zero_ModelName dawn_ModelName
cm_Unknown 700_Unknown modem_Category
computer_Category
Train model
The important class here is POSModel, which holds the actual model. We use class POSTaggerME to do the model building. Below is the code to build a model from training data file
public POSModel train(String filepath) {
POSModel model = null;
TrainingParameters parameters = TrainingParameters.defaultParams();
parameters.put(TrainingParameters.ITERATIONS_PARAM, "100");
try {
try (InputStream dataIn = new FileInputStream(filepath)) {
ObjectStream<String> lineStream = new PlainTextByLineStream(new InputStreamFactory() {
@Override
public InputStream createInputStream() throws IOException {
return dataIn;
}
}, StandardCharsets.UTF_8);
ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
model = POSTaggerME.train("en", sampleStream, parameters, new POSTaggerFactory());
return model;
}
}
catch (Exception e) {
e.printStackTrace();
}
return null;
}
Use model to do tagging.
Finally, we can see how the model can be used to tag unseen queries:
public void doTagging(POSModel model, String input) {
input = input.trim();
POSTaggerME tagger = new POSTaggerME(model);
Sequence[] sequences = tagger.topKSequences(input.split(" "));
for (Sequence s : sequences) {
List<String> tags = s.getOutcomes();
System.out.println(Arrays.asList(input.split(" ")) +" =>" + tags);
}
}
Upvotes: 0
Reputation: 1664
It's documented at http://opennlp.apache.org/documentation/manual/opennlp.html#tools.postagger.training - one sentence per line, and the words are separated from their tags by an underscore:
About_IN 10_CD Euro_NNP ,_, I_PRP reckon_VBP ._.
That_DT sounds_VBZ good_JJ ._.
Upvotes: 4