Reputation: 902
I am using Apache OpenNLP for one of my projects. I am creating a new model to identify locations, since the pre-trained model (en-ner-location.bin) does not recognize the location I need.
Here is the code:
package com.equinox.nlp;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.Span;

public class NlpTesting {

    protected Map<String, NameFinderME> finders;
    protected Tokenizer tokenizer;

    public static void main(String[] args) throws InvalidFormatException, IOException {
        String bankura = "In the 2011 census, Bankura municipality had a population of 138,036, out of which 70,734 were males and 67,302 were females.";
        String london = "London is the capital city of England and the United Kingdom.";
        NlpTesting nlpTesting = new NlpTesting();

        // Try the pre-trained location model first.
        NameFinderME nameFinderA = nlpTesting.createNameFinder("./opennlp-models/en-ner-location.bin");
        nlpTesting.findLocation(london, nameFinderA);
        System.out.println("--------------------------");
        nlpTesting.findLocation(bankura, nameFinderA);

        // Train the custom model, then run it on the same sentence.
        nlpTesting.train();
        NameFinderME nameFinderB = nlpTesting.createNameFinder("./opennlp-models/en-ner-custom-location.bin");
        nlpTesting.findLocation(bankura, nameFinderB);
    }

    public String findLocation(String str, NameFinderME nameFinder) throws InvalidFormatException, IOException {
        tokenizer = SimpleTokenizer.INSTANCE;
        String[] tokens = tokenizer.tokenize(str);
        Span[] nameSpans = nameFinder.find(tokens);

        // Join every token of each span (the end index is exclusive) so that
        // multi-word names such as "United Kingdom" are not cut to one token.
        HashSet<String> locationSet = new HashSet<String>();
        for (Span span : nameSpans) {
            StringBuilder name = new StringBuilder();
            for (int i = span.getStart(); i < span.getEnd(); i++) {
                if (name.length() > 0) {
                    name.append(' ');
                }
                name.append(tokens[i]);
            }
            locationSet.add(name.toString());
        }

        StringBuilder commaSeparatedLocationNames = new StringBuilder();
        for (String location : locationSet) {
            commaSeparatedLocationNames.append(location).append(",");
        }
        System.out.println(commaSeparatedLocationNames);
        return commaSeparatedLocationNames.toString();
    }

    public void train() throws IOException {
        File trainerFile = new File("./train/train.txt");
        File output = new File("./opennlp-models/en-ner-custom-location.bin");
        ObjectStream<String> lineStream = new PlainTextByLineStream(
                new FileInputStream(trainerFile), "UTF-8");
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
        System.out.println("lineStream = " + lineStream);

        TokenNameFinderModel model;
        try {
            model = NameFinderME.train("en", "location", sampleStream,
                    Collections.<String, Object> emptyMap());
        } finally {
            // Release the training data stream once training has finished.
            sampleStream.close();
        }

        BufferedOutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(output));
            model.serialize(modelOut);
        } finally {
            if (modelOut != null) {
                modelOut.close();
            }
        }
    }

    public NameFinderME createNameFinder(String str) throws InvalidFormatException,
            FileNotFoundException, IOException {
        FileInputStream modelIn = new FileInputStream(new File(str));
        try {
            return new NameFinderME(new TokenNameFinderModel(modelIn));
        } finally {
            // The model is fully loaded by the constructor, so the file can be closed.
            modelIn.close();
        }
    }
}
So far, it works fine.
The issue is that I am unable to add another location to the custom model I have created, so I went through the OpenNLP README.
There, it says: "Note: In order to train a model you need all the training data. There is not currently a mechanism to update the models distributed with the project with additional data."
Does that mean I will not be able to update my custom models either? Is there any way to do this? I may well not have all the data when I first create a model, so there should be an option to update it later. Please help me.
Upvotes: 1
Views: 428
Reputation: 11474
It means exactly what it says: you will need to retrain your entire model from scratch every time you want to add new training instances.
If you need to update models without retraining then OpenNLP is not the right tool for your task.
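In practice, retraining from scratch mostly means keeping every annotated sentence you have ever used in one cumulative training file and rebuilding the model from it whenever new examples arrive. The following is only a minimal sketch of that bookkeeping, using plain Java file I/O and no OpenNLP calls. The class `TrainingCorpus` and its method `appendSamples` are hypothetical names made up for illustration; they are not part of OpenNLP.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not an OpenNLP class): maintains one cumulative training file. */
class TrainingCorpus {

    /**
     * Adds annotated sentences that the corpus file does not already contain
     * and rewrites the file. Returns the number of sentences actually added.
     */
    static int appendSamples(Path corpus, List<String> newSamples) throws IOException {
        // Load what is already there; a missing file counts as an empty corpus.
        List<String> lines = new ArrayList<String>();
        if (Files.exists(corpus)) {
            lines.addAll(Files.readAllLines(corpus, StandardCharsets.UTF_8));
        }
        Set<String> seen = new HashSet<String>(lines);

        int added = 0;
        for (String sample : newSamples) {
            String line = sample.trim();
            // Skip blank input and sentences already present in the corpus.
            if (!line.isEmpty() && seen.add(line)) {
                lines.add(line);
                added++;
            }
        }

        // Rewrite the whole file so that every line ends with a line separator.
        Files.write(corpus, lines, StandardCharsets.UTF_8);
        return added;
    }
}
```

Assuming the file is the same `./train/train.txt` that `train()` in the question reads, you would call `TrainingCorpus.appendSamples(...)` with the new sentences and then run `train()` again to regenerate `en-ner-custom-location.bin`. Each sentence goes on its own line, whitespace-tokenized, with names marked in the name finder's training format, for example `In the 2011 census , <START:location> Bankura <END> municipality had a population of 138,036 .`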
Upvotes: 2