Reputation: 11
I am training a model for named entity recognition, but it does not properly identify the names of persons.
My training data looks like this:
<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 . A nonexecutive director has many similar responsibilities as an executive director.However, there are no voting rights with this position.
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V., the Dutch publishing group.
The former chairman of the society <START:person> Rudolph Agnew <END> will be assisting <START:person> Vinken <End> in his activities.
Mr . <START:person> Vinken <END> is the most right person in the industry.
His competitior <START:person> Steve <END> is vice chairman of Himbeldon N.V., the Ericson publishing group.
<START:person> Vinken <END> will also be assisted by <START:person> Angelina Tucci <END> who has been recognized many times For Her Good Work.
<START:person> Juilie <END> vp of Weterwood A.B., THE ZS publishing group also supported him.
Mr . <START:person> Stewart <END> is a recruiter of Metric C.D., the Drishti publishing.
He recruited <START:person> Adam <END> who will work on nlp for <START:person> Vinken <END> .
The lead conference for appointing him as a director was held by <START:person> Daniel Smith <END> at Boston.
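The Java file for training the model is:
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class NamedEntityModel {
    public static void train(String inputfile, String modelfile) throws IOException {
        Charset charset = Charset.forName("UTF-8");
        // Read the annotated training file line by line (one sentence per line)
        MarkableFileInputStreamFactory factory = new MarkableFileInputStreamFactory(new File(inputfile));
        ObjectStream<String> lineStream = new PlainTextByLineStream(factory, charset);
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
        TokenNameFinderModel model = null;
        try {
            model = NameFinderME.train("en", "person", sampleStream, TrainingParameters.defaultParams(),
                    new TokenNameFinderFactory());
        } finally {
            sampleStream.close();
        }
        // Write the trained model to disk
        BufferedOutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(modelfile));
            model.serialize(modelOut);
        } finally {
            if (modelOut != null)
                modelOut.close();
        }
    }
}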
And this is how the main class looks:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.Span;

public class NameFinder {
    public static void main(String[] args) throws IOException {
        String inputfile = "C:/setup/apache-opennlp-1.7.2/bin/ner_training_data.txt";
        String modelfile = "C:/setup/apache-opennlp-1.7.2/bin/en-tr-ner-person.bin";
        NamedEntityModel.train(inputfile, modelfile);
        String sentence = "Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 . Mr . Vinken is chairman of Elsevier N.V. , the Dutch publishing group. Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a director of this British industrial conglomerate . Peter is on leave today . "
                + "Steve is his competitor . Daniel Smith lead the ceremony. Kristen is svery happpy to know about it. Thomas will u please look into the matter as Ruby is busy";
        WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
        // Tokenize the given paragraph on whitespace
        String[] tokens = whitespaceTokenizer.tokenize(sentence);
        for (String str : tokens)
            System.out.println(str);
        // Load the model that was just trained and run it over the tokens
        InputStream inputStreamNameFinder = new FileInputStream(modelfile);
        TokenNameFinderModel model = new TokenNameFinderModel(inputStreamNameFinder);
        NameFinderME nameFinder = new NameFinderME(model);
        Span[] nameSpans = nameFinder.find(tokens);
        System.out.println(Arrays.toString(Span.spansToStrings(nameSpans, tokens)));
        for (Span s : nameSpans)
            System.out.println(s.toString() + " " + tokens[s.getStart()]);
    }
}
And the output is:
[Pierre Vinken, Vinken, Peter, Steve, Daniel Smith, Kristen, Thomas]
The trained model does not recognize names like Rudolph Agnew and Ruby. How can I train it so that it recognizes names more accurately?
Upvotes: 1
Views: 689
Reputation: 1281
+1 to the answer of @caffeinator13. Also, there are some parameters (https://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/opennlp/tools/util/TrainingParameters.html; the link is to an older version, but I guess these parameters still exist in more recent versions) that control the number of iterations and, perhaps more relevant to you, the cutoff, i.e. the minimum number of times a feature has to appear in the training data to be used by the model. This setting more or less trades precision against recall, and with so little training data you may want to make it more lenient (if I remember correctly, the defaults are a cutoff of 5 and 100 iterations). So instead of using defaultParams(), you could try:
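// Build the parameters by hand instead of calling TrainingParameters.defaultParams()
TrainingParameters tp = new TrainingParameters();
// Lower the cutoff so that rarely seen items are not discarded
tp.put(TrainingParameters.CUTOFF_PARAM, "1");
tp.put(TrainingParameters.ITERATIONS_PARAM, "100");
TokenNameFinderFactory tnff = new TokenNameFinderFactory();
// "en" and "person" are the language and type used in the question's train() call
model = NameFinderME.train("en", "person", sampleStream, tp, tnff);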
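To get a feel for why the cutoff matters with so few sentences, here is a rough sketch in plain Java (no OpenNLP involved). It counts whitespace tokens in three of the question's training lines, with the tags left out, and reports how many distinct tokens occur at least `cutoff` times. Tokens are only a stand-in for whatever the trainer really counts, and the names `CutoffSketch`, `countTokens` and `survivors` are made up for illustration. The value 5 is what I recall as the default cutoff, so treat it as an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CutoffSketch {

    // Counts how often each whitespace-separated token occurs across all lines.
    // Tokens are a simplified stand-in for what a real trainer would count.
    public static Map<String, Integer> countTokens(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Number of distinct tokens that were seen at least `cutoff` times.
    public static long survivors(Map<String, Integer> counts, int cutoff) {
        return counts.values().stream().filter(c -> c >= cutoff).count();
    }

    public static void main(String[] args) {
        // Three sentences taken from the question's training data, tags removed.
        List<String> lines = Arrays.asList(
                "Mr . Vinken is chairman of Elsevier N.V., the Dutch publishing group.",
                "Mr . Vinken is the most right person in the industry.",
                "Mr . Stewart is a recruiter of Metric C.D., the Drishti publishing.");
        Map<String, Integer> counts = countTokens(lines);
        // Compare a lenient cutoff with the assumed default of 5.
        for (int cutoff : new int[] {1, 5}) {
            System.out.println("cutoff=" + cutoff + " keeps " + survivors(counts, cutoff)
                    + " of " + counts.size() + " distinct tokens");
        }
    }
}
```

With only a handful of sentences, very little repeats often enough to pass a strict cutoff, which is the intuition behind lowering it while the training set is this small.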
Upvotes: 1
Reputation: 986
According to the OpenNLP documentation, the training data should contain at least 15000 sentences to create a model that performs well. So train it with more data, and try testing with different names and sentences rather than reusing the training data as the test data!
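As the annotated file grows, two chores can be scripted before training: checking that every `<START:...>` tag is closed by the exact token `<END>`, and setting aside some sentences that the model never sees during training. The following is a minimal sketch in plain Java, assuming one annotated sentence per line as in the question. The names `CorpusPrep`, `tagsBalanced` and `split`, the 20% held-out share and the seed are all hypothetical choices. The sketch only flags suspicious lines (for instance a closing tag spelled `<End>`); it makes no claim about how OpenNLP itself treats them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class CorpusPrep {

    // True if every <START:type> token is closed by the exact token <END>
    // and no other tag-like token (for example <End>) appears in the line.
    public static boolean tagsBalanced(String line) {
        boolean open = false;
        for (String token : line.trim().split("\\s+")) {
            if (token.startsWith("<START:") && token.endsWith(">")) {
                if (open) return false;   // previous name was never closed
                open = true;
            } else if (token.equals("<END>")) {
                if (!open) return false;  // closing tag without an opening one
                open = false;
            } else if (token.startsWith("<") && token.endsWith(">")) {
                return false;             // unknown tag, e.g. a case typo
            }
        }
        return !open;
    }

    // Shuffles a copy of the lines with a fixed seed and returns [train, heldOut].
    public static List<List<String>> split(List<String> lines, double heldOutFraction, long seed) {
        List<String> shuffled = new ArrayList<>(lines);
        Collections.shuffle(shuffled, new Random(seed));
        int heldOutSize = (int) Math.round(shuffled.size() * heldOutFraction);
        int trainSize = shuffled.size() - heldOutSize;
        List<List<String>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, trainSize)));
        parts.add(new ArrayList<>(shuffled.subList(trainSize, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        // Three lines copied from the question's training data.
        List<String> lines = Arrays.asList(
                "Mr . <START:person> Vinken <END> is chairman of Elsevier N.V., the Dutch publishing group.",
                "The former chairman of the society <START:person> Rudolph Agnew <END> will be assisting <START:person> Vinken <End> in his activities.",
                "He recruited <START:person> Adam <END> who will work on nlp for <START:person> Vinken <END> .");
        List<String> clean = new ArrayList<>();
        for (String line : lines) {
            if (tagsBalanced(line)) {
                clean.add(line);
            } else {
                System.out.println("Check tags: " + line);
            }
        }
        List<List<String>> parts = split(clean, 0.2, 42L);
        System.out.println(parts.get(0).size() + " training lines, "
                + parts.get(1).size() + " held-out lines");
    }
}
```

The training part would then be written to the file passed to the trainer, and the held-out part, with the tags stripped, used as test input, so that the model is judged on sentences it has not already memorised.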
Upvotes: 0