Reputation: 51
I have trained a custom classifier to recognize named entities in the finance domain. I want to generate custom training data in the format shown at this link: http://cogcomp.cs.illinois.edu/Data/ER/conll04.corp
I can mark the custom relations by hand, but first I want to generate the data in a CoNLL-like format with my custom named entities.
I have also tried the parser in the following way, but that does not generate relation training data like Roth and Yih's data mentioned at https://nlp.stanford.edu/software/relationExtractor.html#training:
java -mx150m -cp "stanford-parser-full-2013-06-20/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "penn" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz stanford-parser-full-2013-06-20/data/testsent.txt >testsent.tree
java -mx150m -cp "stanford-parser-full-2013-06-20/*:" edu.stanford.nlp.trees.EnglishGrammaticalStructure -treeFile testsent.tree -conllx
The following is the output of the custom NER, run separately with this Python code:
cmd = ('java -mx2g -cp "*" edu.stanford.nlp.ie.NERClassifierCombiner '
       '-ner.model classifiers/custom-model.ser.gz '
       'classifiers/english.all.3class.distsim.crf.ser.gz,'
       'classifiers/english.conll.4class.distsim.crf.ser.gz,'
       'classifiers/english.muc.7class.distsim.crf.ser.gz '
       '-textFile ' + outtxt_sent + ' -outputFormat inlineXML > ' + outtxt + '.ner')
output:
<PERSON>Charles Sinclair</PERSON> <DESG>Chairman</DESG> <ORGANIZATION>-LRB- age 68 -RRB- Charles was appointed a</ORGANIZATION> <DESG>non-executive director</DESG> <ORGANIZATION>in</ORGANIZATION>
So the NER works fine standalone; I even have Java code to test it.
Here is the detailed code for the relation data generation:
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
props.setProperty("ner.model", "classifiers/custom-model.ser.gz,classifiers/english.all.3class.distsim.crf.ser.gz,classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz");
// set up Stanford CoreNLP pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// build annotation for a review
Annotation annotation = new Annotation("Charles Sinclair Chairman -LRB- age 68 -RRB- Charles was appointed a non-executive director");
pipeline.annotate(annotation);
int sentNum = 0;
.............. // rest of the code is the same as yours
output:
0 PERSON 0 O NNP/NNP Charles/Sinclair O O O
0 PERSON 1 O NNP Chairman O O O
0 PERSON 2 O -LRB-/NN/CD/-RRB-/NNP/VBD/VBN/DT -LRB-/age/68/-RRB-/Charles/was/appointed/a O O O
0 PERSON 3 O JJ/NN non-executive/director O O O
O 3 member_of_board // I will modify the relation once the data is generated with proper NER
Update: the NER tagging is OK now. The custom NER problem was solved with the following ner.model setting:
props.setProperty("ner.model", "classifiers/classifiers/english.all.3class.distsim.crf.ser.gz,classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz,");
Upvotes: 0
Views: 948
Reputation: 8739
This link shows an example of the data: http://cogcomp.cs.illinois.edu/Data/ER/conll04.corp
I don't think there is a way to produce this in Stanford CoreNLP.
After you tag the data, you need to loop through the sentences and print out the tokens in that same format, including the part-of-speech tag and the NER tag. It appears most of the columns have an "O" in them.
For each sentence that has a relationship, you need to print out a line after the sentence in the relation format. For instance, this line indicates that the previous sentence has the Live_In relationship:
7 0 Live_In
Here is some example code to generate the output for a sentence. You will need to set the pipeline to use your NER model instead by setting the ner.model property to the path of your custom model (see the short sketch after the example code below). WARNING: There may be some bugs in this code, but it should show how to access the data you need from the StanfordCoreNLP data structures.
package edu.stanford.nlp.examples;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.util.*;
import java.util.*;
import java.util.stream.Collectors;
public class CreateRelationData {
public static void main(String[] args) {
// set up pipeline properties
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
// set up Stanford CoreNLP pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// build annotation for a review
Annotation annotation = new Annotation("Joe Smith lives in Hawaii.");
pipeline.annotate(annotation);
int sentNum = 0;
for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
int tokenNum = 1;
int elementNum = 0;
int entityNum = 0;
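// start with the first entity mention in the sentence (assumes the sentence has at least one mention) and precompute its words and POS tags (joined with "/") and its NER type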
CoreMap currEntityMention = sentence.get(CoreAnnotations.MentionsAnnotation.class).get(entityNum);
String currEntityMentionWords = currEntityMention.get(CoreAnnotations.TokensAnnotation.class).stream().map(token -> token.word()).
collect(Collectors.joining("/"));
String currEntityMentionTags =
currEntityMention.get(CoreAnnotations.TokensAnnotation.class).stream().map(token -> token.tag()).
collect(Collectors.joining("/"));
String currEntityMentionNER = currEntityMention.get(CoreAnnotations.EntityTypeAnnotation.class);
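// walk over the sentence token by token; an entity mention is printed as one row, any other token as its own row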
while (tokenNum <= sentence.get(CoreAnnotations.TokensAnnotation.class).size()) {
if (currEntityMention.get(CoreAnnotations.TokensAnnotation.class).get(0).index() == tokenNum) {
System.out.println(sentNum+"\t"+currEntityMentionNER+"\t"+elementNum+"\t"+"O\t"+currEntityMentionTags+"\t"+
currEntityMentionWords+"\t"+"O\tO\tO");
// update tokenNum
tokenNum += (currEntityMention.get(CoreAnnotations.TokensAnnotation.class).size());
// update entity if there are remaining entities
entityNum++;
if (entityNum < sentence.get(CoreAnnotations.MentionsAnnotation.class).size()) {
currEntityMention = sentence.get(CoreAnnotations.MentionsAnnotation.class).get(entityNum);
currEntityMentionWords = currEntityMention.get(CoreAnnotations.TokensAnnotation.class).stream().map(token -> token.word()).
collect(Collectors.joining("/"));
currEntityMentionTags =
currEntityMention.get(CoreAnnotations.TokensAnnotation.class).stream().map(token -> token.tag()).
collect(Collectors.joining("/"));
currEntityMentionNER = currEntityMention.get(CoreAnnotations.EntityTypeAnnotation.class);
}
} else {
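// current token is not the start of an entity mention, so print it as a single row with its own POS and NER tags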
CoreLabel token = sentence.get(CoreAnnotations.TokensAnnotation.class).get(tokenNum-1);
System.out.println(sentNum+"\t"+token.ner()+"\t"+elementNum+"\tO\t"+token.tag()+"\t"+token.word()+"\t"+"O\tO\tO");
tokenNum += 1;
}
elementNum += 1;
}
sentNum++;
}
System.out.println();
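// relation line: element index of the first entity, element index of the second entity, relation type (hard-coded here for the example sentence)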
System.out.println("O\t3\tLive_In");
}
}
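To point this pipeline at your custom model, you can replace the Properties setup in main above with something like the following minimal sketch, reusing the model list from your question (the custom model path is a placeholder to adapt):
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
// ner.model takes a comma-separated list of CRF models; classifiers/custom-model.ser.gz is a placeholder path
props.setProperty("ner.model",
    "classifiers/custom-model.ser.gz," +
    "classifiers/english.all.3class.distsim.crf.ser.gz," +
    "classifiers/english.conll.4class.distsim.crf.ser.gz," +
    "classifiers/english.muc.7class.distsim.crf.ser.gz");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);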
Upvotes: 1