Reputation: 51
I have trained a custom classifier to recognize named entities in the finance domain. I want to generate custom training data in the format shown at this link: http://cogcomp.cs.illinois.edu/Data/ER/conll04.corp
I can mark the custom relations by hand, but first I want to generate the data in a CoNLL-like format with my custom named entities.
I have also tried the parser in the following way, but that does not generate relation training data like Roth and Yih's data mentioned at https://nlp.stanford.edu/software/relationExtractor.html#training:
java -mx150m -cp "stanford-parser-full-2013-06-20/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "penn" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz stanford-parser-full-2013-06-20/data/testsent.txt >testsent.tree
java -mx150m -cp "stanford-parser-full-2013-06-20/*:" edu.stanford.nlp.trees.EnglishGrammaticalStructure -treeFile testsent.tree -conllx
The following is the output of the custom NER, run separately with this Python code:
cmd = ('java -mx2g -cp "*" edu.stanford.nlp.ie.NERClassifierCombiner '
       '-ner.model classifiers/custom-model.ser.gz '
       'classifiers/english.all.3class.distsim.crf.ser.gz,'
       'classifiers/english.conll.4class.distsim.crf.ser.gz,'
       'classifiers/english.muc.7class.distsim.crf.ser.gz '
       '-textFile ' + outtxt_sent + ' -outputFormat inlineXML > ' + outtxt + '.ner')
output:
<PERSON>Charles Sinclair</PERSON> <DESG>Chairman</DESG> <ORGANIZATION>-LRB- age 68 -RRB- Charles was appointed a</ORGANIZATION> <DESG>non-executive director</DESG> <ORGANIZATION>in</ORGANIZATION>
So the NER works fine standalone; I even have Java code to test it.
Here is the detailed code for the relation data generation:
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
props.setProperty("ner.model", "classifiers/custom-model.ser.gz,classifiers/english.all.3class.distsim.crf.ser.gz,classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz");
// set up Stanford CoreNLP pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// build annotation for a review
Annotation annotation = new Annotation("Charles Sinclair Chairman -LRB- age 68 -RRB- Charles was appointed a non-executive director");
pipeline.annotate(annotation);
int sentNum = 0;
.............. // rest of the code is the same as yours
output:
0 PERSON 0 O NNP/NNP Charles/Sinclair O O O
0 PERSON 1 O NNP Chairman O O O
0 PERSON 2 O -LRB-/NN/CD/-RRB-/NNP/VBD/VBN/DT -LRB-/age/68/-RRB-/Charles/was/appointed/a O O O
0 PERSON 3 O JJ/NN non-executive/director O O O
O 3 member_of_board // I will modify the relation once the data is generated with proper NER
Update: the NER tagging is OK now. The custom NER problem was solved with the following ner.model setting:
props.setProperty("ner.model", "classifiers/classifiers/english.all.3class.distsim.crf.ser.gz,classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz,");
Upvotes: 0
Views: 948
Reputation: 8739
This link shows an example of the data: http://cogcomp.cs.illinois.edu/Data/ER/conll04.corp
I don't think there is a way to produce this in Stanford CoreNLP.
After you tag the data, you need to loop through the sentences and print out the tokens in that same format, including the part-of-speech tag and the NER tag. It appears most of the columns have an "O" in them.
For each sentence that has a relationship, you need to print out a line after the sentence in the relation format. For instance, this line indicates that the previous sentence has the Live_In relationship:
7 0 Live_In
Here is some example code to generate the output for a sentence. You will need to set the pipeline to use your NER model instead by setting the ner.model property to the path of your custom model (see the short sketch after the example code below). WARNING: There may be some bugs in this code, but it should show how to access the data you need from the StanfordCoreNLP data structures.
package edu.stanford.nlp.examples;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.util.*;
import java.util.*;
import java.util.stream.Collectors;
public class CreateRelationData {
public static void main(String[] args) {
// set up pipeline properties
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
// set up Stanford CoreNLP pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// build annotation for a review
Annotation annotation = new Annotation("Joe Smith lives in Hawaii.");
pipeline.annotate(annotation);
int sentNum = 0;
for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
int tokenNum = 1;
int elementNum = 0;
int entityNum = 0;
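// start with the first entity mention in the sentence (assumes the sentence has at least one mention) and precompute its words and POS tags (joined with "/") and its NER type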
CoreMap currEntityMention = sentence.get(CoreAnnotations.MentionsAnnotation.class).get(entityNum);
String currEntityMentionWords = currEntityMention.get(CoreAnnotations.TokensAnnotation.class).stream().map(token -> token.word()).
collect(Collectors.joining("/"));
String currEntityMentionTags =
currEntityMention.get(CoreAnnotations.TokensAnnotation.class).stream().map(token -> token.tag()).
collect(Collectors.joining("/"));
String currEntityMentionNER = currEntityMention.get(CoreAnnotations.EntityTypeAnnotation.class);
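// walk over the sentence token by token; an entity mention is printed as one row, any other token as its own row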
while (tokenNum <= sentence.get(CoreAnnotations.TokensAnnotation.class).size()) {
if (currEntityMention.get(CoreAnnotations.TokensAnnotation.class).get(0).index() == tokenNum) {
System.out.println(sentNum+"\t"+currEntityMentionNER+"\t"+elementNum+"\t"+"O\t"+currEntityMentionTags+"\t"+
currEntityMentionWords+"\t"+"O\tO\tO");
// update tokenNum
tokenNum += (currEntityMention.get(CoreAnnotations.TokensAnnotation.class).size());
// update entity if there are remaining entities
entityNum++;
if (entityNum < sentence.get(CoreAnnotations.MentionsAnnotation.class).size()) {
currEntityMention = sentence.get(CoreAnnotations.MentionsAnnotation.class).get(entityNum);
currEntityMentionWords = currEntityMention.get(CoreAnnotations.TokensAnnotation.class).stream().map(token -> token.word()).
collect(Collectors.joining("/"));
currEntityMentionTags =
currEntityMention.get(CoreAnnotations.TokensAnnotation.class).stream().map(token -> token.tag()).
collect(Collectors.joining("/"));
currEntityMentionNER = currEntityMention.get(CoreAnnotations.EntityTypeAnnotation.class);
}
} else {
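// current token is not the start of an entity mention, so print it as a single row with its own POS and NER tags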
CoreLabel token = sentence.get(CoreAnnotations.TokensAnnotation.class).get(tokenNum-1);
System.out.println(sentNum+"\t"+token.ner()+"\t"+elementNum+"\tO\t"+token.tag()+"\t"+token.word()+"\t"+"O\tO\tO");
tokenNum += 1;
}
elementNum += 1;
}
sentNum++;
}
System.out.println();
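// relation line: element index of the first entity, element index of the second entity, relation type (hard-coded here for the example sentence)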
System.out.println("O\t3\tLive_In");
}
}
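To point this pipeline at your custom model, you can replace the Properties setup in main above with something like the following minimal sketch, reusing the model list from your question (the custom model path is a placeholder to adapt):
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
// ner.model takes a comma-separated list of CRF models; classifiers/custom-model.ser.gz is a placeholder path
props.setProperty("ner.model",
    "classifiers/custom-model.ser.gz," +
    "classifiers/english.all.3class.distsim.crf.ser.gz," +
    "classifiers/english.conll.4class.distsim.crf.ser.gz," +
    "classifiers/english.muc.7class.distsim.crf.ser.gz");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);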
Upvotes: 1