charpi

Reputation: 363

Stanford CoreNLP OpenIE by sentence?

I'm currently running the OpenIE system from Stanford CoreNLP through its Java command-line interface:

java -mx32g -cp stanford-corenlp-3.8.0.jar:stanford-corenlp-3.8.0-models.jar:CoreNLP-to-HTML.xsl:slf4j-api.jar:slf4j-simple.jar edu.stanford.nlp.naturalli.OpenIE test_file.txt -threads 8 -resolve_coref true

My test file contains 50,000 sentences, one per line.

The OpenIE result would be a list of tuples for all sentences. Is there a flag which I can set to have a correspondence between each tuple and the particular sentence? (e.g., some sentences may have no extractions, some may have more than one. How can I know which is which?)

My current solution is to have 50,000 files, with one sentence per file. But this is incredibly slow, as models have to be reloaded with every file.

Thanks.

Edit:

I realized that the -filelist flag makes processing much faster, which is a good thing. But the output unfortunately still does not differentiate between the different files.

Upvotes: 0

Views: 1674

Answers (1)

Gabor Angeli

Reputation: 5749

You should be able to get sentence info if you output in the ReVerb format (-format reverb). In addition, I expect you'll want to force the sentence splitter to break sentences on newlines (-ssplit.newlineIsSentenceBreak always). For example, the following command, adapted from your example, should work:

java -mx8g -cp stanford-corenlp-3.8.0.jar:stanford-corenlp-3.8.0-models.jar:CoreNLP-to-HTML.xsl:slf4j-api.jar:slf4j-simple.jar \
    edu.stanford.nlp.naturalli.OpenIE \
    -threads 8 -resolve_coref true \
    -ssplit.newlineIsSentenceBreak always \
    -format reverb \
    input.txt

For the following input file:

George Bush was born in Texas 
Obama was born in Hawaii

I get the following output on stdout (you can redirect it to a file with the -output <filename> flag):

input.txt   0   George Bush was born    0   2   2   3   3   4   1.000   George Bush was born in Texas   NNP NNP VBD VBN IN NNP  George Bush be  bear
input.txt   0   George Bush was born in Texas   0   2   2   5   5   1.000   George Bush was born in Texas   NNP NNP VBD VBN IN NNP  George Bush be bear in  Texas
input.txt   1   Obama   was born in Hawaii  0   1   1   4   4   5   1.000   Obama was born in Hawaii    NNP VBD VBN IN NNP  Obama   be bear in  Hawaii
input.txt   1   Obama   was born    0   1   1   2   2   3   1.000   Obama was born in Hawaii    NNP VBD VBN IN NNP  Obama   be  bear

The second column is the sentence index; the full list of tab-separated columns is documented in the ReVerb README:

  1. The filename (or stdin if the source is standard input)
  2. The sentence number this extraction came from.
  3. Argument1 words, space separated
  4. Relation phrase words, space separated
  5. Argument2 words, space separated
  6. The start index of argument1 in the sentence. For example, if the value is i, then the first word of argument1 is the i-1th word in the sentence.
  7. The end index of argument1 in the sentence. For example, if the value is j, then the last word of argument1 is the jth word in the sentence.
  8. The start index of relation phrase.
  9. The end index of relation phrase.
  10. The start index of argument2.
  11. The end index of argument2.
  12. The confidence that this extraction is correct. The higher the number, the more trustworthy this extraction is.
  13. The words of the sentence this extraction came from, space-separated.
  14. The part-of-speech tags for the sentence words, space-separated.
  15. The chunk tags for the sentence words, space separated. These represent a shallow parse of the sentence.
  16. A normalized version of arg1. See the BinaryExtractionNormalizer javadoc for details about how the normalization is done.
  17. A normalized version of rel.
  18. A normalized version of arg2.
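
To map extractions back to sentences, the output can be grouped on the first two columns. The following is a minimal Python sketch, not part of CoreNLP: it assumes the output is tab-separated and that the first five columns are the filename, sentence index, argument1, relation and argument2, as listed above. The function name group_by_sentence and the abbreviated sample rows are made up for illustration; real rows carry more columns, which the sketch ignores.

```python
from collections import defaultdict


def group_by_sentence(lines):
    """Group extractions by (filename, sentence index).

    Assumes tab-separated rows whose first five columns are:
    filename, sentence index, arg1, relation, arg2.
    """
    grouped = defaultdict(list)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # skip blank or malformed rows
        filename, sent_idx, arg1, rel, arg2 = cols[:5]
        grouped[(filename, int(sent_idx))].append((arg1, rel, arg2))
    return grouped


if __name__ == "__main__":
    # Hypothetical rows, truncated to the first five columns.
    sample = [
        "input.txt\t0\tGeorge Bush\twas born in\tTexas\n",
        "input.txt\t0\tGeorge Bush\twas\tborn\n",
        "input.txt\t2\tObama\twas born in\tHawaii\n",
    ]
    grouped = group_by_sentence(sample)
    # Sentences without extractions simply have no entry in the mapping.
    for idx in range(3):
        key = ("input.txt", idx)
        print(idx, grouped[key] if key in grouped else "no extractions")
```

With one sentence per line in the input and -ssplit.newlineIsSentenceBreak always, the sentence index should correspond to the line number in the input file (counting from 0, as in the output above), so any index missing from the mapping is a sentence that produced no extractions.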

Upvotes: 3
