Reputation: 363
I'm currently running the OpenIE system from Stanford CoreNLP through its Java command-line interface:
java -mx32g -cp stanford-corenlp-3.8.0.jar:stanford-corenlp-3.8.0-models.jar:CoreNLP-to-HTML.xsl:slf4j-api.jar:slf4j-simple.jar edu.stanford.nlp.naturalli.OpenIE test_file.txt -threads 8 -resolve_coref true
My test file contains 50,000 sentences, one per line.
The OpenIE output is a flat list of tuples covering all sentences. Is there a flag I can set to map each tuple back to the sentence it came from? (Some sentences may have no extractions and some may have more than one, so how can I tell which is which?)
My current solution is to have 50,000 files, with one sentence per file. But this is incredibly slow, as models have to be reloaded with every file.
Thanks.
Edit:
I realized that the -filelist flag makes processing much faster, which is a good thing. But the output unfortunately still does not differentiate between the different files.
Upvotes: 0
Views: 1674
Reputation: 5749
You should be able to get sentence info if you output using the ReVerb format (-format reverb). In addition, I expect you'll want to force the tokenizer to split sentences on newlines (-ssplit.newlineIsSentenceBreak always). For example, the following command, adapted from your example, should work:
java -mx8g -cp stanford-corenlp-3.8.0.jar:stanford-corenlp-3.8.0-models.jar:CoreNLP-to-HTML.xsl:slf4j-api.jar:slf4j-simple.jar \
edu.stanford.nlp.naturalli.OpenIE \
-threads 8 -resolve_coref true \
-ssplit.newlineIsSentenceBreak always \
-format reverb \
input.txt
For the following input file:
George Bush was born in Texas
Obama was born in Hawaii
I get the following output on stdout (you can redirect it to a file with the -output <filename> flag):
input.txt 0 George Bush was born 0 2 2 3 3 4 1.000 George Bush was born in Texas NNP NNP VBD VBN IN NNP George Bush be bear
input.txt 0 George Bush was born in Texas 0 2 2 5 5 1.000 George Bush was born in Texas NNP NNP VBD VBN IN NNP George Bush be bear in Texas
input.txt 1 Obama was born in Hawaii 0 1 1 4 4 5 1.000 Obama was born in Hawaii NNP VBD VBN IN NNP Obama be bear in Hawaii
input.txt 1 Obama was born 0 1 1 2 2 3 1.000 Obama was born in Hawaii NNP VBD VBN IN NNP Obama be bear
The second column is the sentence index; the full list of tab-separated columns is documented in the ReVerb README.
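Once you have that output, grouping the tuples by sentence takes only a few lines of post-processing. Here is a minimal Python sketch, not something that ships with CoreNLP. It assumes the columns are tab-separated and that the first five are file name, sentence index, argument 1, relation, and argument 2, which is my reading of the ReVerb column order; the function names and the shortened sample rows are made up for illustration.

```python
from collections import defaultdict


def group_by_sentence(lines):
    """Map (filename, sentence_index) to a list of (arg1, relation, arg2)."""
    grouped = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        # Assumed column order: file, sentence index, arg1, relation, arg2.
        filename, sent_idx, arg1, rel, arg2 = fields[:5]
        grouped[(filename, int(sent_idx))].append((arg1, rel, arg2))
    return grouped


def sentences_without_extractions(grouped, filename, num_sentences):
    """Return the 0-based indices of sentences that produced no tuple."""
    return [i for i in range(num_sentences) if (filename, i) not in grouped]


# Shortened, hypothetical rows: only the first five columns are kept.
sample = [
    "input.txt\t0\tGeorge Bush\twas\tborn\n",
    "input.txt\t0\tGeorge Bush\twas born in\tTexas\n",
    "input.txt\t2\tObama\twas born in\tHawaii\n",
]
grouped = group_by_sentence(sample)
missing = sentences_without_extractions(grouped, "input.txt", 3)
```

For real output you would pass an open file handle instead of the sample list (any iterable of lines works) and set num_sentences to the number of lines in your input file. Sentences with several extractions show up as longer lists, and sentences with none are reported by the second helper.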
Upvotes: 3