Reputation: 3136
I am using Stanford CoreNLP to run NER on a list of short documents:
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP
-annotators tokenize,ssplit,pos,lemma,ner -ssplit.eolonly -pos.model edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger
-ner.model edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz
-file .../input -outputDirectory .../stanford_ner
The problem is that the CharacterOffsetBegin and CharacterOffsetEnd values I get for each token continue counting from the previous documents. For example, the very first token of document_2 has a CharacterOffsetBegin of 240 rather than 0. Is there a command-line option that resets the offsets for each document? Any help would be greatly appreciated, thanks!
Upvotes: 1
Views: 177
Reputation: 619
Yes, if you split your input into separate files. There's a -filelist option for batch jobs. In your case, each line of the file list is the path to one document file. For example, if all of your separate document files are in a directory .../input, then input.txt contains something like:
.../input/doc_1.txt
.../input/doc_2.txt
.../input/doc_3.txt
It might be a good idea to use full paths there if possible. Then you'd execute CoreNLP like this:
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP
-annotators tokenize,ssplit,pos,lemma,ner -ssplit.eolonly -pos.model edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger
-ner.model edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz
-filelist .../input.txt -outputDirectory .../stanford_ner
If you write a script to split input into multiple documents, it would probably be a good idea to have it generate input.txt at the same time.
Because each file is then processed as its own document, this restarts the token counter, and the character offsets with it, for each document you process.
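A splitting script along those lines could look like the sketch below. It assumes the combined input holds one document per line, which the question does not actually state, so adjust the splitting rule to however your documents are delimited. The function name split_documents and the doc_N.txt naming are only illustrative. It writes absolute paths into the file list, and skips blank lines.

```python
from pathlib import Path


def split_documents(source: Path, out_dir: Path, file_list: Path) -> list[Path]:
    """Write each non-empty line of `source` to its own file in `out_dir`
    and record the absolute path of every file in `file_list`.

    Assumption: one document per line in `source`.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []

    with source.open(encoding="utf-8") as src:
        for line in src:
            text = line.rstrip("\n")
            if not text.strip():
                continue  # skip blank lines instead of creating empty documents
            # Hypothetical naming scheme: doc_1.txt, doc_2.txt, ...
            doc_path = (out_dir / f"doc_{len(written) + 1}.txt").resolve()
            doc_path.write_text(text + "\n", encoding="utf-8")
            written.append(doc_path)

    # One path per line, which is the layout described above for the file list.
    file_list.write_text("".join(f"{p}\n" for p in written), encoding="utf-8")
    return written
```

Calling split_documents(Path("combined.txt"), Path("input"), Path("input.txt")) (all three names are placeholders) would produce the per-document files and an input.txt you can pass to -filelist.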
Upvotes: 1