Reputation: 213
I've used grep and awk to extract named entities from Stanford CRF-NER 'inline XML' for English texts, and I was hoping to use my same larger workflow for other human languages.
I've been experimenting a bit with French (Spanish seems to throw a Java error for me, which is another story). Running
java -cp stanford-corenlp-4.0.0/stanford-corenlp-4.0.0.jar:stanford-corenlp-4.0.0-models-french.jar edu.stanford.nlp.pipeline.StanfordCoreNLP -properties StanfordCoreNLP-french.properties -file french.txt -outputFormat text
gives me the standard text output, which breaks out every type of annotation for each sentence, including multi-word entities that are correctly grouped together, like this:
Extracted the following NER entity mentions:
Puget Sound LOC I-LOC:0.9822963367809222
lac Washington LOC I-LOC:0.9908561818309122
Canada LOC I-LOC:0.9804363858330243
États-Unis LOC I-LOC:0.9973224740712531
I know it's possible to parse that, but it seems like a lot of wasted processing when I really just want a list of entities from the entire file.
I've also been able to get columns of words and NER types using
java -cp stanford-corenlp-4.0.0/stanford-corenlp-4.0.0.jar:stanford-corenlp-4.0.0-models-french.jar edu.stanford.nlp.pipeline.StanfordCoreNLP -properties StanfordCoreNLP-french.properties -file french.txt -output.columns word,ner -outputFormat conll
which produces:
Puget I-LOC
Sound I-LOC
et O
le O
lac I-LOC
Washington I-LOC
, O
à O
environ O
155 O
km O
à O
le O
sud O
de O
la O
frontière O
entre O
le O
Canada I-LOC
et O
les O
États-Unis I-LOC
. O
In addition to being a little messy, this breaks apart multi-word entities, making it impossible to stitch them back together at scale.
I would prefer to get the inline XML (e.g. <LOCATION>Puget</LOCATION><LOCATION>Sound</LOCATION>) since I've already developed a workflow to use that, but if that's not possible, is there at least a way to get a TSV output (like the conll version earlier) that groups multi-word entities together like in the text output?
I have looked into the entity mentions annotator, but I haven't been able to figure it out, and if it requires training, then I'd rather not use it. The default text output's grouping is good enough for my needs.
Upvotes: 0
Views: 526
Reputation: 8739
I added inlineXML as an outputFormat option in the latest code on GitHub. This change is not available in version 4.1.0, which just came out. There are instructions on the GitHub site about how to build the code into a jar.
GitHub site: https://github.com/stanfordnlp/CoreNLP
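If building from source is not an option, one possible workaround is to regroup the conll output with awk. This is only a minimal sketch, assuming the word,ner columns are tab-separated and sentences are separated by blank lines; group_entities is a hypothetical helper name, not part of CoreNLP. In the sample output above every entity token is tagged I-TYPE with no B- marker, so two distinct entities of the same type that sit directly next to each other would be merged into one.

```shell
# Hypothetical helper: reads "word<TAB>ner" lines on stdin and prints
# "entity<TAB>type" for each run of consecutive tokens sharing a type.
group_entities() {
  awk -F'\t' '
    # Print the entity collected so far, if any, and reset the buffer.
    function emit() {
      if (text != "") printf "%s\t%s\n", text, type
      text = ""; type = ""
    }
    {
      tag = $2
      sub(/^[BI]-/, "", tag)                      # I-LOC -> LOC
      if (NF < 2 || tag == "O" || tag == "") {    # blank line or non-entity token
        emit()
      } else if (tag == type) {                   # same type as previous token: extend
        text = text " " $1
      } else {                                    # a new entity starts here
        emit()
        text = $1; type = tag
      }
    }
    END { emit() }
  '
}
```

It can be fed the conll output directly, for example `group_entities < french.txt.conll`, assuming the conll run wrote its output to a file with that name. For the sample in the question this should give one line per entity, such as Puget Sound and lac Washington, each followed by a tab and LOC.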
Upvotes: 1