Reputation: 18459
I have text that is already tokenized, sentence-split, and POS-tagged.
I would like to use CoreNLP to additionally annotate lemmas (lemma), named entities (ner), constituency and dependency parses (parse), and coreferences (dcoref).
Is there a combination of command-line options and properties-file settings that makes this possible from the command line?
According to this question, I can ask the parser to view whitespace as delimiting tokens, and newlines as delimiting sentences by adding this to my properties file:
tokenize.whitespace = true
ssplit.eolonly = true
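To make the effect of these two properties concrete, here is a rough sketch in plain Java of the segmentation they ask for: one sentence per line, tokens split on whitespace. This is not CoreNLP code; the class and method names are made up for illustration, and skipping blank lines is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: mimics the segmentation requested by
// tokenize.whitespace = true and ssplit.eolonly = true.
// Hypothetical helper, not CoreNLP's implementation.
class PreTokenizedReader {

    // Returns one token list per non-empty input line.
    static List<List<String>> read(String text) {
        List<List<String>> sentences = new ArrayList<>();
        for (String line : text.split("\\R")) {        // every line break ends a sentence
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;                               // assumption: blank lines are skipped
            }
            sentences.add(Arrays.asList(trimmed.split("\\s+")));  // whitespace delimits tokens
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(read("The dog barks .\nIt 's loud ."));
    }
}
```

With input prepared this way, punctuation and clitics such as 's stay exactly as they were tokenized upstream.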
This works well, so all that remains is to specify to CoreNLP that I would like to provide POS tags too.
When using the Stanford Parser on its own, it seems to be possible to have it use existing POS tags, but copying that syntax to the CoreNLP invocation doesn't seem to work. For example, this does not work:
java -cp *:./* -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props my-properties-file -outputFormat xml -outputDirectory my-output-dir -sentences newline -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory -file my-annotated-text.txt
While this question covers programmatic invocation, I'm invoking CoreNLP from the command line as part of a larger system, so I'm really asking whether this can be achieved with command-line options.
Upvotes: 2
Views: 905
Reputation: 8739
I don't think this is possible with command line options.
If you want, you could write a custom annotator and include it in your pipeline.
Here is some sample code:
package edu.stanford.nlp.pipeline;

import edu.stanford.nlp.ling.CoreAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import java.util.Collections;
import java.util.Properties;
import java.util.Set;

// Reads tokens of the form word<separator>TAG and stores TAG as the POS annotation.
public class ProvidedPOSTaggerAnnotator implements Annotator {

  private final String tagSeparator;

  public ProvidedPOSTaggerAnnotator(String annotatorName, Properties props) {
    tagSeparator = props.getProperty(annotatorName + ".tagSeparator", "_");
  }

  @Override
  public void annotate(Annotation annotation) {
    for (CoreLabel token : annotation.get(CoreAnnotations.TokensAnnotation.class)) {
      String raw = token.word();
      // split on the last separator, treated as a literal string rather than a regex
      int idx = raw.lastIndexOf(tagSeparator);
      if (idx < 0) {
        continue; // no tag attached; leave the token untouched
      }
      // set the word with the POS tag removed
      token.set(CoreAnnotations.TextAnnotation.class, raw.substring(0, idx));
      // set the POS
      token.set(CoreAnnotations.PartOfSpeechAnnotation.class, raw.substring(idx + tagSeparator.length()));
    }
  }

  // Requirement signatures as in CoreNLP 3.7.0 and later; older releases use Set<Requirement>.
  @Override
  public Set<Class<? extends CoreAnnotation>> requires() {
    return Collections.singleton(CoreAnnotations.TokensAnnotation.class);
  }

  @Override
  public Set<Class<? extends CoreAnnotation>> requirementsSatisfied() {
    return Collections.singleton(CoreAnnotations.PartOfSpeechAnnotation.class);
  }
}
This should work if each input token carries its POS tag after a "_" separator, for example dog_NN. You can change the separator with the forcedpos.tagSeparator property.
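The annotator takes everything after the last separator as the tag, so a word that itself contains the separator keeps it. A stand-alone sketch of that rule, with no CoreNLP dependency, might look like the following. The TaggedToken class is hypothetical and exists only for illustration. It uses lastIndexOf rather than String.split because split interprets its argument as a regular expression, which would misbehave for separators such as "|" or ".".

```java
// Stand-alone illustration of splitting "word<sep>TAG" on the LAST separator.
// TaggedToken is a hypothetical helper, not part of CoreNLP.
class TaggedToken {
    final String word;
    final String tag;   // null when the input carries no separator

    TaggedToken(String word, String tag) {
        this.word = word;
        this.tag = tag;
    }

    static TaggedToken parse(String raw, String separator) {
        int idx = raw.lastIndexOf(separator);       // literal match, no regex involved
        if (idx < 0) {
            return new TaggedToken(raw, null);      // untagged token: keep it unchanged
        }
        String word = raw.substring(0, idx);
        String tag = raw.substring(idx + separator.length());
        return new TaggedToken(word, tag);
    }

    public static void main(String[] args) {
        TaggedToken t = parse("New_York_NNP", "_");
        System.out.println(t.word + " -> " + t.tag);
    }
}
```

The same rule handles the "/" separator from the question, where a token like 1/2/CD should yield the word 1/2 and the tag CD.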
If you add customAnnotatorClass.forcedpos = edu.stanford.nlp.pipeline.ProvidedPOSTaggerAnnotator to the properties file, include the above class in your CLASSPATH, and then include "forcedpos" in your list of annotators after "tokenize", you should be able to pass in your own POS tags.
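Putting the pieces together, the properties file for the question's setup might look like the text embedded in the sketch below. The annotator list and the "/" separator are assumptions taken from the question, and customAnnotatorClass.<name> is the key CoreNLP documents for registering a custom annotator. The ForcedPosConfig class is hypothetical; it only parses the text with java.util.Properties and checks that "forcedpos" is listed after "tokenize", without touching CoreNLP itself.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper: holds an example properties file and sanity-checks it.
class ForcedPosConfig {

    // Example properties text; annotator list and separator are assumptions.
    static final String EXAMPLE = String.join("\n",
        "annotators = tokenize, ssplit, forcedpos, lemma, ner, parse, dcoref",
        "tokenize.whitespace = true",
        "ssplit.eolonly = true",
        "customAnnotatorClass.forcedpos = edu.stanford.nlp.pipeline.ProvidedPOSTaggerAnnotator",
        "forcedpos.tagSeparator = /");

    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // True if "forcedpos" appears in the annotator list after "tokenize".
    static boolean forcedPosAfterTokenize(Properties props) {
        String annotators = props.getProperty("annotators", "").trim();
        List<String> names = Arrays.asList(annotators.split("\\s*,\\s*"));
        int tokenize = names.indexOf("tokenize");
        int forcedPos = names.indexOf("forcedpos");
        return tokenize >= 0 && forcedPos > tokenize;
    }

    public static void main(String[] args) {
        System.out.println(forcedPosAfterTokenize(load(EXAMPLE)));
    }
}
```

The text in EXAMPLE can be saved as the file passed to -props; the check is only a convenience for catching a misordered annotator list before launching the pipeline.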
I may clean this up some more and actually include it in future releases for people!
I have not had time to actually test this code. If you try it out and find errors, please let me know and I'll fix them!
Upvotes: 2