Reputation: 1
I have a TSV file containing entities as regular expressions, to accommodate spelling variants and inflections. Is it possible to train a NER model using such a file, or would it be necessary to manually expand the regexes to all possible spelling variants?
In the Javadoc, I found RegexNERSequenceClassifier; however, in version 3.5.2 of the classifier, the indicated package edu.stanford.nlp.ie.regexp (inside the jar file) does not contain this class.
Could this be done and, if so, with a command-line call (as with edu.stanford.nlp.ie.NERClassifierCombiner) or only programmatically?
Upvotes: 0
Views: 757
Reputation: 51
You can also do this from the command line using StanfordCoreNLP:
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,regexner -file input.txt -regexner.mapping regexner.txt -ner.model classifiers/english.all.3class.distsim.crf.ser.gz,classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz,classifiers/custom-model.ser.gz
Upvotes: 0
Reputation: 143
You can use the regexner annotator in your pipeline. It is a standard annotator that works with both ordinary regexes and the special CoreNLP TokensRegex patterns (depending on the syntax you use in your mapping file).
Here's an example code snippet:
import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

Properties pipelineProps = new Properties();
pipelineProps.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
pipelineProps.setProperty("regexner.mapping", "<comma separated list of files mapping regexes or TokensRegex patterns to NER tags>");
pipelineProps.setProperty("regexner.backgroundSymbol", "O,I-MISC,I-LOC,I-PER,I-ORG"); // NER tags that should be overwritten by the regexner annotator, if need be
pipelineProps.setProperty("regexner.ignorecase", "true");
StanfordCoreNLP pipeline = new StanfordCoreNLP(pipelineProps);
EDIT: I don't see any reason why this wouldn't also work from the command line (as described in the Usage section of http://nlp.stanford.edu/software/corenlp.shtml). I haven't tried the command line myself, though, so I cannot speak from experience.
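To make the mapping-file idea concrete: as far as I understand the regexner format, each line holds a pattern and an NER tag separated by a tab, and the pattern is a space-separated sequence of regexes, one per token. The sketch below is a simplified, stdlib-only illustration of that token-wise matching. It is not CoreNLP's implementation, and the class name, the rule rows, and the first-match-wins policy are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Simplified illustration only: NOT how CoreNLP implements regexner.
class RegexNerSketch {

    // One mapping-file row: space-separated token regexes, a tab, then the label.
    static final class Rule {
        final List<Pattern> tokenPatterns = new ArrayList<>();
        final String label;

        Rule(String mappingLine) {
            String[] cols = mappingLine.split("\t");
            for (String tokenRegex : cols[0].split(" ")) {
                // Case-insensitive here, loosely mirroring regexner.ignorecase = true.
                tokenPatterns.add(Pattern.compile(tokenRegex, Pattern.CASE_INSENSITIVE));
            }
            label = cols[1];
        }

        // True if every token regex fully matches the token at the same offset.
        boolean matchesAt(List<String> tokens, int start) {
            if (start + tokenPatterns.size() > tokens.size()) {
                return false;
            }
            for (int i = 0; i < tokenPatterns.size(); i++) {
                if (!tokenPatterns.get(i).matcher(tokens.get(start + i)).matches()) {
                    return false;
                }
            }
            return true;
        }
    }

    // Labels tokens with the first rule that matches at each position, else "O".
    static List<String> tag(List<String> tokens, List<Rule> rules) {
        List<String> labels = new ArrayList<>(Collections.nCopies(tokens.size(), "O"));
        int i = 0;
        while (i < tokens.size()) {
            int consumed = 0;
            for (Rule rule : rules) {
                if (rule.matchesAt(tokens, i)) {
                    for (int j = 0; j < rule.tokenPatterns.size(); j++) {
                        labels.set(i + j, rule.label);
                    }
                    consumed = rule.tokenPatterns.size();
                    break;
                }
            }
            i += Math.max(consumed, 1);
        }
        return labels;
    }

    public static void main(String[] args) {
        // Hypothetical rows, written the way a TSV mapping file would hold them.
        List<Rule> rules = Arrays.asList(
            new Rule("colou?rs?\tATTRIBUTE"),
            new Rule("New York(er)?s?\tLOCATION"));
        List<String> tokens = Arrays.asList("The", "colours", "of", "New", "Yorkers");
        System.out.println(tag(tokens, rules));
    }
}
```

The point for the original question is that spelling variants and inflections can stay as regexes in the mapping file, since each token regex is matched against a whole token, so there should be no need to expand them by hand.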
Upvotes: 1
Reputation: 5749
You can take a look at TokensRegexNERAnnotator. You can define a mapping from TokensRegex expressions to NER tags and then invoke the annotator as a custom annotator, for example by putting the following in the properties file you pass to the StanfordCoreNLP pipeline:
customAnnotatorClass.regexner = edu.stanford.nlp.pipeline.TokensRegexNERAnnotator
regexner.mapping = path_to_your_mapping.tab
# optional
regexner.validpospattern = ^(NN|JJ).*
# optional
regexner.ignorecase = true
annotators = tokenize,ssplit,pos,regexner
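One thing to watch when writing such a properties file: assuming it is read with java.util.Properties (which is my assumption about how the file gets loaded), only lines that start with # or ! count as comments. Any note placed after a value on the same line, such as a trailing // remark, becomes part of the value. A minimal stdlib-only sketch of that behaviour, with a made-up class and method name:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration; not part of CoreNLP.
class PropsCommentSketch {

    // Parses .properties text with the standard java.util.Properties loader.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Trailing note on the same line: it ends up inside the value.
        Properties trailing = parse("regexner.ignorecase = true // optional\n");
        // Note on its own line starting with '#': it is skipped as a comment.
        Properties ownLine = parse("# optional\nregexner.ignorecase = true\n");

        System.out.println("[" + trailing.getProperty("regexner.ignorecase") + "]");
        System.out.println("[" + ownLine.getProperty("regexner.ignorecase") + "]");
    }
}
```

Keeping optional-setting notes on their own # lines avoids a flag such as regexner.ignorecase silently receiving a value other than true.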
Upvotes: 1