liavek

Reputation: 1

Is it possible to train a Stanford NER model with a data set containing regular expressions?

I have a TSV file containing entities as regular expressions, to accommodate spelling variants and inflections. Is it possible to train an NER model using such a file, or would it be necessary to manually expand the regexes to all possible spelling variants?

In the Javadoc I found the RegexNERSequenceClassifier; however, in version 3.5.2 of the classifier, the indicated path edu.stanford.nlp.ie.regexp (inside the jar file) does not contain this class.

If this is possible, can it be done with a command-line call (as with edu.stanford.nlp.ie.NERClassifierCombiner), or only programmatically?

Upvotes: 0

Views: 757

Answers (3)

Samrat Saha

Reputation: 51

You can also do this from the command line using StanfordCoreNLP:

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,regexner -file input.txt -regexner.mapping regexner.txt -ner.model classifiers/english.all.3class.distsim.crf.ser.gz,classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz,classifiers/custom-model.ser.gz

Upvotes: 0

miladydesummer

Reputation: 143

You can use the regexner annotator in your pipeline. It is a standard annotator that works both with ordinary regexes and with CoreNLP's special TokensRegex patterns, depending on the syntax you use in your mapping file. Here's an example code snippet:

import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

Properties pipelineProps = new Properties();

pipelineProps.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");

pipelineProps.setProperty("regexner.mapping", "<comma separated list of files mapping regex's or TokenRegex's to NER tags>");
// NER tags that the regexner annotator may overwrite, if need be
pipelineProps.setProperty("regexner.backgroundSymbol", "O,I-MISC,I-LOC,I-PER,I-ORG");
// Match the patterns case-insensitively
pipelineProps.setProperty("regexner.ignorecase", "true");

StanfordCoreNLP pipeline = new StanfordCoreNLP(pipelineProps);

EDIT: I don't see any reason why this wouldn't also be possible via the command line (as described in the Usage section of http://nlp.stanford.edu/software/corenlp.shtml). I haven't tried the command line myself, though, so I cannot speak from experience.
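To illustrate why the regexes in a mapping file do not need to be expanded by hand, here is a rough sketch of token-level regex matching using only java.util.regex. It is a simplified stand-in, not CoreNLP's regexner implementation: the class RegexMappingSketch, its methods, the tags COLOR_TERM and LOCATION, and the sample rows are all made up for illustration, and it assumes each mapping row holds space-separated per-token regexes, a tab, and then the tag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative stand-in only: this is NOT CoreNLP's regexner implementation.
public class RegexMappingSketch {

    /** One mapping row: a regex per token, plus the tag to assign on a match. */
    public static final class Rule {
        final List<Pattern> tokenPatterns = new ArrayList<>();
        final String tag;

        public Rule(String line) {
            // Assumed row layout: <space-separated token regexes> TAB <tag>
            String[] cols = line.split("\t");
            for (String regex : cols[0].trim().split("\\s+")) {
                tokenPatterns.add(Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
            }
            tag = cols[1].trim();
        }
    }

    /** Labels each token with the tag of the first matching rule, or "O" if none matches. */
    public static String[] tag(String[] tokens, List<Rule> rules) {
        String[] tags = new String[tokens.length];
        Arrays.fill(tags, "O");
        int i = 0;
        while (i < tokens.length) {
            int matched = 0;
            for (Rule rule : rules) {
                if (matchesAt(tokens, i, rule)) {
                    matched = rule.tokenPatterns.size();
                    Arrays.fill(tags, i, i + matched, rule.tag);
                    break;
                }
            }
            // Skip past a matched span, or advance by one token otherwise.
            i += Math.max(matched, 1);
        }
        return tags;
    }

    /** True if every token regex of the rule fully matches the tokens starting at 'start'. */
    static boolean matchesAt(String[] tokens, int start, Rule rule) {
        int n = rule.tokenPatterns.size();
        if (start + n > tokens.length) return false;
        for (int k = 0; k < n; k++) {
            if (!rule.tokenPatterns.get(k).matcher(tokens[start + k]).matches()) return false;
        }
        return true;
    }

    /** Hypothetical mapping rows; the tags are invented for this example. */
    public static List<Rule> exampleRules() {
        return Arrays.asList(
            new Rule("colou?rs?\tCOLOR_TERM"),
            new Rule("New York(er)?s?\tLOCATION"));
    }

    public static void main(String[] args) {
        String[] tokens = {"The", "colours", "of", "new", "york"};
        // prints: O COLOR_TERM O LOCATION LOCATION
        System.out.println(String.join(" ", tag(tokens, exampleRules())));
    }
}
```

A single row such as colou?rs? covers "color", "colour", "colors" and "colours", so the spelling variants stay in the pattern rather than being listed one by one.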

Upvotes: 1

Gabor Angeli

Reputation: 5749

You can take a look at TokensRegexNERAnnotator. You can define a mapping from TokensRegex expressions to NER tags, and then invoke the annotator as a custom annotator. For example, by putting the following in the properties file you pass to the StanfordCoreNLP pipeline:

# Register TokensRegexNERAnnotator under the name "regexner"
customAnnotatorClass.regexner = edu.stanford.nlp.pipeline.TokensRegexNERAnnotator
regexner.mapping = path_to_your_mapping.tab
# optional
regexner.validpospattern = ^(NN|JJ).*
# optional
regexner.ignorecase = true
annotators = tokenize,ssplit,pos,regexner
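One thing worth checking in a properties file like this: java.util.Properties only treats lines that begin with # or ! as comments, so a note written after a value on the same line ends up inside the value. The following is a minimal sketch of that check, assuming the file is read with java.util.Properties semantics; PropsCheckSketch and parse are illustrative names and not part of CoreNLP.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative helper (not part of CoreNLP) for inspecting parsed property values.
public class PropsCheckSketch {

    /** Parses properties-file text with the standard java.util.Properties rules. */
    public static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // A StringReader should not fail; rethrow unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // A trailing note on the same line becomes part of the value.
        Properties inline = parse("regexner.ignorecase = true    // optional\n");
        System.out.println("[" + inline.getProperty("regexner.ignorecase") + "]");

        // A note on its own line starting with # is skipped as a comment.
        Properties separate = parse("# optional\nregexner.ignorecase = true\n");
        System.out.println("[" + separate.getProperty("regexner.ignorecase") + "]");
    }
}
```

The first line printed is [true    // optional] and the second is [true], which is why optional-setting notes are safer on their own # lines.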

Upvotes: 1
