Reputation: 143
RegexNERAnnotator does not seem to match rules that contain an apostrophe.
Properties properties = new Properties();
properties.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions,regexner,tokensregex");
properties.setProperty("regexner.mapping", "regexfile.txt");
properties.setProperty("regexner.ignorecase", "true");
StanfordCoreNLP pipeline = new StanfordCoreNLP(properties);
In regexfile.txt,
Bachelor of (Arts|Laws|Science|Engineering) DEGREE
Lalor LOCATION PERSON
Labor ORGANIZATION
With these rules it identifies "Bachelor of Arts" as a DEGREE. Unfortunately, after I changed the file to,
Bachelor's of (Arts|Laws|Science|Engineering) DEGREE
Lalor LOCATION PERSON
Labor ORGANIZATION
it no longer identifies "Bachelor's of Arts" as a DEGREE.
Any help will be greatly appreciated. Thanks in advance. :)
Upvotes: 0
Views: 62
Reputation: 663
RegexNERAnnotator does not match against the raw text. It matches against the tokens produced by the tokenizer.
Consider a sentence containing the phrase "Bachelor's of Arts". The tokenizer splits "Bachelor's" into two separate tokens, "Bachelor" and "'s".
Within the tab-separated file regexfile.txt, whitespace inside the pattern column marks the start of a new token pattern. Your rule therefore only matches if a single token is exactly "Bachelor's", and the tokenizer never produces such a token.
Write each rule so that every token you want to match is separated by whitespace, and it will work.
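To make the split concrete, here is a minimal sketch in plain Java. It is a simplified stand-in written for illustration, not CoreNLP's tokenizer: the class name CliticSplitSketch and its tokenize method are hypothetical, and it only handles whitespace plus a trailing possessive 's.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified stand-in for a PTB-style tokenizer (NOT CoreNLP's code).
final class CliticSplitSketch {
    // A chunk ending in a possessive clitic: "Bachelor's" -> "Bachelor" + "'s".
    private static final Pattern POSSESSIVE =
            Pattern.compile("^(.+)('s)$", Pattern.CASE_INSENSITIVE);

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // empty input yields one empty chunk
            }
            Matcher m = POSSESSIVE.matcher(chunk);
            if (m.matches()) {
                tokens.add(m.group(1)); // the word itself
                tokens.add(m.group(2)); // the clitic as its own token
            } else {
                tokens.add(chunk);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Bachelor, 's, of, Arts]
        System.out.println(tokenize("Bachelor's of Arts"));
    }
}
```

The point is only that no single token in the output equals "Bachelor's".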
Bachelor 's of (Arts|Laws|Science|Engineering) DEGREE
Lil ' Jon RAPPER
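The token-by-token matching described above can be sketched in plain Java. This is an illustration under stated assumptions, not RegexNER's actual implementation: TokenRuleSketch and findMatch are hypothetical names, the token list is written out by hand instead of coming from a tokenizer, and case-insensitive matching stands in for regexner.ignorecase.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of per-token rule matching (NOT CoreNLP's code).
final class TokenRuleSketch {
    // Returns the index of the first token window the rule matches, or -1.
    static int findMatch(String rule, List<String> tokens) {
        // Whitespace in the rule separates one token pattern from the next.
        String[] parts = rule.trim().split("\\s+");
        Pattern[] patterns = new Pattern[parts.length];
        for (int i = 0; i < parts.length; i++) {
            patterns[i] = Pattern.compile(parts[i], Pattern.CASE_INSENSITIVE);
        }
        for (int start = 0; start + patterns.length <= tokens.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < patterns.length && ok; i++) {
                // Each pattern must match one whole token.
                ok = patterns[i].matcher(tokens.get(start + i)).matches();
            }
            if (ok) {
                return start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Tokens as the tokenizer would produce them for "a Bachelor's of Arts".
        List<String> tokens = Arrays.asList("a", "Bachelor", "'s", "of", "Arts");
        // prints -1: no single token equals "Bachelor's"
        System.out.println(findMatch("Bachelor's of (Arts|Laws|Science|Engineering)", tokens));
        // prints 1: four token patterns line up with four tokens
        System.out.println(findMatch("Bachelor 's of (Arts|Laws|Science|Engineering)", tokens));
    }
}
```

In the real mapping file, remember that the pattern and the entity type are separated by a tab, while the token patterns inside the first column are separated by spaces.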
Upvotes: 1