Stanford JavaNLP RegexNERAnnotator Apostrophe

Question

RegexNERAnnotator does not seem to match rules that contain an apostrophe.

    import java.util.Properties;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    Properties properties = new Properties();
    properties.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions,regexner,tokensregex");
    // Mapping file that holds the RegexNER rules
    properties.setProperty("regexner.mapping", "regexfile.txt");
    properties.setProperty("regexner.ignorecase", "true");

    StanfordCoreNLP pipeline = new StanfordCoreNLP(properties);

In regexfile.txt:

    Bachelor of (Arts|Laws|Science|Engineering)    DEGREE
    Lalor    LOCATION    PERSON
    Labor    ORGANIZATION

With this file, the annotator identifies "Bachelor of Arts". Unfortunately, after I changed the file to:

    Bachelor's of (Arts|Laws|Science|Engineering)    DEGREE
    Lalor    LOCATION    PERSON
    Labor    ORGANIZATION

Now it does not identify "Bachelor's of Arts" as a DEGREE.

Any help will be greatly appreciated. Thanks in advance. :)

alsora · Accepted Answer

The RegexNERAnnotator works on the output of the tokenizer, so its rules are matched token by token.

Consider a sentence containing the phrase "Bachelor's of Arts". The tokenizer splits the word "Bachelor" from the apostrophe, producing two separate tokens: "Bachelor" and "'s".

Within the tab-separated file regexfile.txt, whitespace inside a pattern marks the start of a new token. This means that your custom rule only matches a single token that is exactly "Bachelor's", and the tokenizer never produces such a token.

Write the rules so that each token you want to match is separated by a space, and everything will work.

    Bachelor 's of (Arts|Laws|Science|Engineering)    DEGREE
    Lil ' Jon    RAPPER
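
To see why the extra space matters, here is a minimal sketch of token-by-token matching in plain Java. It is a simplified model and not CoreNLP's actual implementation: the class TokenRuleSketch and its matches method are hypothetical names, and the token list is written by hand on the assumption that the tokenizer splits "Bachelor's" into "Bachelor" and "'s".

```java
import java.util.List;
import java.util.regex.Pattern;

// Simplified model of token-by-token rule matching. This is NOT CoreNLP code.
final class TokenRuleSketch {

    // Returns true if the rule's space-separated regexes match some
    // contiguous run of tokens, one regex per token, ignoring case.
    static boolean matches(String rule, List<String> tokens) {
        String[] parts = rule.trim().split(" +");
        Pattern[] patterns = new Pattern[parts.length];
        for (int i = 0; i < parts.length; i++) {
            patterns[i] = Pattern.compile(parts[i], Pattern.CASE_INSENSITIVE);
        }
        // Try every start position where the whole rule still fits.
        for (int start = 0; start + patterns.length <= tokens.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < patterns.length && ok; i++) {
                ok = patterns[i].matcher(tokens.get(start + i)).matches();
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hand-written tokens, assuming "Bachelor's" is split into "Bachelor" and "'s".
        List<String> tokens = List.of("a", "Bachelor", "'s", "of", "Arts", "degree");
        String alternatives = "(Arts|Laws|Science|Engineering)";

        // One regex for "Bachelor's": no single token equals it.
        System.out.println(matches("Bachelor's of " + alternatives, tokens));
        // Separate regexes for "Bachelor" and "'s": each lines up with one token.
        System.out.println(matches("Bachelor 's of " + alternatives, tokens));
    }
}
```

Under these assumptions, main prints false for the first rule and true for the second, which mirrors the behaviour described above.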
