mmihaltz

Reputation: 101

Stanford TokensRegex: how to set normalized annotation using normalized output of NER annotation?

I am creating a TokensRegex annotator to extract the number of floors a building has (just an example to illustrate my question). I have a simple pattern that will recognize both "4 floors" and "four floors" as instances of my custom entity "FLOORS". I would also like to add a NormalizedNER annotation, using the normalized value of the number entity used in the expression, but I can't get it to work the way I want to:

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
normalized = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NormalizedNamedEntityTagAnnotation" }
tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }

ENV.defaults["ruleType"] = "tokens"

{
  pattern: ( ( [ { ner:NUMBER } ] ) /floor(s?)/ ),
  action: ( Annotate($0, ner, "FLOORS"), Annotate($0, normalized, $$1.text) ) 
}

The rules above only set the NormalizedNER fields in the output to the text value of the number, "4" and "four" for the above examples respectively. Is there a way to use the NUMBER entity's normalized value ("4.0" both for "4" and "four") as the normalized value for my "FLOORS" entity?

Thanks in advance.

Upvotes: 0

Views: 677

Answers (3)

mmihaltz

Reputation: 101

The correct answer is based on @AngelChang's answer and comment; I'm just posting it here for the sake of orderliness.

The rule has to be modified so the 2nd Annotate() action's 3rd parameter is $1[0].normalized:

{
  pattern: ( ( [ { ner:NUMBER } ] ) /floor(s?)/ ),
  action: ( Annotate($0, ner, "FLOORS"), Annotate($0, normalized, $1[0].normalized) ) 
}

According to @Angel's comment:

$1[0].normalized is the "normalized" field of the 0th token of the 1st capture group (as a CoreLabel). $$1 gives you back the MatchedGroupInfo, which has the "text" field but not the "normalized" field (since that is on the actual token).
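
Putting it together, the complete rules file would look roughly like the sketch below. It only combines the definitions from the question with the modified action. It assumes the NER annotator has already run, so that NUMBER tokens carry a NormalizedNER value, and that # comment lines are accepted in the rules file, as in the CoreNLP sample rules.

```
# Short names for the CoreNLP annotation keys used in the rule
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
normalized = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NormalizedNamedEntityTagAnnotation" }
tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }

ENV.defaults["ruleType"] = "tokens"

# $0 is the whole match; capture group 1 is the single NUMBER token.
# $1[0].normalized reads the normalized field of that token,
# whereas $$1.text would only give the matched surface text.
{
  pattern: ( ( [ { ner:NUMBER } ] ) /floor(s?)/ ),
  action: ( Annotate($0, ner, "FLOORS"), Annotate($0, normalized, $1[0].normalized) )
}
```

With this rule, the FLOORS entity should carry whatever normalized value the NUMBER token itself has, which per the question is "4.0" for both "4 floors" and "four floors".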

Upvotes: 0

Omer Hassan

Reputation: 161

With $$1.normalized as you suggested, running on the input "The building has seven floors" yields the following error message: Annotating file test.txt { Error extracting annotation from seven floors }

It might be because the NormalizedNamedEntityTagAnnotation key is not already present for the token represented by $$1. I suppose that, before running TokensRegex, you'd want to make sure that your numeric tokens (either "four" or "4" in this case) have the corresponding normalized value ("4.0" in this case) set under their NormalizedNamedEntityTagAnnotation key.

Also, could you please direct me to where I can find more information on the possible 3rd arguments of Annotate()? Your Javadoc page for TokensRegex expressions doesn't list $$n.normalized (perhaps it needs updating?)

I believe that $$n.normalized would retrieve the value which, in Java code, would be the equivalent of coreLabel.get(edu.stanford.nlp.ling.CoreAnnotations.NormalizedNamedEntityTagAnnotation.class), where coreLabel is of type CoreLabel and corresponds to $$n in TokensRegex. This is because of the following line in your TokensRegex rules: normalized = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NormalizedNamedEntityTagAnnotation" }

Upvotes: 0

Angel Chang

Reputation: 364

Try changing

action: ( Annotate($0, ner, "FLOORS"), Annotate($0, normalized, $$1.text) )

to

action: ( Annotate($0, ner, "FLOORS"), Annotate($0, normalized, $$1.normalized) )

Annotate takes three arguments:

  • arg1 = object to annotate (typically the matched tokens indicated by $0)
  • arg2 = annotation field
  • arg3 = value (in this case you want the NormalizedNER field instead of the text field)

Upvotes: 1
