Reputation: 2363
I'm using RUTA and have written many different rules that extract the same entity. For example, I want to extract "toilet paper factory". At the moment my rules produce three matches: "toilet paper factory", "paper factory" and "factory". But I'm only interested in the longest match.
I've created a minimal example:
DECLARE Test;
(CW CW) {-> CREATE(Test)};
(CW CW CW) {-> CREATE(Test)};
And my test string:
lower lower Upper Upper Upper lower Upper
The rules above will match Upper Upper and Upper Upper Upper. But in this case I'm only interested in the result of the last rule.
Is it possible to remove the shorter matches?
Upvotes: 2
Views: 209
Reputation: 3113
There are several options to avoid the additional matches and to remove the additionally created annotations.
You can remove the additional annotations with something like:
Test->{ANY t:@Test{-> UNMARK(t)};t:@Test{-> UNMARK(t)} ANY;};
This block rule visits every Test annotation and applies two inner rules within its span. Each inner rule looks for a Test annotation that is preceded or followed by some other token inside that span, which means it is shorter than the enclosing annotation. If one is found, it is removed.
There is also the PARTOFNEQ condition, but it is rather slow:
Test{PARTOFNEQ(Test)->UNMARK(Test)};
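Combined with the rules from the question, this cleanup could look like the following sketch. It is untested here and assumes default Ruta settings; the comments describe the intended effect rather than a verified result.

```
DECLARE Test;
// Create candidates of both lengths, as in the question.
(CW CW) {-> CREATE(Test)};
(CW CW CW) {-> CREATE(Test)};
// Remove every Test that lies inside another Test with different offsets.
Test{PARTOFNEQ(Test) -> UNMARK(Test)};
```

For the test string from the question, the intent is that only the Test annotation covering Upper Upper Upper remains.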
If you want to avoid creating the shorter annotations in the first place, you need to change the order of the rules and apply the more specific one first. You can alter the matching process in many ways, e.g., with a PARTOF condition, a MARKONCE action, or the GREEDYANCHORING setting.
An example:
(CW{-PARTOF(Test)} CW CW) {-> CREATE(Test)};
(CW{-PARTOF(Test)} CW) {-> CREATE(Test)};
In your example you could of course do something like:
CW[2,3]{-PARTOF(Test)-> Test};
but this is probably not the idea behind this question.
DISCLAIMER: I am a developer of UIMA Ruta
Upvotes: 1