Reputation: 67
I am new to UIMA Ruta. I am trying to implement a basic task: matching an alphanumeric string of a specific length, e.g. 123Abcd.
I tried the below code:
DECLARE VarA;
ANY{REGEXP("([A-Za-z0-9]{7})")->MARK(VarA)};
It's not working as expected. Please let me know what I am doing wrong. The same regex works in other regex engines, just not in Ruta.
Thanks in advance.
Upvotes: 1
Views: 276
Reputation: 1573
This is because Ruta splits the document into small fragments (tokens, or basic annotations; see this). The default seeder splits a word into separate tokens when it combines digits and letters. You can replace the default seeder with your own implementation that behaves differently.
Your example "123Abcd" is split into the following annotations (not all levels are listed; see the link for more information):
Document -> Complete document "123Abcd"
NUM -> 123
CW -> Abcd
Another example of input "45 abcd 5" becomes:
Document -> Complete document "45 abcd 5"
NUM -> 45
SPACE -> The space between 45 and abcd // Not visible by default
SW -> abcd
SPACE -> The space between abcd and 5 // Not visible by default
NUM -> 5
In your rule, the regular expression is matched against each ANY token separately. The document contains two such tokens (NUM and CW), so the pattern doesn't match: the text is not one token but is split in two.
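To make the effect concrete, here is a minimal Python sketch that imitates the splitting described above. It is only a rough stand-in: the `TOKEN` pattern and the helper names `tokenize` and `tokens_matching` are my own assumptions for illustration, not Ruta's actual seeder or API.

```python
import re

# Rough stand-in for the default seeder (an assumption, not Ruta's real code):
# digits, capitalized words, lowercase words and whitespace become separate tokens.
TOKEN = re.compile(
    r"(?P<NUM>[0-9]+)|(?P<CW>[A-Z][a-z]*)|(?P<SW>[a-z]+)"
    r"|(?P<SPACE>\s+)|(?P<OTHER>[^A-Za-z0-9\s])"
)
TARGET = re.compile(r"[A-Za-z0-9]{7}")


def tokenize(text):
    """Return (type, text) pairs, similar to the token lists shown above."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]


def tokens_matching(text):
    """Apply the regex to each token on its own, as the ANY rule does."""
    return [tok for _, tok in tokenize(text) if TARGET.fullmatch(tok)]


print(tokenize("123Abcd"))                # [('NUM', '123'), ('CW', 'Abcd')]
print(tokens_matching("123Abcd"))         # []
print(bool(TARGET.fullmatch("123Abcd")))  # True
```

Neither token is seven characters long, so no single token can satisfy the pattern, while the same regex matches the unsplit string without any problem.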
You can do things like the following example to get the correct results:
DECLARE VarA, VarB, VarC, VarD;
// Option 1 (execute the regex on the complete input document)
// I think this is not a good solution because it can be slow
Document{REGEXP("([A-Za-z0-9]{7})") -> MARK(VarA)};
// Option 2 (match with regex on each annotation type)
(NUM{REGEXP("[0-9]{3}")} CW{REGEXP("[a-zA-Z]{4}")}){ -> MARK(VarB)};
// Option 3 (first match a pattern of annotations and then match the
// regex on the complete pattern)
(NUM CW){REGEXP("([A-Za-z0-9]{7})") -> MARK(VarC)};
// Option 4 (only check if it's a "number + capital word")
(NUM CW){ -> MARK(VarD)};
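The idea behind Option 3 (first find a NUM followed directly by a CW, then test the regex on the combined text) can be sketched in Python. This is a simulation under assumptions: the `TOKEN` pattern is a simplified stand-in for the default seeder, and `mark_num_cw` is a hypothetical helper, not part of Ruta.

```python
import re

# Simplified stand-in for the default seeder (assumption, not Ruta's real code).
TOKEN = re.compile(
    r"(?P<NUM>[0-9]+)|(?P<CW>[A-Z][a-z]*)|(?P<SW>[a-z]+)"
    r"|(?P<SPACE>\s+)|(?P<OTHER>[^A-Za-z0-9\s])"
)


def mark_num_cw(text, pattern):
    """Return the text of every NUM directly followed by a CW whose
    combined text matches the given pattern completely."""
    tokens = [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]
    marked = []
    for (kind_a, a), (kind_b, b) in zip(tokens, tokens[1:]):
        if kind_a == "NUM" and kind_b == "CW" and re.fullmatch(pattern, a + b):
            marked.append(a + b)
    return marked


print(mark_num_cw("id 123Abcd and 45 abcd 5", r"[A-Za-z0-9]{7}"))  # ['123Abcd']
```

Because the regex is only tried on spans that already have the right token shape, it never has to scan the whole document, which is the reason the Document-based option is considered slower.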
Upvotes: 3