Vamsi Unique

Reputation: 67

Check Alphanumeric of specific size UIMA RUTA

I am new to UIMA RUTA. I am trying to implement a basic task: matching alphanumeric strings of a specific length, e.g. 123Abcd (seven characters).

I tried the below code:

DECLARE VarA;
ANY{REGEXP("([A-Za-z0-9]{7})")->MARK(VarA)};

It's not working as expected. Please let me know what I am doing wrong. The same regex works in other regex engines, just not in RUTA.

Thanks in advance.

Upvotes: 1

Views: 276

Answers (1)

Jasper Huzen

Reputation: 1573

This is because Ruta splits the document into small fragments (tokens, or basic annotations; see the Ruta documentation on basic annotations). The default seeder splits a word into separate tokens if it combines digits and letters. You can replace the default seeder with your own seeder that behaves differently.

Your example "123Abcd" is split into the following tokens (not all levels are listed; see the Ruta documentation for more information):

Document -> Complete document "123Abcd"
NUM -> 123
CW -> Abcd

Another example of input "45 abcd 5" becomes:

Document -> Complete document "45 abcd 5"
NUM -> 45
SPACE -> The space between 45 and abcd // Not visible by default
SW -> abcd
SPACE -> The space between abcd and 5 // Not visible by default
NUM -> 5

In your example, the regular expression is applied to each ANY token separately. The document contains two ANY tokens (NUM and CW), so the pattern never matches: the text is not one token but has been split into two, and neither token is seven characters long.
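To make the reasoning above concrete, here is a small Python sketch (not Ruta code). It only approximates the splitting described in this answer; TOKEN_RE and tokenize are hypothetical stand-ins for the default seeder, not the real implementation. It also assumes that the regex has to match a token's whole text.

```python
import re

# Hypothetical stand-in for the default seeder (an assumption, not Ruta's
# real lexer): runs of digits, runs of letters, runs of whitespace, and
# any other character on its own each become one token.
TOKEN_RE = re.compile(r"[0-9]+|[A-Za-z]+|\s+|.")

# The pattern from the question.
TARGET = re.compile(r"[A-Za-z0-9]{7}")


def tokenize(text):
    """Split text into token strings, roughly as described in the answer."""
    return TOKEN_RE.findall(text)


tokens = tokenize("123Abcd")

# Checking the pattern against each token on its own (what ANY{REGEXP(...)}
# effectively does) fails, because no single token has seven characters.
per_token = [bool(TARGET.fullmatch(t)) for t in tokens]

# Checking the pattern against the text covered by both tokens succeeds.
combined = bool(TARGET.fullmatch("".join(tokens)))

print(tokens, per_token, combined)
```

Under these assumptions no single token satisfies the pattern, but the text covered by the token sequence does. That is why the rules that match a sequence such as (NUM CW) behave differently from ANY.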

You can do things like the following example to get the correct results:

DECLARE VarA, VarB, VarC, VarD;

// Option 1 (execute the regex on the complete input document).
// I think this is not a good solution because it can be slow.
Document{REGEXP("([A-Za-z0-9]{7})") -> MARK(VarA)};

// Option 2 (match with regex on each annotation type)
(NUM{REGEXP("[0-9]{3}")} CW{REGEXP("[a-zA-Z]{4}")}){ -> MARK(VarB)};

// Option 3 (first match a pattern of annotations and then match the 
// regex on the complete pattern)
(NUM CW){REGEXP("([A-Za-z0-9]{7})") -> MARK(VarC)};

// Option 4 (only check that it's a number followed by a capitalized word)
(NUM CW){ -> MARK(VarD)};

Upvotes: 3
