user3778893

Reputation: 125

UIMA RUTA : regular expression in WORDLIST

Is there any way to have regular expressions in WORDLIST? I need to implement the same as mentioned in https://issues.apache.org/jira/browse/UIMA-3382.

Or is there any alternate way to resolve it?

EDIT: WORDLIST is defined as a list of text items. What if I have a list of regular expressions that I want to mark as the same type? Is there a way to do that?

For example, I want to find dates in a document, but dates come in many formats, so regular expressions are a more concise way to cover all the cases. I tried the syntax below, but the only matches were for entries consisting of a single word without any special regex syntax.

DECLARE Date;
WORDLIST DateFormatList='DateFormat.regex';
Document{-> MARKFAST(Date, DateFormatList, true, 1)};

What can I change in the rules so that the items in DateFormatList are treated as regular expressions?

Thanks

Upvotes: 4

Views: 614

Answers (1)

Peter Kluegl

Reputation: 3113

Regular expressions in wordlists will not be supported in the near future unless a volunteer implements them. The problem is that wordlists use a trie rather than an FST (finite-state transducer) for the lookup, which makes the desired functionality far from straightforward to implement.

It is possible to simulate the desired functionality with wordlists in some rare situations, e.g., for optional sequences.

If you want to detect dates, I would actually recommend using normal rules in UIMA Ruta. Rules are easier to combine and can exploit other information. The common example is a very simple rule like this:

ANY{INLIST(MonthsList) -> MARK(Month), MARK(Date,1,3)} 
PERIOD? NUM{REGEXP(".{2,4}") -> MARK(Year)};

If you want to stick to regular expressions, then you can use a list of simple regexp rules:

"regexp1" -> Date;
"regexp2" -> Date;
"regexp3" -> Date;

These rules also support feature assignments and capturing groups. Compared with the functionality you asked for, the differences are the syntax (several rules instead of a simple list) and the performance (the regular expressions are applied sequentially).
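
To make the placeholder rules above more concrete, here is a minimal sketch of what two such regexp rules for dates could look like. The two date formats, the patterns, and the type names Day, Month, and Year are assumptions chosen for illustration, not taken from the question. The "group number = Type" assignments follow the capturing-group syntax of Ruta's regular expression rules as I remember it, so check it against the documentation of your Ruta version.

```
DECLARE Date, Day, Month, Year;

// ISO-like dates such as 2014-03-21 (assumed format).
// Group 0 is the whole match; groups 1 to 3 are year, month, day.
"([0-9]{4})-([0-9]{2})-([0-9]{2})" -> 0 = Date, 1 = Year, 2 = Month, 3 = Day;

// Dotted dates such as 21.03.2014 (assumed format).
// "[.]" matches a literal dot without needing backslash escapes in the string.
"([0-9]{1,2})[.]([0-9]{1,2})[.]([0-9]{4})" -> 0 = Date, 1 = Day, 2 = Month, 3 = Year;
```

Each rule is matched against the document text on its own, so adding another date format means adding another rule rather than another line in a list file.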

(I am a developer of UIMA Ruta)

Upvotes: 1
