Reputation: 178
Now, this one is quite a challenge for me.
I've got around 1000 queries in a file, all following a similar pattern:
*XYZ#PQR#*
Here, # indicates any number of non-whitespace characters.
I've already coded a piece that can read a line like the above and generate a corresponding regex.
However, there are around 100,000 candidate strings and, as mentioned, around 1000 such queries to be evaluated against them.
This makes my code quite computationally expensive, as the work grows on the order of m*n.
I've looked into ANTLR and found that the learning curve is quite steep. Although it sounds promising, in some corner of my mind I am still doubtful whether this can be achieved using ANTLR. Please share your opinion or any other viable solution.
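For illustration, a minimal sketch of the brute-force approach (hypothetical Java, not my exact code): each query is compiled into a regex with # mapped to \S*, and every candidate is then tested against every compiled pattern, which is where the m*n cost comes from.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class BruteForceMatch {
    // Hypothetical translation of a query like *XYZ#PQR#* into a regex:
    // strip the surrounding asterisks and map # to \S* (a run of non-whitespace).
    static Pattern toPattern(String query) {
        String core = query.replace("*", "").replace("#", "\\S*");
        return Pattern.compile(".*" + core + ".*");
    }

    public static void main(String[] args) {
        List<String> queries = List.of("*XYZ#PQR#*");       // ~1000 in practice
        List<String> candidates = List.of("XYZfooPQRbar");  // ~100,000 in practice

        List<Pattern> patterns = new ArrayList<>();
        for (String q : queries) patterns.add(toPattern(q));

        // The m*n matching loop: every candidate against every pattern.
        for (String c : candidates) {
            for (Pattern p : patterns) {
                if (p.matcher(c).matches()) {
                    System.out.println(c + " matches " + p.pattern());
                }
            }
        }
    }
}
```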
Upvotes: 1
Views: 718
Reputation: 178
Done with it. With plain regexes it took an hour; with Lucene, using WildcardQueries and a BooleanQuery to handle the permutations, the work was done in 11 minutes. I wish I had the time to learn Flex in a week, but Lucene is a good option for large data sets, wildcard matching and crunching. It may not always solve your problem, but it is another viable solution.
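For anyone taking the same route, a rough sketch of the Lucene setup (illustrative only; the field name, sample data and wildcard pattern are made up, and it assumes a reasonably recent Lucene release with BooleanQuery.Builder and ByteBuffersDirectory):

```java
import java.util.List;

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneWildcardSketch {
    public static void main(String[] args) throws Exception {
        // Index every candidate string once, un-analyzed, in a "text" field.
        List<String> candidates = List.of("XTelecomFooServiceY", "SomethingElse");
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()))) {
            for (String c : candidates) {
                Document doc = new Document();
                doc.add(new StringField("text", c, Field.Store.YES));
                writer.addDocument(doc);
            }
        }

        // A query like *Telecom#Servic#* becomes a WildcardQuery
        // (# approximated here by Lucene's *, which also matches whitespace);
        // several variants of a query can be ORed together in one BooleanQuery.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            BooleanQuery.Builder bq = new BooleanQuery.Builder();
            bq.add(new WildcardQuery(new Term("text", "*Telecom*Servic*")),
                   BooleanClause.Occur.SHOULD);
            for (ScoreDoc hit : searcher.search(bq.build(), 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("text"));
            }
        }
    }
}
```

A WildcardQuery with a leading * can still be expensive per query, but indexing the 100,000 candidates once and reusing that index across all 1000 queries is what brought the total time down.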
Upvotes: 1
Reputation: 95324
It seems to me that you have what amounts to a thousand individual regular expressions, r1, r2, ..., r1000, which identify some fixed set of outcomes A, B, C, ... (much smaller than the number of individual regular expressions).
In that case, you can logically combine the regular expressions a1, a2, ..., an for outcome A, and b1, ..., bm for outcome B. (The ability to disjunctively compose regular expressions and still get a regular expression is a well-known theoretical property of regular expressions.)
Most systems for expressing regular expressions (maybe not yours) allow you to write this as
a1 | a2 | ... | an --> A
or some equivalent syntax. Such systems are often associated with so-called lexer generators, which allow compiler writers to express the fine-grained syntax of tokens in terms of characters.
A great advantage of such tools is that the effort to match (all the regular expressions for) the tokens is often sublinear with respect to the number of regular expressions, made possible by computing a finite-state machine in which prefixes shared by some set of regular expressions are recognized just once for the set. This can mean huge speedups and applies directly to situations such as yours.
The widely available tool FLEX does this very efficiently. ANTLR has some kind of mechanism for recognizing tokens expressed as regular expressions, but I don't know if it generates efficient finite-state matchers.
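Outside a lexer generator, the disjunctive composition described above can at least be approximated with an ordinary regex library. The sketch below (made-up expressions, Java's java.util.regex) ORs the patterns for each outcome into one compiled regex with a named group per outcome, so each candidate is scanned once rather than once per individual expression. Note that java.util.regex is a backtracking engine, not a DFA, so this only illustrates the composition; a tool like FLEX is what gives you the sublinear, shared-prefix matching.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CombinedRegexDemo {
    public static void main(String[] args) {
        // Hypothetical expressions: a1 | a2 identify outcome A, b1 identifies outcome B.
        String a = "XYZ\\S*PQR|ABC\\S*DEF";   // a1 | a2
        String b = "FOO\\S*BAR";              // b1
        // One combined pattern with a named group per outcome.
        Pattern combined = Pattern.compile("(?<A>" + a + ")|(?<B>" + b + ")");

        for (String candidate : new String[] {"xxXYZooPQRyy", "zzFOO-BARzz", "nothing"}) {
            Matcher m = combined.matcher(candidate);
            if (m.find()) {
                // Only the group that actually matched is non-null.
                String outcome = m.group("A") != null ? "A" : "B";
                System.out.println(candidate + " -> outcome " + outcome);
            } else {
                System.out.println(candidate + " -> no match");
            }
        }
    }
}
```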
Upvotes: 2
Reputation: 1543
I think there is no need for ANTLR, as a simple string find-and-replace is possible: replace # with .* and remove the asterisks.
So for *Telecom#Servic#* you get Telecom.*Servic.*
You can also add ^ and $ to check the beginning/end of the string.
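In Java, that find-and-replace might look like this (illustrative sketch; the method name and sample strings are made up):

```java
import java.util.regex.Pattern;

public class SimpleQueryToRegex {
    // Illustrative translation: *Telecom#Servic#* -> Telecom.*Servic.*
    static Pattern toRegex(String query) {
        String core = query.replace("*", "").replace("#", ".*");
        // Add ^ and $ around core here if the match must cover the whole string.
        return Pattern.compile(core);
    }

    public static void main(String[] args) {
        Pattern p = toRegex("*Telecom#Servic#*");
        System.out.println(p.matcher("XTelecomFooServiceY").find());  // true
    }
}
```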
Upvotes: 0