Jack Edmonds
Jack Edmonds

Reputation: 33171

How do I test against a large number of regular expressions quickly and know which one matched?

I'm writing a program in .net where the user may provide a large number of regular expressions. For a given string, I need to figure out which regular expression matches that string (if more than one matches, I just need the first one that matches). However, if there are a large number of regular expressions this operation can take a very long time.

I was somewhat hoping there would be something similar to flex (The Fast Lexical Analyzer (not Adobe Flex)) for .net that would allow me to specify a large number of regular expressions yet quickly (O(n) according to Wikipedia for n = len(input string)) figure out which regular expression matches.

Also, I would prefer not to implement my own regular expression engine :).

Upvotes: 3

Views: 978

Answers (3)

Mark Byers
Mark Byers

Reputation: 838156

What? Even testing if a single regular expression matches cannot be done in O(n) time in general. Where did you get this information from? What feature is this in Flex? I am sure it must be some limited form of regular expressions, not for arbitrary .NET regular expressions.

To handle arbitrary regular expressions the simple way is to keep your regular expressions stored in a List and simply iterate over each regular expression one by one until you find one that matches.

Upvotes: 1

Christian Semrau
Christian Semrau

Reputation: 9013

A quick web search shows that there exists a lex like tool named C#Lex. But since I do not use .NET or C#, I cannot say whether it is good, and if it is useful for you.

For Java, I found JLex and JFlex, both of which generate source code. Using these seems only reasonable if the regular expressions are literally compiled "off-line" (outside the application), and then incorporated into your application classpath. The .NET version probably behaves similarly.

Upvotes: 0

Charles
Charles

Reputation: 11479

Find the biggest chunk of constant text in each regex (if above a certain threshold length) and use the Karp-Rabin algorithm to search for any of those strings simultaneously. For each match, run that regex to see if the whole thing matches. For each regex not included in the multi string search, search that regex directly.

This should give you good performance for a large number of regular expressions if they have reasonable-length constant substrings, assuming you have preprocessing time available for the regular expressions.

Upvotes: 1

Related Questions