Reputation: 5153

Is there any way to optimize a generic regular expression?

I code in Eclipse, and when I press Ctrl+F to find a string, I see that besides the standard options such as whole word and case sensitive, there is also an option for regular expression search (Notepad++ has it too).

I have tried it once or twice, and the results are generally almost instantaneous. But then, my code files are not huge: the biggest are no more than 500 lines long, with most lines less than half filled. Is there any way to optimize the search so that any user-supplied regex runs much faster on a large piece of text, say 10-15 MB in size?

I can't think of any method, because no standard search algorithm like Rabin-Karp or a suffix tree would apply here!

Upvotes: 0

Answers (2)

fuch

Reputation: 156

I have no idea how regular expressions are implemented in Eclipse or why they would be slow. Here are just some thoughts:

First of all, there are two concepts you should know: the nondeterministic finite automaton (NFA) and the deterministic finite automaton (DFA). In theory, regular expressions, NFAs, and DFAs are equivalent, which means they have exactly the same power to describe languages (sets of character sequences). This implies that any one of them can be converted to another (see this site).

A regular expression can be implemented by converting it to a DFA, and matching text with a DFA takes only time linear in the length of the text (many string matching algorithms, e.g. KMP, are actually special DFAs). The trouble is that most modern regular expression implementations have introduced features such as backreferences, which make it impossible to use a DFA.
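To make the "KMP is a special DFA" remark concrete, here is a minimal sketch in Python. It builds a transition table for a literal pattern and scans the text once, one table lookup per character. The function names `build_kmp_dfa` and `dfa_search` are my own and purely illustrative; this says nothing about how Eclipse does it.

```python
def build_kmp_dfa(pattern):
    """Build dfa[state][char] -> next state for a non-empty literal pattern."""
    alphabet = set(pattern)
    dfa = [dict() for _ in range(len(pattern))]
    dfa[0][pattern[0]] = 1
    restart = 0  # state reached after reading pattern[1:j]
    for j in range(1, len(pattern)):
        for ch in alphabet:
            # On a mismatch, behave like the restart state would.
            dfa[j][ch] = dfa[restart].get(ch, 0)
        dfa[j][pattern[j]] = j + 1  # on a match, advance
        restart = dfa[restart].get(pattern[j], 0)
    return dfa


def dfa_search(pattern, text):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    dfa = build_kmp_dfa(pattern)
    accept = len(pattern)
    state = 0
    for i, ch in enumerate(text):
        # Characters that do not occur in the pattern reset to state 0.
        state = dfa[state].get(ch, 0)
        if state == accept:
            return i - accept + 1
    return -1
```

The scan never moves backwards in the text, which is where the linear running time comes from.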

So, if discarding those complex features is acceptable, implementing a fast regular expression engine is feasible (matching in linear time). You may find more in this article.
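As a rough illustration of matching without backtracking, the sketch below handles only a tiny subset of regex syntax (literal characters, `.`, and a postfix `*` on a single character) and tracks the set of pattern positions that are still alive, which is how an NFA simulation works. The names `compile_simple`, `closure`, and `full_match` are hypothetical, and real engines support far more syntax than this.

```python
def compile_simple(pattern):
    """Split the pattern into (char, starred) tokens."""
    tokens = []
    i = 0
    while i < len(pattern):
        starred = i + 1 < len(pattern) and pattern[i + 1] == "*"
        tokens.append((pattern[i], starred))
        i += 2 if starred else 1
    return tokens


def closure(tokens, states):
    """Add every position reachable by skipping starred tokens."""
    result = set()
    for s in states:
        result.add(s)
        while s < len(tokens) and tokens[s][1]:
            s += 1
            result.add(s)
    return result


def full_match(pattern, text):
    """Return True if the whole text matches the pattern."""
    tokens = compile_simple(pattern)
    current = closure(tokens, {0})
    for ch in text:
        following = set()
        for s in current:
            if s == len(tokens):
                continue  # end of pattern cannot consume more input
            tok, starred = tokens[s]
            if tok == "." or tok == ch:
                # A starred token may repeat, so it stays in place.
                following.add(s if starred else s + 1)
        current = closure(tokens, following)
        if not current:
            return False
    return len(tokens) in current
```

Each input character is processed once against at most `len(tokens) + 1` states, so the cost is bounded by text length times pattern length. A pattern like `a*a*a*b` run against a long string of `a` characters, which can be slow in a backtracking engine, stays cheap here.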

Upvotes: 1

Akinakes

Reputation: 657

What makes you think a suffix tree isn't a suitable algorithm for this problem? From http://en.wikipedia.org/wiki/Suffix_tree:

Once [the suffix tree is] constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc.

I think a modified Boyer–Moore string search algorithm would also be possible.
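One related idea, which is not Boyer–Moore itself but is in the same spirit, is to use a fast literal substring search as a prefilter and run the regex only where a required literal occurs. The sketch below assumes the caller supplies a literal that every match must contain (the `required_literal` parameter is hypothetical; extracting it from the regex automatically is a separate problem) and that matches never span multiple lines.

```python
import re


def search_lines_with_prefilter(regex, required_literal, text):
    """Return (line_number, line) pairs for lines where regex matches.

    Assumption: every match of regex contains required_literal, so lines
    without that literal can be skipped without running the regex at all.
    """
    compiled = re.compile(regex)
    results = []
    for number, line in enumerate(text.splitlines(), start=1):
        if required_literal not in line:
            continue  # cheap substring test rejects most lines
        if compiled.search(line):
            results.append((number, line))
    return results
```

For example, searching for `foo = \d+` with the literal `foo = ` only runs the regex on lines that contain `foo = `. How much this helps depends on how rare the literal is in the text.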

Upvotes: 0
