Lan
Lan

Reputation: 1204

Algorithmic complexity of Regular Languages in Extended Regular Language frameworks

I have a bit of a background in Formal Languages and recently I found out that Java and other languages use what is coined Extended Regular Languages. Because of my background, I had always assumed in languages like Java when I call compile for Pattern it produced a DFA or Transducer in the background. As a result, I had always assumed no matter how ugly my regex, no matter how long my regex, Pattern.matches or similar methods would run in linear time. But that assumption seems to be incorrect.

A post I read seems to suggest some Regex expressions do run in linear time, but I don't entirely believe or trust one single person.

I'll eventually write my own Java Formal Regular Expression library (existing ones I've found only have GNU GPL licences) but in the meantime I have a few questions on the time complexity of Java/C# regexs. Want to ensure what I read elsewhere is correct.

Questions:

  1. A formal language like \sRT\s., would the match method of a regexes in Java/C# solve membership in linear or non-linear time?
  2. In general, how would I know if a given Regular Language's expression membership problem is linear time for a Regex?

I do text analysis, finding out Java regexes aren't DFA's was really a downer.

Upvotes: 8

Views: 690

Answers (3)

Christopher Z
Christopher Z

Reputation: 952

Java's implementation of regular expression uses the NFA approach. Here is a link that explains it clearly.

Basically, a badly written but still correct regular expression can cause the engine to perform badly. For example, given the expression (a+a+)+b and the string aaaaaaaaaaaaaaa. It can take a while (depending on your machine, from several seconds to minutes) to figure out that there is no match.

The worst case performance of NFA is an almost match situation (given example). Because the expression forces the engine to explore every path (with lots of backtracking) to determine a non-match.

Upvotes: 2

Eric Lippert
Eric Lippert

Reputation: 660417

Because of my background, I had always assumed in languages like Java when I call compile for Pattern it produced a DFA or Transducer in the background.

This belief is common in academics. In practice, regular expression compilation does not produce a DFA and then execute it. I have only a small amount of experience in this; I worked briefly on the regular expression compilation system in Microsoft's implementation of JavaScript in the 1990s. We chose to compile the "regular" expression into a simple domain-specific bytecode language and then built an interpreter for that language.

As you note, this can lead to situations where repeated backtracking has exponentially bad time behavior in the length of the input, but the construction of the compiled state is essentially linear in the size of the expression.

So let me counter your questions with another two questions -- and I note that these are genuine questions, not rhetorical.

1) Every actually regular expression corresponds to an NDFA with let's say n states. The corresponding DFA might require superposition of up to 2n states. So what stops the time taken to construct the DFA from being exponential in pathological cases? The runtime might be linear in the input, but if the runtime is exponential in the pattern size then basically you're just trading one nonlinearity for another.

2) So-called "regular" expressions these days are nothing of the sort; they can do parenthesis matching. They correspond to pushdown automata, not nondeterministic finite automata. Is there a linear algorithm for constructing the corresponding pushdown automaton for a "regular" expression?

Upvotes: 6

Sean
Sean

Reputation: 4470

Regular expressions are implemented in many languages as NFAs to support backtracking (see http://msdn.microsoft.com/en-us/library/e347654k(v=vs.110).aspx). Because of backtracking, you can construct regular expressions which have terrible performance on some strings (see http://www.regular-expressions.info/catastrophic.html)

As far as analysing regular expressions to determine their performance, I doubt there is a very good way in general. You can look for warning flags, like compounded backtracking in the second link's example, but even that may be hard to detect properly in some cases.

Upvotes: 3

Related Questions