Reputation: 26402
I was reading this article today on two different regular expression algorithms.
According to the article old Unix tools like ed, sed, grep, egrep, awk, and lex, all use what's called the Thompson NFA algorithm in their regular expresssions...
However newer tools like Java, Perl, PHP, and Python all use a different algorithm for their regular expressions that are much, much slower.
This article makes no mention at all of Javascript's regex algorthim, (and yes I know there are various JS engines out there) but I was wondering if anybody knew which of those algorithms they use, and if maybe those algorithms should be swapped out for Thompson NFA.
Upvotes: 17
Views: 5090
Reputation: 22009
Though the ECMA standard does not specify the algorithm an ECMAScript implementation should use, the fact that the standard mandates that ECMAScript regular expressions must support backreferences (\1, \2, etc.) rules out the DFA and "Thompson NFA" implementations.
Upvotes: 10
Reputation: 112386
The Javascript ECMA language description doesn't impose a requirement for the particular implementation of regular expressions, so that part of the question isn't well-formed. You're really wondering about the particular implementation in a particular browser.
The reason Perl/Python etc use a slower algorithm, though, is that the regex language defined isn't really regular expressions. A real regular expression can be expressed as a finite state machine, but the language of regex is context free. That's why the fashion is to just call it "regex" instead of talking about regular expressions.
Yes, in fact javascript regex isn't content free regular. Consider the syntax using `{n,m}', that is, matches from n to m accepted regexs. Let d the difference d=|n-m|. The syntax means there exists a string uxdw that is acceptable, but a string uxk>dw that is not. It follows via the pumping lemma for regular languages that this is not a regular language.
(augh. Thinko corrected.)
Upvotes: 9
Reputation: 64919
Perl uses a memoized recursive backtracking search and, as of some improvements in 5.10, no longer blows up on perl -e '("a" x 100000) =~ /^(ab?)*$/;'
. In recent tests I performed on an OS X box, Perl 5.10 outperformed awk
, even in the cases where awk
's algorithm was supposed to be better.
Upvotes: 3