Reputation: 46341
I've come across an article that shows that regular expression matching is commonly implemented using a potentially under-performing algorithm vs. the suggested Thompson NFA algorithm.
With that in mind, how is this implemented in Node or V8? Is it possible to improve performance using a JS implementation of Thompson NFA, perhaps if only a limited subset of the features is used (perhaps removal of lookahead or other "advanced" features)?
Upvotes: 3
Views: 3082
Reputation: 13105
As mentioned in this announcement from Chrome's development team, V8 engine uses Irregexp regular expression engine:
Here are some quotes on the implementation of this engine:
A fundamental decision we made early in the design of Irregexp was that we would be willing to spend extra time compiling a regular expression if that would make running it faster. During compilation Irregexp first converts a regexp into an intermediate automaton representation. This is in many ways the "natural" and most accessible representation and makes it much easier to analyze and optimize the regexp. For instance, when compiling /Sun|Mon/ the automaton representation lets us recognize that both alternatives have an 'n' as their third character. We can quickly scan the input until we find an 'n' and then start to match the regexp two characters earlier. Irregexp looks up to four characters ahead and matches up to four characters at a time.
After optimization we generate native machine code which uses backtracking to try different alternatives. Backtracking can be time-consuming so we use optimizations to avoid as much of it as we can. There are techniques to avoid backtracking altogether but the nature of regexps in JavaScript makes it difficult to apply them in our case, though it is something we may implement in the future.
So V8 does compile to a native automaton representation - though it does not use Thompson NFA.
As to performance, this article compares V8 regexp performance with other libraries/languages.
Upvotes: 11