Zeeshan Jan
Zeeshan Jan

Reputation: 307

Learning Regular Expressions Backtracking in Google Chrome

I am trying to learn and understand the regular expressions in javascript and would like to understand the concept of backtracking for regular expressions in javascript. Can anyone point me to the source code or refer me to the algorithm which javascript in Google Chrome (V8 Engine) uses to parse regular expressions and to check how does it backtrack. As Google's V8 Engine is Open Source this must not be difficult.

Upvotes: 0

Views: 152

Answers (1)

nhahtdh
nhahtdh

Reputation: 56819

The source code of V8 Engine is not exactly a friendly place to start learning about backtracking.

At first glance and from my experience reading Java's implementation of Pattern class, the file trunk/src/x87/regexp-macro-assembler-x87.cc contains the source code of JS RegExp. You are basically reading assembly at the level present in the source code.

You might be interested in trunk/src/runtime/runtime-regexp.cc, which contains the implementation of methods of RegExp. The code doesn't contain anything about the inner working of RegExp, though.


The concept of backtracking is similar to search algorithms. Since you don't know the result, you would perform a brute force search, but depending on how you define the search order, you can arrive at the result faster or slower.

For concatenation, each node connects to the next one in a linear manner. There is no branching, so there is no backtracking.

For a branch P1|P2|...|Pn, you can think of it as a search tree with n nodes, where you will try the node P1 first, then P2, ... and finally Pn. All n nodes in the branch connects to the sequel atom, so if any of the node succeeds, it will move on to the sequel atom, and only backtrack for the next node in the branch when all possibilities on the next atom has been exhausted.

For (greedy) quantifier 0 or more A*, you can think of it as a node with 2 branches, one branch to A then back to itself, and the other branch going to the next atom. The branch to A will be tried first. Note that this is a simplified description, since the engine also has to deal with 0-length matches.

For (lazy) quantifier 0 or more A*?, it is basically the same as above, except that the branch to the next atom will be tried first.

When you add upper bound and lower bound to quantifier, you can imagine a counter being added to record how many times A has been matched. Depending on the counter, either of the branches will be unavailable for branching at certain counter.

So you will perform a search using the search tree above, and every time you got stuck (can't reach the goal state of the tree), you will backtrack and try the other branch.

Hope this help getting you started on the concept of backtracking.

Upvotes: 1

Related Questions