adam-p
adam-p

Reputation: 680

End-of-string regex match too slow

Demo here. The regex:

([^>]+)$

I want to match text at the end of a HTML snippet that is not contained in a tag (i.e., a trailing text node). The regex above seems like the simplest match, but the execution time seems to scale linearly with the length of the match-text (and has causes hangs in the wild when used in my browser extension). It's also equally slow for matching and non-matching text.

Why is this seemingly simple regex so bad?

(I also tried RegexBuddy but can't seem to get an explanation from it.)

Edit: Here's a snippet for testing the various regexes (click "Run" in the console area).
Edit 2: And a no-match test.

Upvotes: 5

Views: 912

Answers (2)

georg
georg

Reputation: 214969

Consider an input like this

abc<def>xyz

With your original expression, ([^>]+)$, the engine starts from a, fails on >, backtracks, restarts from b, then from c etc. So yes, the time grows with size of the input. If, however, you force the engine to consume everything up to the rightmost > first, as in:

.+>([^>]+)$

the backtracking will be limited by the length of the last segment, no matter how much input is before it.

The second expression is not equivalent to the first one, but since you're using grouping, it doesn't matter much, just pick matches[1].

Hint: even when you target javascript, switch to the pcre mode, which gives you access to the step info and debugger:

enter image description here

(look at the green bars!)

Upvotes: 5

blex
blex

Reputation: 25634

You could use the actual DOM instead of Regex, which is time consuming:

var html = "<div><span>blabla</span></div><div>bla</div>Here I am !";

var temp = document.createElement('div');
temp.innerHTML = html;
var lastNode = temp.lastChild || false;

if(lastNode.nodeType == 3){
    alert(lastNode.nodeValue);
}

Upvotes: 1

Related Questions