InfoStatus

Reputation: 7113

.NET Regular Expressions in Infinite Cycle

I'm using .NET Regular Expressions to strip HTML code.

Using something like:

<title>(?<Title>[\w\W]+?)</title>[\w\W]+?<div class="article">(?<Text>[\w\W]+?)</div>

This works 99% of the time, but sometimes, when parsing...

Regex.IsMatch(HTML, Pattern)

The call just blocks: execution stays on this line of code for several minutes, or indefinitely.

What's going on?

Upvotes: 1

Views: 329

Answers (3)

Jan Goyvaerts

Reputation: 21999

Your regex will work just fine when your HTML string actually contains HTML that fits the pattern. But when your HTML does not fit the pattern, e.g. if the last tag is missing, your regex will exhibit what I call "catastrophic backtracking": every lazy [\w\W]+? is expanded one character at a time, and each time the rest of the pattern fails, the engine backtracks and tries the next combination of positions before it can give up. Click that link and scroll down to the "Quickly Matching a Complete HTML File" section. It describes your problem exactly. [\w\W]+? is a complicated way of saying .+? with RegexOptions.Singleline.
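One way to keep a non-matching page from hanging the caller is to bound the match attempt with a timeout. The sketch below is not from the linked article; it assumes .NET Framework 4.5 or later, where Regex.IsMatch has an overload that takes a TimeSpan, and the class and method names (SafeMatch, IsMatchOrGiveUp) are made up for illustration. It also uses .+? with RegexOptions.Singleline in place of [\w\W]+?. This only caps the cost of the backtracking; it does not remove it.

```csharp
using System;
using System.Text.RegularExpressions;

static class SafeMatch
{
    // Pattern from the question, with . plus Singleline instead of [\w\W].
    public const string Pattern =
        @"<title>(?<Title>.+?)</title>.+?<div class=""article"">(?<Text>.+?)</div>";

    // Hypothetical helper: returns true or false when the match attempt
    // completes, or null when it is abandoned after the timeout.
    public static bool? IsMatchOrGiveUp(string html, TimeSpan timeout)
    {
        try
        {
            return Regex.IsMatch(html, Pattern, RegexOptions.Singleline, timeout);
        }
        catch (RegexMatchTimeoutException)
        {
            // The engine gave up; treat the page as "unknown" rather than block.
            return null;
        }
    }
}
```

A caller could then do something like SafeMatch.IsMatchOrGiveUp(HTML, TimeSpan.FromSeconds(2)) and log or skip the page when the result is null.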

Upvotes: 6

kͩeͣmͮpͥ ͩ

Reputation: 7846

You're asking your regex to do a lot there. After every character, it has to look ahead to see if the next bit of text can be matched with the next part of the pattern.

Regex is a pattern-matching tool. Whilst you can use it for simple parsing, you'd be better off using a dedicated parser (such as the HTML Agility Pack, as mentioned by Marc).

Upvotes: 1

Marc Gravell

Reputation: 1062705

With some effort, you can make regex work on HTML. However, have you looked at the HTML Agility Pack? This makes it much easier to work with HTML as a DOM, with support for XPath-type queries etc. (e.g. "//div[@class='article']").
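As a rough sketch of what that could look like for the title and article div from the question: this assumes the HtmlAgilityPack NuGet package is referenced and a compiler with C# 7 tuple support; the ArticleExtractor class and Extract method are hypothetical names, not part of the library.

```csharp
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is installed

static class ArticleExtractor
{
    // Hypothetical helper: returns the title and article text.
    // Either value is null when the corresponding node is not in the page.
    public static (string Title, string Text) Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches the XPath query.
        HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
        HtmlNode article = doc.DocumentNode.SelectSingleNode("//div[@class='article']");

        return (title?.InnerText, article?.InnerText);
    }
}
```

InnerText drops the markup inside the div; if the nested tags are needed, as with the regex's Text group, InnerHtml could be used instead. A missing node simply yields null here, rather than a long-running failed match.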

Upvotes: 3
