Jeff Putz

Reputation: 14907

.NET Regex fails to match in code, works in every testing harness

This one is a real head scratcher for me...

var matches = Regex.Matches("<p>test something<script language=\"javascript\">alert('hello');</script> and here's <b>bold</b> and <i>italic</i> and <a href=\"http://popw.com/\">link</a>.</p>", "</?(?!p|a|b|i)\b[^>]*>");

The regex is supposed to capture any HTML tag (open or close) that's not p, a, b, or i. I've plugged the input string and regex into countless testing pages, and every one of them returns the script tag (open and close) as matches. But it absolutely doesn't work in the code. The matches variable has a count of 0.

Am I missing something incredibly obvious?

Upvotes: 4

Views: 537

Answers (2)

Markus Jarderot

Reputation: 89171

(?! ) is a negative look-ahead. It matches zero characters if its contained pattern does not match from the current position.

(?!p|a|b|i)\b looks at the next character to see whether it matches p|a|b|i. If it does, the look-ahead fails and so does the match at that position, which means any tag whose name merely starts with p, a, b, or i (such as img or body) is rejected too. If the contained pattern fails to match, the look-ahead succeeds, and the engine tries to match the next token in the pattern, in this case a word boundary, from the same position.

What you want is probably something like this:

@"</?(?!(?:p|a|b|i)\b)\w+[^>]*>"

It looks ahead for something that matches (?:p|a|b|i)\b. If that pattern fails to match, the look-ahead succeeds, and the rest of the pattern matches at least one word character, followed by any number of characters up to the closing ">".
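To make the difference concrete, here is a minimal sketch that runs both patterns over the same input. The sample string is hypothetical (it is not from the question), and both patterns use verbatim strings so that \b reaches the regex engine as a word boundary.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input: img, body and pre start with i, b and p,
// but they are not the p, a, b or i tags themselves.
string sample = "<p>text</p><img src=\"x.png\"><body><pre>code</pre><b>bold</b>";

// Pattern from the question: the look-ahead inspects only the first letter of the tag name.
string original = @"</?(?!p|a|b|i)\b[^>]*>";

// Pattern from this answer: the look-ahead rejects only the whole names p, a, b and i.
string revised = @"</?(?!(?:p|a|b|i)\b)\w+[^>]*>";

int originalCount = Regex.Matches(sample, original).Count;
MatchCollection revisedMatches = Regex.Matches(sample, revised);

Console.WriteLine($"original: {originalCount} matches");
Console.WriteLine($"revised:  {revisedMatches.Count} matches");
foreach (Match m in revisedMatches)
{
    Console.WriteLine(m.Value);
}
```

On this input the original pattern should find nothing, while the revised one should report the img, body, pre, and closing pre tags and still skip p and b.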

Upvotes: 0

Guffa

Reputation: 700342

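You forgot to escape the backslash in the pattern string. In a regular C# string literal, \b is the escape sequence for the backspace character, so the regex engine never sees the word-boundary anchor.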

"</?(?!p|a|b|i)\\b[^>]*>"
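A short sketch of the three ways to write the pattern, using the input string from the question. The variable names are only illustrative; the verbatim-string form (@"...") is an alternative to doubling the backslash and is not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "<p>test something<script language=\"javascript\">alert('hello');</script> and here's <b>bold</b> and <i>italic</i> and <a href=\"http://popw.com/\">link</a>.</p>";

// Regular literal: the compiler turns \b into a backspace character (U+0008),
// so the regex searches for a literal backspace that the input does not contain.
string unescaped = "</?(?!p|a|b|i)\b[^>]*>";

// Escaped backslash: the regex engine receives \b, the word-boundary anchor.
string escaped = "</?(?!p|a|b|i)\\b[^>]*>";

// Verbatim literal: backslashes are kept as written, so no doubling is needed.
string verbatim = @"</?(?!p|a|b|i)\b[^>]*>";

int unescapedCount = Regex.Matches(input, unescaped).Count;
int escapedCount = Regex.Matches(input, escaped).Count;

Console.WriteLine(unescapedCount);      // 0
Console.WriteLine(escapedCount);        // 2 (the opening and closing script tags)
Console.WriteLine(escaped == verbatim); // True
```

Online regex testers take the pattern as raw text, which is why they never show this problem: no C# compiler has processed the backslash before the engine sees it.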

Upvotes: 8
