Ashish Gupta

Reputation: 15139

Regex gives compiler error

<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>

I took the regex above from "Remove all empty HTML tags?" and am trying to use it in C# as follows:

string regex= @"<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>";

That line alone produces many compile errors, such as "newline in constant" and "unrecognized escape sequence".

Could anybody help by pointing out what I am missing?

Upvotes: 2

Views: 243

Answers (3)

Mike Samuel

Reputation: 120516

You have double quotes inside the regexp that need to be escaped; as written, the first one ends the string literal.

 string regex= @"<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>";

should be

string regex= @"<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:\u0022[^\u0022]*\u0022|'[^']*'|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>";

Btw, because of the </\1\s*> at the end, this will only remove balanced tags that surround nothing but whitespace. It will match <p> </p> but not <img src=bogus onerror=alert(1337)>, which has no closing tag.

Even if all you want to do is remove balanced tags around whitespace, be aware that this will not match all such tags. Specifically, it will not match tags where the case of the tag name differs between the opening and closing tag: <p> </P>.

Finally, a single pass will not remove transitively empty tags: <i><b></b></i> becomes <i></i>, which is itself empty but is left in place.
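
The last two caveats can be worked around in code. The following is only a minimal sketch, not part of the original answer: the class name EmptyTagStripper and the method Strip are hypothetical, and it assumes that RegexOptions.IgnoreCase (which in .NET also applies to the \1 backreference) and repeating the replacement until nothing changes are acceptable for your input.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named here only for illustration.
public static class EmptyTagStripper
{
    // Same pattern as above. \u0022 is left untouched by the verbatim string
    // and interpreted by the regex engine as a double quote.
    // IgnoreCase lets the closing tag differ in case from the opening tag.
    private static readonly Regex EmptyTag = new Regex(
        @"<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:\u0022[^\u0022]*\u0022|'[^']*'|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>",
        RegexOptions.IgnoreCase);

    // Repeats the replacement until the text stops changing, so that tags
    // which become empty after an inner removal are removed as well.
    public static string Strip(string html)
    {
        string previous;
        do
        {
            previous = html;
            html = EmptyTag.Replace(html, string.Empty);
        } while (html != previous);
        return html;
    }
}
```

With this sketch, EmptyTagStripper.Strip("<i><b></b></i>") and EmptyTagStripper.Strip("<p> </P>") should both return an empty string, while "<p>text</p>" is returned unchanged. The first caveat still applies: unclosed tags such as the <img> example are not touched.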

Upvotes: 1

Andrew Cooper

Reputation: 32576

You need to use "" for double quotes inside the string:

string regex= @"<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:""[^""]*""|'[^']*'|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>";

Upvotes: 2

Bala R

Reputation: 108957

In a verbatim string (@"..."), a double quote ( " ) has to be escaped by doubling it ( "" ); backslashes need no escaping.

Try this:

string regex= @"<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:""[^""]*""|'[^']*'|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>";
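
To see what the doubling does, here is a small sketch (the class and method names are made up for illustration) that writes a fragment of the pattern once as a verbatim string and once as a regular string. Both should produce the same characters; the only difference is how quotes and backslashes are escaped in the source code.

```csharp
using System;

// Hypothetical demo class, not part of the question's code.
public static class VerbatimQuoteDemo
{
    // Verbatim string: "" stands for one double quote, \ is a literal backslash.
    public static string Verbatim() => @"\s*=\s*(?:""[^""]*""|'[^']*')";

    // Regular string: \" stands for one double quote, \\ for one backslash.
    public static string Regular() => "\\s*=\\s*(?:\"[^\"]*\"|'[^']*')";

    public static void Main()
    {
        // Prints the fragment as the regex engine will receive it.
        Console.WriteLine(Verbatim());
        Console.WriteLine(Verbatim() == Regular());
    }
}
```

The verbatim form is usually easier to read for regexes, because the pattern's own backslashes do not have to be doubled.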

Upvotes: 0
