Xaisoft

Reputation: 46641

Regular Expression does not remove html comment?

I have the following string:

<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>

I want to end up with:

<TD>6949</TD>

but instead I end up with just the tags and no information:

<TD></TD>

This is the regular expression I am using:

RegEx.Replace("<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>","<!--.*-->","")

Can someone explain how to keep the numbers and remove just the comments? Also, if possible, can someone explain why this is happening?

Upvotes: 0

Views: 340

Answers (3)

Mrchief

Reputation: 76238

Parsing HTML with regex is always going to be tricky. Instead, use something like HTML Agility Pack, which lets you query and parse HTML in a structured manner.
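To illustrate the structured approach, here is a minimal sketch. It does not use HTML Agility Pack; it swaps in Python's standard-library html.parser as a stand-in, and the names TextOnly and cell_text are made up for this example. The point is only that a parser reports comments and text as separate events, so the comments can be ignored without any pattern matching.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes only (hypothetical helper for this sketch).

    Comments are delivered to handle_comment, which is not overridden
    here, so they are dropped.
    """

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for plain text between tags, never for comment bodies.
        self.parts.append(data)


def cell_text(html):
    """Return the text content of an HTML fragment, skipping comments."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(cell_text("<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>"))  # 6949
```

This returns only the cell's text; rebuilding the surrounding markup would take more code, or a library that exposes a full document tree.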

Upvotes: 2

zellio

Reputation: 32514

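.* is greedy, so it matches as many characters as possible: in this case, everything from the opening of the first comment to the end of the second. Changing it to .*? fixes this, because the ? makes the match lazy, which is to say it matches as few characters as possible. [^>]* also works here, because it cannot run past a >, so each match stops at the end of its own comment (as long as the comment text itself contains no >).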
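The difference is easy to see by running all three patterns against the string from the question. This is a sketch in Python's re module rather than the .NET Regex class used in the question; the greedy and lazy behaviour of .* and .*? is the same in both, and the variable names are just for illustration.

```python
import re

html = "<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>"

# Greedy: .* runs to the LAST "-->", swallowing "6949" along the way.
greedy = re.sub(r"<!--.*-->", "", html)

# Lazy: .*? stops at the FIRST "-->", so each comment is matched separately.
lazy = re.sub(r"<!--.*?-->", "", html)

# Negated class: [^>]* cannot cross a ">", so it also stops at the first "-->".
negated = re.sub(r"<!--[^>]*-->", "", html)

print(greedy)   # <TD></TD>
print(lazy)     # <TD>6949</TD>
print(negated)  # <TD>6949</TD>
```

In .NET the equivalent change should just be swapping the pattern argument to "<!--.*?-->" in the original Replace call.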

Upvotes: 2

SLaks

Reputation: 887877

.* is a greedy quantifier, which matches as much as possible.
It's matching everything up to the last -->.

Change it to .*?, which is a lazy quantifier.

Upvotes: 3
