user1203653

Reputation: 11

How to remove HTML tags and the contents between them in C#?

I want to remove a tag, together with everything between its opening and closing tags, from my source.

The following is my source:

<tr>
  <td class="ds_label" width="40%" style="font-size: 70%;"></td>
  <td id="table_cell_1585" class="ds_label">
    <a class="tt" href="#" onClick="return false;">
      <table class="tooltip" style="width:300px;" cellpadding="0" cellspacing="0" border=0>
    </a>
  </td>
  <td class="ds_data" width="60%" style="font-size: 70%">800 x 480 pixels</td>
</tr>

I want to remove the whole <a> tag along with its content.

I tried the following, where response contains my source:

response = Regex.Replace(response, "<a>(.|\n)*?</a>", string.Empty);

But it's not working.

Please advise.

Upvotes: 1

Views: 3784

Answers (4)

Romesh Somani

Reputation: 373

Use this:

variable = Server.HtmlDecode(variable).Trim();

Upvotes: 1

Sam Greenhalgh

Reputation: 6136

  • Firstly, try to avoid using regex to work with HTML. It is the wrong tool because there are too many edge cases for it to be reliable or secure. Use a library designed to work with structured documents, such as the HTMLAgilityPack.

  • When you define a regular expression with a string literal in C#, it's a good idea to use a verbatim string literal (prefixed with @) so that escape characters in the pattern aren't interpreted as part of the string literal. In this question, @"<a>(.|\n)*?</a>" stops the \n from being treated as an escape sequence by C#, so it reaches the regex engine intact.

  • Newlines can consist of \r, \n, or both.

  • HTML a tags contain attributes like href, so <a> is unlikely to match anything because of the closing >.

  • Use RegexOptions.Singleline in the options argument to ensure . matches any character, including newlines.


This unit test succeeds.

[Test]
public void Test()
{
    Regex pattern = new Regex(@"<a.*?</a>", RegexOptions.Singleline);
    string input = "foo <a href=\"//example.com\">\r\nbaz</a> bar";
    string expected = "foo  bar";
    string actual = pattern.Replace(input, string.Empty);
    Assert.AreEqual(expected, actual);
}

However, be aware that this is not a secure way of handling user input or any kind of data that is not pre-defined because regular expressions like this can easily be evaded.
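
To make the points above concrete, here is a minimal sketch that runs the question's pattern and a corrected one side by side. The class and method names (AnchorRegexDemo, StripWithQuestionPattern, StripAnchors) and the sample input are made up for illustration; only Regex.Replace and RegexOptions.Singleline come from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorRegexDemo
{
    // The pattern from the question, unchanged. The literal "<a>" only matches
    // an anchor tag with no attributes at all.
    public static string StripWithQuestionPattern(string html) =>
        Regex.Replace(html, "<a>(.|\n)*?</a>", string.Empty);

    // Hypothetical corrected helper: [^>]* allows attributes inside the opening tag,
    // and Singleline lets . match \r and \n as well.
    public static string StripAnchors(string html) =>
        Regex.Replace(html, @"<a\b[^>]*>.*?</a>", string.Empty, RegexOptions.Singleline);

    public static void Main()
    {
        string input = "x <a class=\"tt\" href=\"#\">\r\n  inner\r\n</a> y";

        // Nothing matches, so the input comes back unchanged.
        Console.WriteLine(StripWithQuestionPattern(input) == input); // True

        // The whole anchor, contents included, is removed.
        Console.WriteLine(StripAnchors(input) == "x  y"); // True

        // Regular vs. verbatim literal: one newline character vs. backslash + 'n'.
        Console.WriteLine("\n".Length);  // 1
        Console.WriteLine(@"\n".Length); // 2
    }
}
```

The same caveat applies to this sketch: it copes with the sample input, not with arbitrary or hostile HTML.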

Upvotes: 2

Zasz

Reputation: 12528

Try this regex:

<a\b[^>]*>(.*?)</a>

[TestMethod]
public void TestMethod1()
{
    var source =
        @"
<tr>
  <td class='ds_label' width='40%' style='font-size: 70%;'></td>
  <td id='table_cell_1585' class='ds_label'>
    <a class='tt' href='#' onClick='return false;'>
      <table class='tooltip' style='width:300px;' cellpadding='0' cellspacing='0' border=0>
    </a>
  </td>
  <td class='ds_data' width='60%' style='font-size: 70%'>800 x 480 pixels</td>
</tr>";
    // Singleline lets . match newlines, so the whole <a>...</a> block
    // is removed together with its contents, not just the two tags.
    source = Regex.Replace(source, @"<a\b[^>]*>(.*?)</a>", string.Empty, RegexOptions.Singleline);
    Console.Write(source);
}

Upvotes: 0

Bryan Crosby

Reputation: 6554

Regex is not a good tool for parsing HTML. Take a look at HTMLAgilityPack instead to save yourself some work.
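
As a rough sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced (it is third-party, not part of the base class library). The class name RemoveAnchorsSketch, the helper RemoveAnchors, and the sample markup are invented for illustration:

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package (assumed to be installed)

public static class RemoveAnchorsSketch
{
    // Hypothetical helper: returns the markup with every <a> element
    // and everything inside it removed.
    public static string RemoveAnchors(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a");
        if (anchors != null)
        {
            foreach (var anchor in anchors)
            {
                anchor.Remove(); // detaches the node and its children from the document
            }
        }

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        string html = "<div><a class=\"tt\" href=\"#\">tooltip</a><span>800 x 480 pixels</span></div>";
        Console.WriteLine(RemoveAnchors(html));
    }
}
```

Because the parser works on elements rather than text patterns, attributes and line breaks inside the tag need no special handling.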

Upvotes: 2
