Jamie Carruthers

Reputation: 685

Remove HTML with Regex

Is it possible to use regex to remove HTML tags inside a particular block of HTML?

E.g.

<body>

<p>Hello World!</p>

<table>
    <tr>
        <td> 
          <p>My First HTML Table</p>
        </td>
    </tr>
</table>

I don't want to remove all <p> tags, only those within the <table> element.

Ideally, I'd be able to either remove or retain the text inside the nested <p> tag.

Thanks.

Upvotes: 0

Views: 872

Answers (4)

Oleks

Reputation: 32343

Regex is widely discouraged for parsing HTML, so you could use Html Agility Pack for this instead:

// Requires the HtmlAgilityPack NuGet package.
using System.IO;
using HtmlAgilityPack;

var html = @"
<body>

<p>Hello World!</p>

<table>
    <tr>
        <td> 
          <p>My First HTML Table</p>
        </td>
    </tr>
</table>";

HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);

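// SelectNodes returns null (not an empty collection) when nothing matches.
var nodes = document.DocumentNode.SelectNodes("//table//p");
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        // Replace each <p> with its inner content, dropping only the tags.
        // CreateNode handles a single node, so this assumes the <p> holds
        // one text node, as in the sample.
        node.ParentNode.ReplaceChild(
            HtmlNode.CreateNode(node.InnerHtml),
            node
        );
    }
}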

string result = null;
using (StringWriter writer = new StringWriter())
{
    document.Save(writer);
    result = writer.ToString();
}

After these manipulations, you'll get the following result:

<body>

<p>Hello World!</p>

<table>
    <tr>
        <td> 
          My First HTML Table
        </td>
    </tr>
</table></body>
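
The question also asks for the option of removing the text along with the tags. Below is a minimal sketch of both variants, assuming the HtmlAgilityPack package is referenced. The class name `TableParagraphs`, the method name `Strip`, and the `keepText` flag are hypothetical names introduced for illustration, and the exact behaviour of `RemoveChild` with `keepGrandChildren` may differ between library versions.

```csharp
using System.Linq;
using HtmlAgilityPack;

static class TableParagraphs
{
    // Hypothetical helper: strips <p> elements that sit inside a <table>.
    // keepText = true  -> drop only the tags, keep the content in place
    // keepText = false -> drop the element together with its content
    public static string Strip(string html, bool keepText)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        var nodes = document.DocumentNode.SelectNodes("//table//p");
        if (nodes != null)
        {
            // Copy to a list so the tree can be modified while iterating.
            foreach (HtmlNode node in nodes.ToList())
            {
                if (keepText)
                    node.ParentNode.RemoveChild(node, true); // keep the children
                else
                    node.Remove(); // remove the node and everything inside it
            }
        }

        return document.DocumentNode.OuterHtml;
    }
}
```

Calling `TableParagraphs.Strip(html, true)` should leave the first paragraph untouched, since it is not a descendant of a `<table>`.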

Upvotes: 5

VinayC

Reputation: 49245

It is possible to some extent, but not reliable.

I would rather suggest looking at an HTML parser such as HTML Agility Pack.

Upvotes: 0

Town

Reputation: 14906

<td>[\r\n\s]*<p>([^<]*)</p>[\r\n\s]*</td>

The round brackets denote a numbered capture group which will contain your text.

However, using regular expressions in this way relies on a lot of assumptions regarding the content of the <p> tag and the construction of the HTML.
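
As a rough sketch of how the pattern could be applied, here is a C# example using `Regex.Replace`. The class and method names are hypothetical, introduced only for illustration. Note the assumptions baked into the pattern: it keys on a bare `<td>` (no attributes) rather than on `<table>` itself, and it only matches when the `<p>` contains plain text with no nested tags.

```csharp
using System;
using System.Text.RegularExpressions;

static class StripParagraphInCell
{
    // The pattern from this answer, unchanged.
    const string Pattern = @"<td>[\r\n\s]*<p>([^<]*)</p>[\r\n\s]*</td>";

    // Keep the paragraph text (capture group 1), drop only the <p> tags.
    public static string KeepText(string html) =>
        Regex.Replace(html, Pattern, "<td>$1</td>");

    // Drop the <p> element together with its text.
    public static string DropText(string html) =>
        Regex.Replace(html, Pattern, "<td></td>");
}

static class Program
{
    static void Main()
    {
        string html =
            "<p>Hello World!</p><table><tr><td>\n  <p>My First HTML Table</p>\n</td></tr></table>";

        Console.WriteLine(StripParagraphInCell.KeepText(html));
        Console.WriteLine(StripParagraphInCell.DropText(html));
    }
}
```

Because the match spans the whole cell, the whitespace around the `<p>` is discarded as well. The `<p>Hello World!</p>` outside the table is left alone only because it is not wrapped in a `<td>`.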

Have a read of the ubiquitous SO question regarding using regular expressions to parse (X)HTML and see @Bruno's answer for a more robust solution.

Upvotes: 1

Bruno

Reputation: 1974

I found this link, where what appears to be the same question was asked:

"I have an HTML document in .txt format containing multiple tables and other texts and I am trying to delete any HTML (anything within "<>") if it's inside a table (between <table> and </table>). For example:"

Regex to delete HTML within <table> tags

Upvotes: 1
