Reputation: 685
Is it possible to use regex to remove HTML tags inside a particular block of HTML?
E.g.
<body>
<p>Hello World!</p>
<table>
<tr>
<td>
<p>My First HTML Table</p>
</td>
</tr>
</table>
I don't want to remove all <p> tags, only those within the table element.
Ideally, I'd like to be able to either remove or retain the text inside the nested <p> tag.
Thanks.
Upvotes: 0
Views: 872
Reputation: 32343
Using regex to parse HTML is widely discouraged, so you could use Html Agility Pack for this instead:
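using System.IO;
using HtmlAgilityPack;

var html = @"
<body>
<p>Hello World!</p>
<table>
<tr>
<td>
<p>My First HTML Table</p>
</td>
</tr>
</table>";

HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);

// SelectNodes returns null (not an empty collection) when nothing matches.
var nodes = document.DocumentNode.SelectNodes("//table//p");
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        // Swap each <p> for its inner content, which keeps the text.
        node.ParentNode.ReplaceChild(
            HtmlNode.CreateNode(node.InnerHtml),
            node
        );
    }
}

// Serialize the modified document back to a string.
string result = null;
using (StringWriter writer = new StringWriter())
{
    document.Save(writer);
    result = writer.ToString();
}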
After these manipulations, you'll get the following result:
<body>
<p>Hello World!</p>
<table>
<tr>
<td>
My First HTML Table
</td>
</tr>
</table></body>
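The question also asks about dropping the text together with the tag. A minimal sketch of that variant, assuming the same Html Agility Pack setup as above (the class and method names here are made up for illustration), could call `Remove()` on each matched node instead of replacing it:

```csharp
using System;
using HtmlAgilityPack;

class RemoveTableParagraphs
{
    // Hypothetical helper: deletes every <p> inside a <table>, text included.
    public static string Remove(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null when the XPath matches nothing.
        var nodes = document.DocumentNode.SelectNodes("//table//p");
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                // Detach the <p> and everything it contains from its parent.
                node.Remove();
            }
        }

        return document.DocumentNode.OuterHtml;
    }

    static void Main()
    {
        string html = "<p>Hello World!</p><table><tr><td><p>My First HTML Table</p></td></tr></table>";
        Console.WriteLine(Remove(html));
    }
}
```

The `<p>` outside the table is untouched because the XPath only matches descendants of `<table>`.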
Upvotes: 5
Reputation: 49245
It's possible to some extent, but not reliable.
I would rather suggest looking at an HTML parser such as Html Agility Pack.
Upvotes: 0
Reputation: 14906
<td>[\r\n\s]*<p>([^<]*)</p>[\r\n\s]*</td>
The round brackets denote a numbered capture group which will contain your text.
However, using regular expressions in this way relies on a lot of assumptions regarding the content of the <p> tag and the construction of the HTML.
Have a read of the ubiquitous SO question regarding using regular expressions to parse (X)HTML and see @Bruno's answer for a more robust solution.
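As a rough sketch of how that pattern might be applied in C# with `Regex.Replace` (the class and method names are made up for illustration), the capture group lets you either keep or drop the text. It carries the same assumptions as the pattern itself: the `<p>` is the only thing in the `<td>`, neither tag has attributes, and the paragraph holds plain text only.

```csharp
using System;
using System.Text.RegularExpressions;

class StripTableParagraphs
{
    // The pattern from this answer; group 1 captures the text inside <p>.
    const string Pattern = @"<td>[\r\n\s]*<p>([^<]*)</p>[\r\n\s]*</td>";

    // Keep the text and drop only the <p> wrapper.
    public static string Unwrap(string html) =>
        Regex.Replace(html, Pattern, "<td>$1</td>");

    // Drop the <p> element together with its text.
    public static string Remove(string html) =>
        Regex.Replace(html, Pattern, "<td></td>");

    static void Main()
    {
        string html = "<p>Hello World!</p><table><tr><td>\n<p>My First HTML Table</p>\n</td></tr></table>";
        Console.WriteLine(Unwrap(html));
        Console.WriteLine(Remove(html));
    }
}
```

Note that the whitespace around the `<p>` inside the cell is consumed by the match, so it does not survive the replacement.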
Upvotes: 1
Reputation: 1974
I found this link, where what seems to be the exact same question was asked:
"I have an HTML document in .txt format containing multiple tables and other texts and I am trying to delete any HTML (anything within "<>") if it's inside a table (between <table> and </table>). For example:"
Regex to delete HTML within <table> tags
Upvotes: 1