Reputation: 1091
I have an HTML document that contains lots of needless blank lines which I'd like to remove. Here's a sample of the HTML:
<html>
<head>
</head>
<body>
<h1>Heading</h1>
<p>Testing
I tried the following code, but it removes every newline. I only want to remove the blank lines.
static string RemoveLineReturns(string html)
{
    html = html.Replace(Environment.NewLine, "");
    return html;
}
Any idea how to do this with HTMLAgilityPack? Thanks, J.
Upvotes: 3
Views: 6714
Reputation: 258
In these days of LINQ, I suggest the following:
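// Requires: using System.Linq; using System.Text.RegularExpressions;
Regex r = new Regex(@"\S", RegexOptions.Compiled);
// Split into lines, drop empty ones, then keep only lines with a non-whitespace character.
var cleanHtml = string.Join(Environment.NewLine,
    dirtyHtml.Split(new char[] { '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries)
             .Where(l => r.IsMatch(l)));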
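As a rough, self-contained sketch of the same split-and-filter idea, the helper below swaps the regex for `string.IsNullOrWhiteSpace`. The names `LineFilter` and `DropBlankLines` are made up for illustration. One thing to be aware of: joining with `Environment.NewLine` normalizes every line ending to the platform default and drops any trailing newline.

```csharp
using System;
using System.Linq;

public static class LineFilter
{
    // Hypothetical helper: keeps only lines that contain at least one
    // non-whitespace character, then rejoins them with the platform newline.
    public static string DropBlankLines(string dirtyHtml)
    {
        var lines = dirtyHtml
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(line => !string.IsNullOrWhiteSpace(line));

        return string.Join(Environment.NewLine, lines);
    }
}
```

For example, `LineFilter.DropBlankLines("<body>\r\n\r\n   \r\n<h1>Heading</h1>\r\n")` yields the two lines `<body>` and `<h1>Heading</h1>` separated by a single `Environment.NewLine`.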
Upvotes: 0
Reputation: 89325
One possible way using Html Agility Pack:
var doc = new HtmlDocument();
// TODO: load your HtmlDocument here
// Select all empty text nodes (those containing only whitespace):
var xpath = "//text()[not(normalize-space())]";
var emptyNodes = doc.DocumentNode.SelectNodes(xpath);
// SelectNodes can return null when nothing matches, so guard before iterating.
if (emptyNodes != null)
{
    // Replace each empty text node with a single newline text node.
    foreach (HtmlNode emptyNode in emptyNodes)
    {
        emptyNode.ParentNode.ReplaceChild(
            HtmlTextNode.CreateNode(Environment.NewLine), emptyNode);
    }
}
Upvotes: 5
Reputation: 9564
I don't think that HTMLAgilityPack currently features a native solution for that.
For such scenarios I use the following Regex:
html = Regex.Replace(html, @"( |\t|\r?\n)\1+", "$1");
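This collapses each run of repeated spaces, tabs, or newlines into a single occurrence of that same character, so the original line-ending style is preserved. Note that it also collapses repeated spaces and tabs inside lines (including indentation), and a line that contains only spaces or tabs is shortened rather than removed.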
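If the goal is only to drop blank lines and leave indentation alone, a line-anchored pattern is one alternative. The following is a minimal sketch, not a tested production solution; the names `BlankLineRemover` and `RemoveBlankLines` are hypothetical. It relies on `RegexOptions.Multiline` so that `^` matches at the start of every line.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BlankLineRemover
{
    // Hypothetical helper: deletes every line that is empty or contains only
    // spaces and tabs, together with its line ending (\n or \r\n).
    // A whitespace-only final line with no trailing newline is left untouched.
    public static string RemoveBlankLines(string html)
    {
        return Regex.Replace(html, @"^[ \t]*\r?\n", "", RegexOptions.Multiline);
    }
}
```

For example, `BlankLineRemover.RemoveBlankLines("<body>\n\n  \n<h1>Heading</h1>\n")` returns `"<body>\n<h1>Heading</h1>\n"`. Like any text-level approach, it does not know about `<pre>` or `<textarea>` content, where blank lines may be significant.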
Upvotes: 2