Reputation: 53125
I've converted a large document from Word to HTML. It's close, but I have a bunch of "code" nodes that I'd like to merge into one "pre" node.
Here's the input:
<p>Here's a sample MVC Controller action:</p>
<code> public ActionResult Index()</code>
<code> {</code>
<code> return View();</code>
<code> }</code>
<p>We'll start by making the following changes...</p>
I want to turn it into this, instead:
<p>Here's a sample MVC Controller action:</p>
<pre class="brush: csharp"> public ActionResult Index()
{
return View();
}</pre>
<p>We'll start by making the following changes...</p>
I ended up writing a brute-force loop that iterates nodes looking for consecutive ones, but this seems ugly to me:
HtmlDocument doc = new HtmlDocument();
doc.Load(file);
var nodes = doc.DocumentNode.ChildNodes;
string contents = string.Empty;
foreach (HtmlNode node in nodes)
{
    if (node.Name == "code")
    {
        contents += node.InnerText + Environment.NewLine;
        // NextSibling is null for the last node, so use ?. to avoid a NullReferenceException.
        if (node.NextSibling?.Name != "code" &&
            !(node.NextSibling?.Name == "#text" && node.NextSibling.NextSibling?.Name == "code")
           )
        {
            // Last node of a run: turn it into the merged <pre>.
            node.Name = "pre";
            node.Attributes.RemoveAll();
            node.SetAttributeValue("class", "brush: csharp");
            node.InnerHtml = contents;
            contents = string.Empty;
        }
    }
}
// SelectNodes returns null when nothing matches.
nodes = doc.DocumentNode.SelectNodes(@"//code");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        node.Remove();
    }
}
Normally I'd remove the nodes in the first loop, but that doesn't work during iteration since you can't change the collection as you iterate over it.
Better ideas?
Upvotes: 0
Views: 3125
Reputation: 32333
The first approach: select all the <code> nodes, group them, and create a <pre> node per group:
var idx = 0;
var nodes = doc.DocumentNode
    .SelectNodes("//code")
    .GroupBy(n => new {
        Parent = n.ParentNode,
        Index = n.NextSiblingIsCode() ? idx : idx++
    });
foreach (var group in nodes)
{
    var pre = HtmlNode.CreateNode("<pre class='brush: csharp'></pre>");
    pre.AppendChild(doc.CreateTextNode(
        string.Join(Environment.NewLine, group.Select(g => g.InnerText))
    ));
    group.Key.Parent.InsertBefore(pre, group.First());
    foreach (var code in group)
        code.Remove();
}
The grouping key here combines the parent node with a group index, which is incremented each time the end of a group is reached.
I also used a NextSiblingIsCode extension method here:
public static bool NextSiblingIsCode(this HtmlNode node)
{
    return (node.NextSibling != null && node.NextSibling.Name == "code") ||
        (node.NextSibling is HtmlTextNode &&
         node.NextSibling.NextSibling != null &&
         node.NextSibling.NextSibling.Name == "code");
}
It is used to determine whether the next sibling (skipping a single text node in between) is a <code> node.
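To see the counter-based grouping key in isolation, here is a minimal sketch that runs without HtmlAgilityPack. It models the siblings as a plain list of tag names; the `RunGrouping` class and `GroupConsecutive` method are hypothetical names made up for this illustration, and the "next sibling is code" check is simplified to a lookup of the next list element.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RunGrouping
{
    // Returns the indices of every element equal to `name`, split into runs
    // of consecutive occurrences. Same trick as above: the key stays the same
    // while the next element continues the run, and the counter is bumped
    // on the last element of each run.
    public static List<List<int>> GroupConsecutive(IReadOnlyList<string> tags, string name)
    {
        var idx = 0;
        return Enumerable.Range(0, tags.Count)
            .Where(i => tags[i] == name)
            // Relies on GroupBy calling the key selector once per element, in order.
            .GroupBy(i => (i + 1 < tags.Count && tags[i + 1] == name) ? idx : idx++)
            .Select(g => g.ToList())
            .ToList();
    }

    public static void Main()
    {
        var tags = new[] { "p", "code", "code", "code", "p", "code", "code" };
        foreach (var run in GroupConsecutive(tags, "code"))
            Console.WriteLine(string.Join(",", run));
        // prints:
        // 1,2,3
        // 5,6
    }
}
```

Note that the key selector has a side effect, so the query should be enumerated only once; materializing it with ToList, as in this sketch, keeps that explicit.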
The second approach: select only the first <code> node of each group, then walk from each of these nodes through the following <code> siblings until the first non-<code> node. I used XPath here:
var nodes = doc.DocumentNode.SelectNodes(
    "//code[name(preceding-sibling::*[1])!='code']"
);
foreach (var node in nodes)
{
    var pre = HtmlNode.CreateNode("<pre class='brush: csharp'></pre>");
    node.ParentNode.InsertBefore(pre, node);
    var content = string.Empty;
    var next = node;
    do
    {
        content += next.InnerText + Environment.NewLine;
        var previous = next;
        next = next.SelectSingleNode("following-sibling::*[1][name()='code']");
        previous.Remove();
    } while (next != null);
    pre.AppendChild(doc.CreateTextNode(
        content.TrimEnd(Environment.NewLine.ToCharArray())
    ));
}
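If it helps to check what the XPath predicate in the snippet above actually matches, here is a small sketch that evaluates the same expression with the framework's own System.Xml.Linq and System.Xml.XPath instead of HtmlAgilityPack. That swap is an assumption made only to keep the example dependency-free, and it requires well-formed XML input; the `FirstOfRunDemo` class and `FirstCodeOfEachRun` method are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class FirstOfRunDemo
{
    // Returns the text of every <code> element whose nearest preceding
    // element sibling is not itself a <code>, i.e. the first node of each run.
    public static string[] FirstCodeOfEachRun(string xml)
    {
        var doc = XDocument.Parse(xml);
        return doc
            .XPathSelectElements("//code[name(preceding-sibling::*[1])!='code']")
            .Select(e => e.Value)
            .ToArray();
    }

    public static void Main()
    {
        var xml = "<root><p>intro</p><code>A1</code><code>A2</code>" +
                  "<p>text</p><code>B1</code><code>B2</code></root>";
        Console.WriteLine(string.Join(", ", FirstCodeOfEachRun(xml)));
        // prints "A1, B1"
    }
}
```

A <code> element with no preceding element sibling also matches, because name() of an empty node-set is the empty string.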
Upvotes: 2
Reputation: 8098
Sanitize the HTML you want to parse. See "HTML Agility Pack strip tags NOT IN whitelist".
Upvotes: 0