roryok

Reputation: 9645

Putting orphaned text into tags with HTMLAgilityPack

How do I convert a piece of HTML like this

<div>
     some text
     <br/>
     goes in here
     <br/>
     with only br tags
     <br/>
     to separate it
     <br/>
</div>

to this

<div>
     <p>some text</p>
     <p>goes in here</p>
     <p>with only br tags</p>
     <p>to separate it</p>
</div>

using HTML Agility Pack in C#?

Upvotes: 2

Views: 609

Answers (2)

roryok

Reputation: 9645

I took a slightly different approach: treating the InnerHtml of the div as text, I split it on the br tags and wrapped each non-empty piece in a p element. It's a bit of a hack, but it works.

using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

var html = @"<div>
     some text
     <br/>
     goes in here
     <br/>
     with only br tags
     <br/>
     to separate it
     <br/>
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// snapshot the divs so each one can be modified safely inside the loop
var divs = doc.DocumentNode.Descendants("div").ToList();

foreach (var div in divs)
{
    // create a list of p nodes
    var ps = new List<HtmlNode>();

    // split on the common spellings of the br tag (the sample input uses "<br/>")
    var texts = div.InnerHtml.Split(new[] { "<br>", "<br/>", "<br />" }, StringSplitOptions.None);

    // iterate over split text
    foreach (var text in texts)
    {
        // if the line is not empty, add it to the collection
        if (!string.IsNullOrWhiteSpace(text))
        {
            var p = doc.CreateElement("p");
            // trim the surrounding newlines and indentation
            p.AppendChild(doc.CreateTextNode(text.Trim()));
            ps.Add(p);
        }
    }

    // join the p collection and paste it into the div
    div.InnerHtml = string.Join("", ps.Select(x => x.OuterHtml));
}
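
Splitting on literal separators can miss spellings such as `<BR>` or `<br  />`. A minimal sketch of a more tolerant split, assuming the br tags carry no attributes, uses a regular expression on the inner HTML string. The `BrSplitter` class and `WrapInParagraphs` method are hypothetical names for illustration, not part of HTML Agility Pack, and the sketch is still string-based, so it shares the same "hack" caveat.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class BrSplitter
{
    // Matches <br>, <br/>, <br /> in any letter case.
    // Assumption: br tags have no attributes (e.g. <br clear="all"> is not matched).
    private static readonly Regex BrTag =
        new Regex(@"<br\s*/?>", RegexOptions.IgnoreCase);

    // Splits the inner HTML on br tags and wraps each non-empty piece in <p>...</p>.
    public static string WrapInParagraphs(string innerHtml)
    {
        var paragraphs = BrTag.Split(innerHtml)
            .Select(part => part.Trim())          // drop newlines and indentation
            .Where(part => part.Length > 0)       // skip empty pieces
            .Select(part => "<p>" + part + "</p>");

        return string.Join("", paragraphs);
    }
}
```

Inside the loop above, this could be used as `div.InnerHtml = BrSplitter.WrapInParagraphs(div.InnerHtml);`.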

Upvotes: 0

har07

Reputation: 89335

One possible way:

using System;
using HtmlAgilityPack;

var html = @"<div>
     some text
     <br/>
     goes in here
     <br/>
     with only br tags
     <br/>
     to separate it
     <br/>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var div = doc.DocumentNode.SelectSingleNode("div");
//select all non-empty text nodes within <div>
var texts = div.SelectNodes("./text()[normalize-space()]");
foreach (var text in texts)
{
    //swap the text node for <p>trimmed text node content</p> in place,
    //which keeps the paragraphs in document order
    var p = doc.CreateElement("p");
    p.AppendChild(doc.CreateTextNode(text.InnerText.Trim()));
    div.ReplaceChild(p, text);
}
//remove all <br/> tags within <div>
foreach (var selectNode in div.SelectNodes("./br"))
{
    selectNode.Remove();
}
//print result
Console.WriteLine(doc.DocumentNode.OuterHtml);

Upvotes: 1
