Warz

Reputation: 7776

Parse HTML by line breaks using HtmlAgilityPack

I am trying to parse a specific HTML string so that I can extract a set of lines separated by <br/> tags. The input HTML looks like this:

<div class="PlainText">
  DATE: 2013-10-28 20:00:43 -0500 <br/>
  Item 1: Text1 <br/>
  Item 1: Text1 <br/>
  Item 1: Text1 <br/>
  Item 1: Text1 <br/>
  <br/> //Notice this has two break lines, i would like to stop after seeing two consecutive break lines.
</div>

With this div inside a larger HTML document, I was able to select the div and reach its ChildNodes:

List<HtmlNode> nodes = htmlDoc.DocumentNode
                                    .Descendants("div")
                                    .Where(x => x.Attributes.Contains("class") &&
                                            x.Attributes["class"].Value.Contains("PlainText")).ToList();

I am not entirely sure where to go from here. I would like to read all the text until I see two consecutive line breaks, and then stop.

EDIT

I looked at the ChildNodes in the Visual Studio debugger and noticed that there actually aren't two consecutive <br/> nodes, but a single <br/> followed by a #text node whose InnerHtml is \n, a newline character.


Upvotes: 3

Views: 1334

Answers (2)

devshorts

Reputation: 8872

Something like this should work:

[Test]
public void Test()
{
    var x = ReadTillTwoBr(GetDivClass()).ToList();
}

public HtmlNode GetDivClass()
{
    var html = @"<html><div class=""PlainText"">
            DATE: 2013-10-28 20:00:43 -0500 <br/>
            Item 1: Text1 <br/>
            Item 1: Text1 <br/>
            Item 1: Text1 <br/>
            Item 1: Text1 <br/>
            <br   /> //Notice this has two break lines, i would like to stop after seeing two consecutive break lines.
            Item 3
        </div></html>";
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    return doc.DocumentNode
                .Descendants("div").First(x => x.Attributes.Contains("class") &&
                                                x.Attributes["class"].Value.Contains("PlainText"));

}

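public IEnumerable<string> ReadTillTwoBr(HtmlNode node)
{
    // Drop whitespace-only #text nodes so that a newline between
    // two <br/> tags does not hide the pair.
    var nonEmptyNodes = node.ChildNodes
        .Where(f => !(f.Name == "#text" && String.IsNullOrWhiteSpace(f.InnerHtml)))
        .ToList();

    for (var i = 0; i < nonEmptyNodes.Count; i++)
    {
        var n = nonEmptyNodes[i];

        // Look at the next node in the filtered list, not n.NextSibling,
        // because NextSibling would still be the whitespace #text node.
        var next = i + 1 < nonEmptyNodes.Count ? nonEmptyNodes[i + 1] : null;

        if (IsBr(n) && IsBr(next))
        {
            yield break;
        }

        if (n.Name == "#text")
        {
            yield return n.InnerText.Trim();
        }
    }
}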

public bool IsBr(HtmlNode n)
{
    return n != null && n.NodeType == HtmlNodeType.Element && n.Name == "br";
}

This returns the date line and the four item lines. Notice that it doesn't return the comment after the two br's.

EDIT:

I removed the empty #text nodes because, when there is a newline between the last two br tags, you actually get a #text node containing just that newline. I think this is where the confusion about newlines came from.
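To see that whitespace #text node for yourself, a small diagnostic along these lines may help. It is only a sketch: the ChildNodeDump class and its Describe method are hypothetical names made up for illustration, and the input is a shortened stand-in for the div from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ChildNodeDump
{
    // Hypothetical helper: labels each direct child of the first
    // div with class "PlainText" as an element name or a kind of #text.
    public static List<string> Describe(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var div = doc.DocumentNode.SelectSingleNode("//div[@class='PlainText']");
        if (div == null)
        {
            return new List<string>(); // no matching div in the input
        }

        return div.ChildNodes
            .Select(n => n.NodeType == HtmlNodeType.Text
                ? (string.IsNullOrWhiteSpace(n.InnerText) ? "#text(whitespace)" : "#text(content)")
                : n.Name)
            .ToList();
    }

    public static void Main()
    {
        // Two <br/> tags separated only by a newline.
        var html = "<div class=\"PlainText\">A <br/>\n<br/> B</div>";
        Console.WriteLine(string.Join(", ", Describe(html)));
    }
}
```

If a "#text(whitespace)" entry shows up between the two br entries, then NextSibling of the first br is that text node and not the second br, which is why the blank text nodes have to be skipped before looking for a pair.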

Upvotes: 0

Sergey Berezovskiy

Reputation: 236308

You can use the XPath //div[@class='PlainText'] to get the required div nodes. You can also check the next sibling node when taking child nodes from the div:

HtmlDocument doc = new HtmlDocument();
doc.Load("index.html");
Func<HtmlNode, bool> notTwoBrakes =
    n => n.Name != "br" || n.NextSibling == null || n.NextSibling.Name != "br";
var nodes = doc.DocumentNode.SelectNodes("//div[@class='PlainText']")
               .Select(div => div.ChildNodes.TakeWhile(notTwoBrakes));

I keep the lambda in a separate variable instead of inlining it, purely for readability. The condition works like this:

  • Check if next node is null, if it's null, then take current node
  • Check if next node is br node, if not - take current node
  • Check if current node is br node, if not - take current node
  • Otherwise stop taking child nodes

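The question's edit mentions a #text node holding only a newline between the two br tags, so a sibling-based check may never see two br nodes side by side. Below is a rough sketch of the same TakeWhile idea applied after filtering out blank text nodes. The BreakSplitter class and LinesBeforeDoubleBreak method are hypothetical names used for illustration, and the sketch assumes only the first matching div is of interest.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class BreakSplitter
{
    static bool IsBr(HtmlNode n) =>
        n != null && n.NodeType == HtmlNodeType.Element && n.Name == "br";

    static bool IsBlankText(HtmlNode n) =>
        n.NodeType == HtmlNodeType.Text && string.IsNullOrWhiteSpace(n.InnerText);

    // Hypothetical helper: returns the trimmed text lines of the first
    // div.PlainText that come before two consecutive <br/> tags.
    public static List<string> LinesBeforeDoubleBreak(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var div = doc.DocumentNode.SelectSingleNode("//div[@class='PlainText']");
        if (div == null)
        {
            return new List<string>(); // no matching div in the input
        }

        // Ignore whitespace-only text so adjacent <br/> tags become neighbours.
        var significant = div.ChildNodes.Where(n => !IsBlankText(n)).ToList();

        return significant
            // Stop at the first <br/> whose next significant node is also a <br/>.
            .TakeWhile((n, i) => !(IsBr(n)
                                   && i + 1 < significant.Count
                                   && IsBr(significant[i + 1])))
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .Select(n => n.InnerText.Trim())
            .ToList();
    }

    public static void Main()
    {
        var html = "<div class=\"PlainText\">A <br/>B <br/>\n<br/> C</div>";
        foreach (var line in LinesBeforeDoubleBreak(html))
        {
            Console.WriteLine(line);
        }
    }
}
```

The indexed overload of TakeWhile is used so the predicate can peek at the next entry of the filtered list instead of relying on NextSibling.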

Upvotes: 1
