How to parse text from anonymous block in AngleSharp?

Question

I'm parsing site content using AngleSharp and I've run into an issue with an anonymous block.

See the sample code:

var parser = new HtmlParser();
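// Markup as described below: a link, the bare title text, then div.comments-likes (href is a placeholder).
var document = parser.Parse(@"
<div class='product'>
    <a href='#'></a>
    Hello, world
    <div class='comments-likes'>1</div>
</div>
<div class='product'>
    <a href='#'></a>
    Yet another helloworld
    <div class='comments-likes'>25</div>
</div>
");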

var products = document.QuerySelectorAll("div.product");
foreach (var product in products)
{
    var productTitle = product.Text();
    productTitle.Dump();
}

As a result, productTitle also contains the number from div.comments-likes. The output is:

Hello, world 1

Yet another helloworld 25

I've tried something like product.FirstElementChild.NextElementSibling.Text(); but the next element sibling of the link is div.comments-likes, not the anonymous block. It prints:

1

25

So, anonymous blocks are skipped. :(

The best workaround I've found is to remove all the blocks that get in the way. For my example:

product.QuerySelector(".comments-likes").Remove();
var productTitle = product.Text().Trim();

Is there a better way to parse text from an anonymous block?

har07 · Accepted Answer

Text is modeled as a text node, one of several node types alongside elements, comments, processing instructions, etc. That's why the NextElementSibling you tried didn't include the text in the result: as the name suggests, it returns elements only.
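
A minimal sketch of that difference, assuming the same AngleSharp 0.9.x API as the question (HtmlParser.Parse) and a hypothetical one-product snippet. NextSibling walks every node type, so it should reach the text node that NextElementSibling skips:

```csharp
using System;
using AngleSharp.Parser.Html; // assumption: AngleSharp 0.9.x; newer versions use AngleSharp.Html.Parser and ParseDocument

class SiblingDemo
{
    static void Main()
    {
        // Hypothetical markup mirroring the structure described in the question.
        var document = new HtmlParser().Parse(
            "<div class='product'><a href='#'></a>Hello, world<div class='comments-likes'>1</div></div>");

        var link = document.QuerySelector("div.product").FirstElementChild;

        // Element-only navigation jumps straight to div.comments-likes.
        Console.WriteLine(link.NextElementSibling.TextContent);

        // Node-level navigation stops at the text node in between.
        Console.WriteLine(link.NextSibling.NodeType);
        Console.WriteLine(link.NextSibling.TextContent);
    }
}
```

This only works when the text immediately follows the link; if other nodes can sit in between, filtering ChildNodes is the more robust option.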

You can get the text nodes located directly within the product div by traversing the div's ChildNodes and filtering by NodeType, for example:

var products = document.QuerySelectorAll("div.product");
foreach (var product in products)
{
    var productTitle = product.ChildNodes
                              .First(o => o.NodeType == AngleSharp.Dom.NodeType.Text 
                                            && o.TextContent.Trim() != "");
    Console.WriteLine(productTitle.TextContent.Trim());
}


Notice that the newlines between elements are also text nodes, which is why the code above filters out whitespace-only nodes.
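
If a product can contain several direct text nodes, or none at all, the same filter can be wrapped in a small helper. The sketch below is one possible way to do it, not part of AngleSharp itself: OwnText is a hypothetical extension method, and the AngleSharp 0.9.x namespaces are assumed to match the question's HtmlParser.Parse usage.

```csharp
using System;
using System.Linq;
using AngleSharp.Dom;
using AngleSharp.Parser.Html; // assumption: AngleSharp 0.9.x, as in the question

static class ElementTextExtensions
{
    // Hypothetical helper (not part of AngleSharp): returns the text that sits
    // directly inside the element, ignoring text inside child elements.
    public static string OwnText(this IElement element)
    {
        var parts = element.ChildNodes
            .Where(n => n.NodeType == NodeType.Text)   // keep direct text nodes only
            .Select(n => n.TextContent.Trim())
            .Where(t => t.Length > 0);                 // drop whitespace-only nodes
        return string.Join(" ", parts);
    }
}

class Program
{
    static void Main()
    {
        // Hypothetical markup mirroring the structure described in the question.
        var document = new HtmlParser().Parse(@"
<div class='product'>
    <a href='#'></a>
    Hello, world
    <div class='comments-likes'>1</div>
</div>");

        foreach (var product in document.QuerySelectorAll("div.product"))
        {
            Console.WriteLine(product.OwnText());
        }
    }
}
```

Unlike First, this returns an empty string instead of throwing when a product has no direct text, and it leaves the document unmodified, so nothing has to be removed first.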
