Reputation: 764
I'm parsing site content using AngleSharp and I've run into an issue with an anonymous block.
See the sample code:
var parser = new HtmlParser();
var document = parser.Parse(@"<body>
<div class='product'>
<a href='#'><img src='img1.jpg' alt=''></a>
Hello, world
<div class='comments-likes'>1</div>
</div>
<div class='product'>
<a href='#'><img src='img2.jpg' alt=''></a>
Yet another helloworld
<div class='comments-likes'>25</div>
</div>
</body>");
var products = document.QuerySelectorAll("div.product");
foreach (var product in products)
{
var productTitle = product.Text();
productTitle.Dump();
}
As a result, productTitle also contains the numbers from div.comments-likes. The output is:
Hello, world 1
Yet another helloworld 25
I've tried something like product.FirstElementChild.NextElementSibling.Text();
but the next element sibling of the link is div.comments-likes, not the anonymous block. It shows:
1
25
So, anonymous blocks are skipped. :(
The best workaround I've found is to remove all the interfering blocks first. For my example:
product.QuerySelector(".comments-likes").Remove();
var productTitle = product.Text().Trim();
Is there a better way to get the text from an anonymous block?
Upvotes: 5
Views: 1282
Reputation: 89285
Text is modeled as a TextNode, which is one type of node alongside elements, comment nodes, processing instructions, etc. That's why the NextElementSibling you tried didn't include the text in the result: it returns elements only, as the name suggests.
You can get the text nodes located directly within the product div by traversing the div's ChildNodes and filtering by NodeType, for example:
var products = document.QuerySelectorAll("div.product");
foreach (var product in products)
{
var productTitle = product.ChildNodes
.First(o => o.NodeType == AngleSharp.Dom.NodeType.Text
&& o.TextContent.Trim() != "");
Console.WriteLine(productTitle.TextContent.Trim());
}
Note that the newlines between elements are also text nodes, so the demo above filters out whitespace-only nodes.
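If a product could contain more than one direct text node (for example, text both before and after the link), the same filter can collect all of them instead of only the first. The sketch below is just one way to do it: `ProductText` and `DirectText` are hypothetical names, not part of AngleSharp, and the helper relies only on `ChildNodes`, `NodeType` and `TextContent` from `AngleSharp.Dom`.

```csharp
using System.Linq;
using AngleSharp.Dom;

public static class ProductText
{
    // Hypothetical helper (not an AngleSharp API): joins the text nodes that are
    // direct children of `node`, skipping whitespace-only nodes and any text
    // that lives inside child elements such as div.comments-likes.
    public static string DirectText(INode node)
    {
        var parts = node.ChildNodes
            .Where(n => n.NodeType == NodeType.Text)   // keep text nodes only
            .Select(n => n.TextContent.Trim())         // drop surrounding newlines
            .Where(t => t.Length > 0);                 // drop whitespace-only nodes

        return string.Join(" ", parts);
    }
}
```

Inside the loop from the question, this would be called as `var productTitle = ProductText.DirectText(product);`. The question's code uses `parser.Parse(...)`, which matches AngleSharp 0.9.x; in later versions the equivalent method is `ParseDocument(...)`.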
Upvotes: 3