Reputation: 65
Imagine an HTML document similar to this:
<div>
<div>...</div>
<table>...</table>
<p>...</p>
<p>...</p>
<p>...</p>
<table>...</table>
<p>...</p>
<div>...</div>
<p>...</p>
<p>...</p>
</div>
I would like to take the first sequence of paragraph nodes. I have tried iterating over the collection of p nodes, checking nextSibling until I find a node whose name is not p, but the next sibling is always a text node.
More specifically, I want to get the first block of text from a Wikipedia page: all the paragraphs before the first non-paragraph element, such as a table of contents, or before the end of the page on pages that have no such element. In the example above, I would like to get an HtmlDocument containing the first three paragraphs.
I could do this by converting the document to a string and using IndexOf. However, I would prefer a more generic solution because I don't know what I am going to find in Wikipedia pages.
Upvotes: 1
Views: 742
Reputation: 115057
You can use SkipWhile and TakeWhile in combination with the list of children of the div.
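// Requires: using System; using System.Linq;
// "/div/*" selects element children only, so the text nodes between tags are skipped.
var children = doc.DocumentNode.SelectNodes("/div/*");
var paragraphs = children
    .SkipWhile(child => !string.Equals(child.Name, "p", StringComparison.OrdinalIgnoreCase)) // skip up to the first <p>
    .TakeWhile(child => string.Equals(child.Name, "p", StringComparison.OrdinalIgnoreCase)); // keep the consecutive <p> run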
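For completeness, here is a minimal, self-contained sketch of the same idea. It assumes the HtmlAgilityPack NuGet package is the parser in use; the class name FirstParagraphRun, the method Extract, and the containerXPath parameter are hypothetical names introduced only for this illustration. The null check is there because SelectNodes returns null when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: HtmlAgilityPack is installed from NuGet

public static class FirstParagraphRun
{
    // Hypothetical helper: returns the first consecutive run of <p> element
    // children of the node matched by containerXPath.
    public static List<HtmlNode> Extract(string html, string containerXPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "*" matches element children only, so text and comment nodes are ignored.
        var children = doc.DocumentNode.SelectNodes(containerXPath + "/*");
        if (children == null)
        {
            // SelectNodes returns null when the XPath matches nothing.
            return new List<HtmlNode>();
        }

        return children
            .SkipWhile(n => !IsParagraph(n)) // drop everything before the first <p>
            .TakeWhile(n => IsParagraph(n))  // stop at the first non-<p> element
            .ToList();
    }

    private static bool IsParagraph(HtmlNode node) =>
        string.Equals(node.Name, "p", StringComparison.OrdinalIgnoreCase);

    public static void Main()
    {
        // Sample markup shaped like the document in the question.
        const string html =
            "<div><div>intro</div><table><tr><td>t</td></tr></table>" +
            "<p>one</p>\n<p>two</p>\n<p>three</p>" +
            "<table><tr><td>t</td></tr></table><p>four</p></div>";

        foreach (var p in Extract(html, "/div"))
        {
            Console.WriteLine(p.InnerText);
        }
    }
}
```

For the sample markup this should yield only the first three paragraphs, since the run ends at the second table. The container XPath is passed in as a parameter because a real Wikipedia page will not have the div at the document root; the correct path has to be found by inspecting the page.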
Upvotes: 1