gpupu

Reputation: 65

HtmlAgilityPack: get the first consecutive sequence of nodes with a given tag

Imagine an HTML document similar to this:

   <div>
      <div>...</div>
      <table>...</table>
      <p>...</p>
      <p>...</p>
      <p>...</p>
      <table>...</table>
      <p>...</p>
      <div>...</div>
      <p>...</p>
      <p>...</p>
   </div>

I would like to take the first sequence of consecutive paragraph nodes. I have tried iterating over the collection of p nodes and checking NextSibling until I find a node whose name is not p, but the next sibling is always a text node.

More specifically, I want to get the first part of the text from a Wikipedia page: all the paragraphs that come before the first non-paragraph element, such as a table of contents, or before the end of the page on pages that have no such element. In the example above, I would like to end up with an HtmlDocument containing the first three paragraphs.

I could do this by converting the document to a string and using IndexOf. However, I would prefer a more generic solution, because I don't know in advance what I am going to find in Wikipedia pages.

Upvotes: 1

Views: 742

Answers (1)

jessehouwing

Reputation: 115057

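You can use SkipWhile and TakeWhile in combination with the list of element children of the div. SkipWhile drops everything before the first p, and TakeWhile then keeps nodes until the first one that is not a p.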

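    // "/div/*" selects only the element children of the root div, so whitespace text nodes are not included.
    // SelectNodes returns null when nothing matches. Requires: using System.Linq;
    var children = doc.DocumentNode.SelectNodes("/div/*");
    var paragraphs = children
        .SkipWhile(child => !string.Equals(child.Name, "p", StringComparison.OrdinalIgnoreCase))
        .TakeWhile(child => string.Equals(child.Name, "p", StringComparison.OrdinalIgnoreCase));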
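
The NextSibling approach from the question probably fails because the whitespace between tags is parsed as text nodes, so the sibling right after a p is usually a #text node rather than the next element. Below is a minimal, self-contained sketch of the same SkipWhile/TakeWhile idea that works on ChildNodes and filters those text nodes out explicitly. The class name FirstParagraphRun, the helper Find, and the sample HTML are made up for illustration; only the HtmlAgilityPack and LINQ calls come from the libraries.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package

public static class FirstParagraphRun
{
    private static bool IsParagraph(HtmlNode node) =>
        string.Equals(node.Name, "p", StringComparison.OrdinalIgnoreCase);

    // Hypothetical helper: returns the first unbroken run of <p> elements
    // among the direct children of 'container' (empty if there is none).
    public static HtmlNode[] Find(HtmlNode container)
    {
        if (container == null) return Array.Empty<HtmlNode>();

        return container.ChildNodes
            .Where(n => n.NodeType == HtmlNodeType.Element) // drop whitespace text and comment nodes
            .SkipWhile(n => !IsParagraph(n))                // skip everything before the first <p>
            .TakeWhile(n => IsParagraph(n))                 // stop at the first non-<p> element
            .ToArray();
    }

    public static void Main()
    {
        // Sample markup shaped like the document in the question (contents invented).
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><div>intro</div><table></table>\n<p>one</p>\n<p>two</p>\n<p>three</p>\n<table></table><p>four</p></div>");

        // SelectSingleNode returns null when nothing matches; Find handles that case.
        var container = doc.DocumentNode.SelectSingleNode("/div");

        foreach (var p in Find(container))
            Console.WriteLine(p.InnerText);
    }
}
```

For a real Wikipedia page the container is not the root div, so the XPath passed to SelectSingleNode would have to be adapted to that page's markup.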

Upvotes: 1
