I have an HTML document structured as:
<h3><a name="sect55">55</a></h3>
<p></p>
<p class="choice"><a href="#sect325"></a></p>
<h3><a name="sect56"></a></h3>
<p></p>
<p class="choice"><a href="#sect222"></a></p>
<h3><a name="sect57"></a></h3>
<p></p>
<p class="choice"><a href="#sect164"></a></p>
<p class="choice"><a href="#sect109"></a></p>
<p class="choice"><a href="#sect308"></a></p>
I want to retrieve, in a separate List, all the nodes of a section, that is, every node up to the next <h3>.
For now I'm using:
for (int paragraph = xx; paragraph <= yy; paragraph++)
{
    nameActual = "sect" + paragraph;
    nameNext = "sect" + (paragraph + 1);
    HtmlNodeCollection NodeOfParagraph = doc.DocumentNode.SelectNodes(String.Format("//h3[a[@name='{0}']]/following-sibling::p[following::h3/a[@name='{1}']]", nameActual, nameNext));
    // Multiple actions on my NodeOfParagraph
}
So I select the <h3> that contains an <a> with the name I'm looking for, and then I select all the following <p> siblings that are themselves followed by an <h3> whose <a> has the next name.
It works, but it takes a really long time, I suppose because for each <p> node it tests all the following nodes for that name.
How can I improve the performance of my query?
You could do the following:
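var doc = new HtmlDocument();
doc.Load(@"path\to\file.html");
// Select every section heading once, up front.
var sects = doc.DocumentNode.SelectNodes("//h3[a[starts-with(@name, 'sect')]]");
for (var index = 0; index < sects.Count; index++)
{
    var isLast = (index == sects.Count - 1);
    // Relative to the current <h3>: only its following <p> siblings.
    var xpath = "./following-sibling::p";
    if (!isLast)
        // Keep only the <p> nodes whose nearest following <h3> is the next section.
        xpath += string.Format("[following-sibling::h3[1][a/@name = '{0}']]", sects[index + 1].SelectSingleNode("./a").Attributes["name"].Value);
    // Note: SelectNodes returns null when nothing matches.
    var collection = sects[index].SelectNodes(xpath);
}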
This has the following advantages:
- it uses a relative path (./), so that unnecessary parts of the document are not searched
- it tests only the first following h3 (h3[1]), so that unnecessary parts of the document are not searched
- it only looks at sibling nodes (following-sibling:: instead of following::)
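If the XPath queries are still too slow, another option is to skip the per-section queries entirely and walk the sibling chain once, collecting each <p> under the most recent <h3>. The following is only a sketch of that idea, not the approach from the answer above: the SectionGrouper class and GroupBySection method are hypothetical names, and it assumes, as in the sample from the question, that all the <h3> and <p> nodes are siblings of each other.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class SectionGrouper
{
    // Hypothetical helper: maps each section name (e.g. "sect57") to the <p>
    // nodes found between its <h3> and the next <h3>.
    // Assumption: all <h3> and <p> nodes share the same parent.
    public static Dictionary<string, List<HtmlNode>> GroupBySection(HtmlDocument doc)
    {
        var sections = new Dictionary<string, List<HtmlNode>>();

        // One XPath query to find the first section heading.
        var first = doc.DocumentNode.SelectSingleNode("//h3[a[starts-with(@name, 'sect')]]");
        List<HtmlNode> current = null;

        // Single pass over the siblings; text nodes and other tags are skipped.
        for (var node = first; node != null; node = node.NextSibling)
        {
            if (node.Name == "h3")
            {
                // A new heading closes the previous section.
                current = null;
                var anchor = node.SelectSingleNode("./a[@name]");
                if (anchor != null)
                {
                    current = new List<HtmlNode>();
                    sections[anchor.GetAttributeValue("name", "")] = current;
                }
            }
            else if (node.Name == "p" && current != null)
            {
                current.Add(node);
            }
        }

        return sections;
    }
}
```

After loading the document, var sections = SectionGrouper.GroupBySection(doc); gives direct access to a section's paragraphs through sections["sect57"], so the loop over xx..yy becomes a dictionary lookup instead of a new query per section.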