Jacqueline

Reputation: 491

HtmlAgilityPack and large HTML Documents

I have built a little crawler, and when trying it out I found that crawling certain sites pushes my crawler's CPU usage to 98-99%.

I used dotTrace to see what the problem could be, and it pointed me towards my HttpWebRequest method. I optimised it a bit with the help of some previous questions here on Stack Overflow, but the problem was still there.

I then checked which URLs were causing the CPU load and found that it was actually sites that are extremely large in size. Go figure :) So now I am 99% certain it has to do with the following piece of code:

HtmlAgilityPack.HtmlDocument documentt = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlNodeCollection list;
HtmlAgilityPack.HtmlNodeCollection frameList;

// Parse the fetched page and select every anchor that has an href attribute.
documentt.LoadHtml(_html);
list = documentt.DocumentNode.SelectNodes(".//a[@href]");

All I want to do is extract the links on the page, so for large sites, is there any way I can get this to not use so much CPU?

I was thinking maybe I should limit what I fetch? What would be my best option here?

Certainly someone must have run into this problem before :)

Upvotes: 2

Views: 3720

Answers (3)

Jamie Treworgy

Reputation: 24334

If you aren't heavily invested in Html Agility Pack, try using CsQuery instead. It builds an index when parsing documents, and its selectors are much faster than Html Agility Pack's. See a comparison.

CsQuery is a .NET jQuery port with a full CSS selector engine; it lets you use CSS selectors as well as the jQuery API to access and manipulate HTML. It's on NuGet as CsQuery.
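For the link-extraction case in the question, a minimal sketch might look like this (assuming the page source is already in the _html string, as in the question's code):

using System.Collections.Generic;
using System.Linq;
using CsQuery;

// Parse once; CsQuery indexes the document, so the selector below is cheap.
CQ dom = CQ.Create(_html);

// Same query as the XPath in the question: every anchor with an href attribute.
List<string> hrefs = dom["a[href]"]
    .Select(node => node.GetAttribute("href"))
    .ToList();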

Upvotes: 1

Jacob Proffitt

Reputation: 12768

Have you tried dropping the XPath and using the LINQ functionality?

var list = documentt.DocumentNode.Descendants("a").Select(n => n.GetAttributeValue("href", string.Empty)).ToList();

That'll pull the href attributes of all anchor tags as a List<string>.

Upvotes: 1

Alexei Levenkov

Reputation: 100547

".//a[@href]" is extremely slow XPath. Tried to replace with "//a[@href]" or with code that simply walks whole document and checks all A nodes.

Why this XPath is slow:

  1. "." - start from the current node
  2. "//" - select all descendant nodes
  3. "a" - keep only "a" nodes
  4. "[@href]" - keep only those with an href attribute

Parts 1 and 2 combine into "for every node, select all of its descendant nodes", which is very slow.
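A minimal sketch of both alternatives, assuming the same documentt variable from the question (plus a using System.Collections.Generic; directive for the List):

// Alternative 1: anchor the XPath at the document root rather than the current node.
// Note that SelectNodes returns null when nothing matches.
var nodes = documentt.DocumentNode.SelectNodes("//a[@href]");

// Alternative 2: skip XPath entirely and walk the document once.
var hrefs = new List<string>();
foreach (var a in documentt.DocumentNode.Descendants("a"))
{
    var href = a.GetAttributeValue("href", string.Empty);
    if (href.Length > 0)
        hrefs.Add(href);
}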

Upvotes: 0
