Reputation: 491
I have built a small crawler, and while testing it I found that on certain sites it uses 98-99% CPU.
I used dotTrace to see what the problem could be, and it pointed me towards my HttpWebRequest method. I optimised it a bit with the help of some previous questions here on Stack Overflow, but the problem was still there.
I then checked which URLs were causing the CPU load and found that they were pages that are extremely large in size - go figure :) So now I am 99% certain it has to do with the following piece of code:
HtmlAgilityPack.HtmlDocument documentt = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlNodeCollection list;
HtmlAgilityPack.HtmlNodeCollection frameList;
documentt.LoadHtml(_html);
list = documentt.DocumentNode.SelectNodes(".//a[@href]");
All I want to do is extract the links on the page. For large pages, is there any way I can get this to use less CPU?
I was thinking of maybe limiting what I fetch. What would be my best option here?
Certainly someone must have run into this problem before :)
Upvotes: 2
Views: 3720
Reputation: 24334
If you aren't heavily invested in Html Agility Pack, try using CsQuery instead. It builds an index when parsing the document, and its selectors are much faster than Html Agility Pack's. See a comparison.
CsQuery is a .NET jQuery port with a full CSS selector engine; it lets you use CSS selectors as well as the jQuery API to access and manipulate HTML. It's on NuGet as CsQuery.
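As a rough sketch of what link extraction might look like with CsQuery (the class and method names here are made up for illustration, and it assumes the CsQuery NuGet package is referenced):

```csharp
using System.Collections.Generic;
using CsQuery;

// Hypothetical helper, not part of CsQuery itself.
public static class CsQueryLinkExtractor
{
    public static List<string> ExtractLinks(string html)
    {
        // Parse the markup; CsQuery builds its selector index at this point.
        CQ dom = CQ.Create(html);

        var links = new List<string>();

        // "a[href]" is a CSS selector: anchors that carry an href attribute.
        foreach (IDomObject anchor in dom["a[href]"])
        {
            links.Add(anchor.GetAttribute("href"));
        }

        return links;
    }
}
```

Calling CsQueryLinkExtractor.ExtractLinks(_html) would then stand in for the LoadHtml/SelectNodes pair in the question. Whether it is actually faster on your pages is something to confirm with the profiler.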
Upvotes: 1
Reputation: 12768
Have you tried dropping the XPath and using the LINQ functionality?
var list = documentt.DocumentNode.Descendants("a").Select(n => n.GetAttributeValue("href", string.Empty)).ToList();
That'll pull the href attribute of every anchor tag into a List<string>. Anchors without an href yield an empty string.
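Fleshed out a little, the LINQ approach could look like the sketch below. The LinkExtractor class and ExtractLinks method are names invented for this example, and filtering out empty values is an extra step added here to mimic the [@href] condition from the original XPath.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper wrapping the one-liner above.
public static class LinkExtractor
{
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("a")                                       // all <a> elements in the document
            .Select(a => a.GetAttributeValue("href", string.Empty)) // empty string when href is missing
            .Where(href => !string.IsNullOrEmpty(href))             // skip anchors without a usable href
            .ToList();
    }
}
```

Usage would be something like var links = LinkExtractor.ExtractLinks(_html); with _html being the page source from the question.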
Upvotes: 1
Reputation: 100547
".//a[@href]" is an extremely slow XPath. Try replacing it with "//a[@href]", or with code that simply walks the whole document and checks all A nodes.
Why this XPath is slow:
Portions 1 and 2 together end up meaning "for every node, select all its descendant nodes", which is very slow.
Upvotes: 0