Reputation: 764
I want to use HtmlAgilityPack to scrape content from GSMArena.com. Specifically, I want to scrape the technical specifications of cell phones.
Desired Outcome:
From http://www.gsmarena.com/nokia_lumia_520-5322.php I want to scrape the weight, dimensions, etc.
Issue: The node path differs for just about every model.
My Question:
How would I scrape by searching? For example, if I wanted to scrape the product weight, is there a way to tell HtmlAgilityPack to search for an <a> tag containing the text "Weight", then go to the TD that follows it, and scrape the inner text of that TD?
Upvotes: 0
Views: 459
Reputation: 3653
XPath is your friend. If you haven't used it before, any XPath 1.0 tutorial will get you started.
For that document:
string weight = doc.DocumentNode.SelectSingleNode(@"//td[a[contains(text(),'Weight')]]/following-sibling::td").InnerText;
Will get you the weight.
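Explanation of the XPath: // searches the whole document; td[a[contains(text(),'Weight')]] matches any "td" element that has a child "a" element whose text contains "Weight"; /following-sibling::td then selects the "td" elements that come after it in the same row. SelectSingleNode returns the first of those, which is the cell holding the value.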
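If you need more than one field, the same idea can be wrapped in a small helper. The following is only a sketch: SpecScraper and GetSpec are names made up for illustration, and the HTML is a simplified stand-in with sample values that assumes the label-cell/value-cell layout described above, not the real page source. SelectSingleNode returns null when nothing matches, so the helper checks for that instead of calling InnerText directly.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class SpecScraper
{
    // Hypothetical helper: returns the text of the first cell that follows
    // the cell whose link text contains `label`, or null if there is no match.
    // Assumes `label` has no single quote, since it is spliced into the XPath.
    public static string GetSpec(HtmlDocument doc, string label)
    {
        string xpath = $"//td[a[contains(text(),'{label}')]]/following-sibling::td[1]";
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(xpath);
        return cell == null ? null : HtmlEntity.DeEntitize(cell.InnerText).Trim();
    }

    public static void Main()
    {
        // Assumed, simplified markup with sample values (not the real page).
        const string html = @"<table>
  <tr><td><a href='#'>Dimensions</a></td><td>119.9 x 64 x 9.9 mm</td></tr>
  <tr><td><a href='#'>Weight</a></td><td>124 g</td></tr>
</table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        Console.WriteLine(GetSpec(doc, "Weight") ?? "(not found)");
        Console.WriteLine(GetSpec(doc, "Battery") ?? "(not found)");
    }
}
```

For a live page you would load the document with new HtmlWeb().Load(url) instead of LoadHtml, and it is worth checking the actual markup first, since the real table may not match the assumed layout.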
Upvotes: 2