C# HtmlAgilityPack - Scraping

Question

I want to use HtmlAgilityPack to scrape content from GSMArena.com. Specifically, I want to scrape the technical specifications of cell phones.

Desired Outcome:

From http://www.gsmarena.com/nokia_lumia_520-5322.php, I want to scrape the weight, dimensions, etc.

Issue: The node path differs between just about all models.

My Question:

How would I scrape by searching? For example, if I wanted to scrape the product weight, is there a way to tell HtmlAgilityPack to search for an <a> tag, then go to the TD that follows it, and then scrape the inner text of that TD?

Tyress · Accepted Answer

XPath is your friend. Any XPath 1.0 tutorial will teach you what you need.

For that document:

   string weight = doc.DocumentNode.SelectSingleNode(@"//td[a[contains(text(),'Weight')]]/following-sibling::td").InnerText;

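This gets you the weight. SelectSingleNode returns null when nothing matches, so check the result before reading InnerText if a page might lack that row.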

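Explanation of the XPath: starting anywhere in the document (//), select each "td" element that contains an "a" element whose text contains "Weight", then select the "td" elements that follow it as siblings. SelectSingleNode returns the first of those, which is the cell holding the value.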
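To reuse the same lookup for other rows (dimensions and so on), the expression can be wrapped in a small helper. This is only a sketch: SpecScraper and GetSpec are hypothetical names, and the table markup in Main is a made-up stand-in for the real page, which may be structured differently.

```csharp
using System;
using HtmlAgilityPack;

public static class SpecScraper
{
    // Returns the trimmed text of the first <td> that follows the <td>
    // whose <a> contains the given label, or null when no such cell exists.
    // Assumption: the label contains no single quote, because it is
    // spliced directly into the XPath string.
    public static string GetSpec(HtmlDocument doc, string label)
    {
        string xpath = "//td[a[contains(text(),'" + label + "')]]/following-sibling::td";
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(xpath);
        if (cell == null)
        {
            return null; // this page has no row with that label
        }
        // Decode entities such as &nbsp; or &amp; before trimming.
        return HtmlEntity.DeEntitize(cell.InnerText).Trim();
    }

    public static void Main()
    {
        // Illustrative markup only, not copied from the real site.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<table>" +
            "<tr><td><a href=\"#\">Dimensions</a></td><td>119.9 x 64 x 9.9 mm</td></tr>" +
            "<tr><td><a href=\"#\">Weight</a></td><td> 124 g </td></tr>" +
            "</table>");

        Console.WriteLine(GetSpec(doc, "Weight"));
        Console.WriteLine(GetSpec(doc, "Battery") ?? "(not found)");
    }
}
```

For a live page, doc would typically come from new HtmlWeb().Load(url) instead of LoadHtml. Returning null for a missing row lets the caller decide how to handle models that omit a specification.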
