Reputation: 109

Parsing HTML with XPath, Python and Scrapy

I am writing a Scrapy program to extract the data.

This is the URL, and I want to scrape the 20111028013117 (code) information. I took the XPath from the Firefox add-on XPather. This is the path:

/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]

When I try to execute this:

try:
    temp_list = hxs.select("/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]").extract()
    print "temp_list:" + str(temp_list)
except:
    print "error"

It returns an empty list. I have been struggling to find an answer to this for the last 4 hours. I am a newbie to Scrapy; even though I handled issues well in other projects, this one seems to be a bit difficult.

Upvotes: 3

Answers (4)

Sjaak Trekhaak

Reputation: 4966

You can extract the data more easily by writing a more robust XPath instead of taking the direct output from XPather.

For the data you are matching, this XPath would do a lot better:

//font[contains(text(),'Code')]/parent::td/following-sibling::td/font/text()

This will match the <font> tag containing "Code", then go to the td tag above it and select the next td -> font, which contains the code you are looking for.
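
As a rough illustration of that expression, here is a minimal sketch that evaluates it with lxml directly rather than through Scrapy's selector. The SAMPLE_HTML markup and the extract_code helper are assumptions made up for the example; the real page's layout may differ.

```python
from lxml import html

# Hypothetical markup: a label cell followed by a value cell, no <tbody>.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><td><font>Name</font></td><td><font>Example</font></td></tr>
  <tr><td><font>Code</font></td><td><font>20111028013117</font></td></tr>
</table>
</body></html>
"""

# The label-anchored XPath from the answer, split only for readability.
CODE_XPATH = ("//font[contains(text(),'Code')]"
              "/parent::td/following-sibling::td/font/text()")


def extract_code(page_source):
    """Return the text nodes matched by the label-anchored XPath."""
    doc = html.fromstring(page_source)
    return doc.xpath(CODE_XPATH)


codes = extract_code(SAMPLE_HTML)
```

In the spider from the question, the same XPath string would be passed to hxs.select(...) in place of the long absolute path.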

Upvotes: 2

matiskay

Reputation: 267

Your XPath doesn't work because of tbody. Remove it from the expression and check whether you get the result you want.

You can read about this in the Scrapy documentation: http://doc.scrapy.org/en/0.14/topics/firefox.html

Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won’t be able to extract any data if you use <tbody> in your XPath expressions.
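
One way to apply this to a path copied from the browser is to strip the tbody steps mechanically. The sketch below uses only the standard library; strip_tbody is a hypothetical helper name, and it assumes the source HTML really has no <tbody> elements (if the page does contain them, removing the steps would break the path instead).

```python
import re


def strip_tbody(xpath):
    """Remove every /tbody step (with an optional [n] index) from an XPath."""
    # The lookahead keeps names such as /tbodyfoo from being matched.
    return re.sub(r"/tbody(\[\d+\])?(?=/|$)", "", xpath)


# Shortened stand-in for the browser-generated path in the question.
browser_xpath = "/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td"
scrapy_xpath = strip_tbody(browser_xpath)
```

The cleaned string can then be passed to hxs.select(...) as in the question.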

Upvotes: 9

warvariuc

Reputation: 59674

I see that the element you are hunting for is inside a <table>.

Firefox adds a tbody tag to every table, even if it does not exist in the source HTML. That might be the reason your XPath query works in the browser but fails in Scrapy.

As suggested, use other anchors in your xpath query.
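
To confirm that diagnosis, you could check whether the raw page source contains any tbody tag at all before blaming the XPath. This is a minimal Python 3 sketch using the standard library's html.parser; TagCollector and source_has_tbody are hypothetical names, and raw_html stands for the response body as downloaded, not the DOM shown in the browser.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every start tag seen in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased.
        self.tags.add(tag)


def source_has_tbody(raw_html):
    """Return True if the downloaded HTML itself contains a <tbody> tag."""
    collector = TagCollector()
    collector.feed(raw_html)
    return "tbody" in collector.tags


# Hypothetical source as served: a table written without <tbody>.
raw_html = "<table><tr><td>Code</td><td>20111028013117</td></tr></table>"
has_tbody = source_has_tbody(raw_html)
```

If this returns False for the page while the browser-generated path contains tbody, the extra steps came from the browser and not from the server.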

Upvotes: 3

halfer

Reputation: 20467

Have you tried removing a few node tags at the end of the query, and re-running until you get a result? Do this several times until you get something, then add items back in cautiously until the query is rectified.
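
That trimming loop can be automated. The following is a sketch under stated assumptions: it uses lxml rather than Scrapy's selector, longest_matching_prefix is a hypothetical helper, and it splits the path naively on "/", so it only suits simple absolute paths like the one in the question (no "//" and no "/" inside predicates).

```python
from lxml import html


def longest_matching_prefix(doc, xpath):
    """Drop trailing steps until the XPath matches; return that prefix or None."""
    steps = xpath.strip("/").split("/")
    for end in range(len(steps), 0, -1):
        candidate = "/" + "/".join(steps[:end])
        if doc.xpath(candidate):
            return candidate
    return None


# Hypothetical page source without <tbody>, queried with a browser-style path.
SAMPLE_HTML = (
    "<html><body><table><tr><td>a</td><td>b</td></tr></table></body></html>"
)
doc = html.fromstring(SAMPLE_HTML)
prefix = longest_matching_prefix(doc, "/html/body/table/tbody/tr/td[2]")
```

The first step missing from the returned prefix is the one that fails to match, which is where to start correcting the query.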

Also, check that your target page validates as XHTML - an invalid page would probably upset the parser.

Upvotes: 1
