Reputation: 109
I am writing a Scrapy program to extract the data.
This is the url, and I want to scrape 20111028013117
(code) information. I have taken XPath from FireFox add-on XPather. This is the path:
/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]
While I am trying to execute this
try:
temp_list = hxs.select("/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]").extract()
print "temp_list:" + str(temp_list)
except:
print "error"
It returns an empty list, I am struggling to find out an answer for this from the last 4 hours. I am a newbie to scrapy eventhough I handled issues very well for other projects, but it seems to be a bit difficult.
Upvotes: 3
Views: 3159
Reputation: 4966
You can extract data with more ease using more robust XPaths instead of taking the direct output from XPather.
For the data you are matching, this XPath would do a lot better:
//font[contains(text(),'Code')]/parent::td/following-sibling::td/font/text()
This will match the <font>
tag containing "Code", then go to the td
tag above it and select the next td
-> font
, which contains the code you are looking for.
Upvotes: 2
Reputation: 267
The reason of why your xpath doesn't work is becuase of tbody
. You have to remove it and check if you get that result that you want.
You can read this in scrapy documentation: http://doc.scrapy.org/en/0.14/topics/firefox.html
Firefox, in particular, is known for adding
<tbody>
elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won’t be able to extract any data if you use<tbody>
in your XPath expressions.
Upvotes: 9
Reputation: 59574
I see that the element you are hunting for is inside a <table>
.
Firefox adds tbody
tag for every table, even if it does not exists in source HTML code.
That's might be the reason, that your xpath query works in the browser, but fails in Scrapy.
As suggested, use other anchors in your xpath query.
Upvotes: 3
Reputation: 20420
Have you tried removing a few node tags at the end of the query, and re-running until you get a result? Do this several times until you get something, then add items back in cautiously until the query is rectified.
Also, check that your target page validates as XHTML - an invalid page would probably upset the parser.
Upvotes: 1