Reputation: 25
I'm trying to use Scrapy to extract information from an HTML table and store it in a database. The information is spread across rows, and there is no way to distinguish one record from another. (The site I'm crawling is http://www.ets.gr/frontoffice/portal.asp?cpage=NODE&cnode=12.)
How can I loop over every row of the table and get the information in the form of:
Record1: tr[1] and tr[2] (skip tr[3])
Record2: tr[4] and tr[5] (skip tr[6])
Record3: tr[7] and tr[8] (skip tr[9])
and so on?
The nodes I'm selecting to loop over are:
nodes = hxs.xpath("//table/tr/td/table/tr/td/table/tr/td/table/tr")
Upvotes: 0
Views: 605
Reputation: 38732
Constructing these results is not possible using XPath 1.0 alone (and that's all Scrapy supports). You will have to use Python code for that, after pulling the rows out using XPath.
If you want to omit the third/sixth/... row from the start, use position() and modulo:
//table/tr/td/table/tr/td/table/tr/td/table/tr[(position() mod 3) != 0]
Alternatively, use the @valign attribute, as metaphy proposed:
//table/tr/td/table/tr/td/table/tr/td/table/tr[@valign = 'top']
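For the Python side, here is a minimal sketch of the grouping step, assuming the rows really do come in fixed groups of three (two data rows followed by one row to skip), as described in the question. The function name group_records and its parameters are made up for illustration, and plain strings stand in for the selected tr nodes; with Scrapy you would pass the list returned by hxs.xpath(...) using the unfiltered row expression.

```python
def group_records(rows, group_size=3, keep=2):
    """Split rows into consecutive groups of `group_size` and keep
    only the first `keep` rows of each group as one record."""
    rows = list(rows)  # works for any iterable of row nodes
    records = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + keep]
        # Ignore a trailing incomplete record, if any
        if len(chunk) == keep:
            records.append(tuple(chunk))
    return records


# Stand-in values for the <tr> nodes selected by the XPath expression
rows = ["tr1", "tr2", "tr3", "tr4", "tr5", "tr6", "tr7", "tr8", "tr9"]
records = group_records(rows)

for first_row, second_row in records:
    print(first_row, second_row)
```

If you select the rows with one of the filtering expressions above instead, every third row is already gone, so you would call group_records(rows, group_size=2, keep=2) to pair the remaining rows.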
Upvotes: 2