Reputation: 13
I am new to jython and scrapy, but I am impressed by the capabilities that it has. My question is: what is the best way to extract data when the XPaths are the same?
For example:
<tr>
<td>
<a href="/user/Bob">Bob Job</a>
</td>
<td>hi</td>
<td>280.0</td>
</tr>
I need to scrape the information from all 3 td fields. I used Firebug to copy the XPath, which it reports as
/html/body/table[2]/tbody/tr/td[2]/div/table/tbody/tr[2]/td[3]
What is the best way to extract data when the XPaths are the same? I may only need data from td[1] and td[3].
Upvotes: 1
Views: 1426
Reputation: 18529
You will have to identify a criterion for extracting each value and put the values in their respective item fields, e.g.
link = hxs.select('//td/a/@href').extract()[0]
linktext = hxs.select('//td/a/text()').extract()[0]
number = hxs.select('//td').re(r'\d+\.\d+')
Upvotes: 1
Reputation: 53850
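Firebug's "Copy XPath" output isn't always optimal. For one thing, browsers add <tbody> elements to the DOM that may not exist in the raw HTML your spider downloads, so a copied path containing tbody can fail to match.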
When scraping tables, first find a way to iterate over the <tr> rows, for example with //table[@id='results']/tr, then run a second query relative to each row to grab the td cells you need, such as ./td[1] and ./td[3].
It's simpler that way.
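The row-by-row idea can be sketched as below. To keep it runnable without installing anything, this uses the standard library's xml.etree.ElementTree as a stand-in for scrapy's selectors (it only works here because the sample markup is well-formed). The id="results" attribute, the second table row, and the parse_rows name are assumptions made up for illustration, not part of the question's page.

```python
import xml.etree.ElementTree as ET

# Sample markup modelled on the question. The id="results" attribute and
# the second row are assumptions added for illustration.
HTML = """
<table id="results">
  <tr>
    <td><a href="/user/Bob">Bob Job</a></td>
    <td>hi</td>
    <td>280.0</td>
  </tr>
  <tr>
    <td><a href="/user/Ann">Ann Lee</a></td>
    <td>hello</td>
    <td>12.5</td>
  </tr>
</table>
"""


def parse_rows(markup):
    """Return one dict per <tr>, built from the 1st and 3rd <td> only."""
    root = ET.fromstring(markup.strip())
    rows = []
    # Step 1: iterate over the rows.
    for tr in root.findall(".//tr"):
        # Step 2: query relative to the current row, by cell position.
        link = tr.find("td[1]/a")
        amount = tr.find("td[3]")
        if link is None or amount is None:
            continue  # skip header or malformed rows
        rows.append({
            "href": link.get("href"),
            "name": link.text.strip(),
            "amount": float(amount.text),
        })
    return rows


rows = parse_rows(HTML)
for row in rows:
    print(row)
```

In a spider the same two steps should carry over: select the rows once, then call select() on each row with a relative path (./td[1], ./td[3]) instead of an absolute one.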
Upvotes: 0