Reputation: 195
If my HTML contains a table where columns represent fields and rows represent records, but some of the cells in the first column are plain text and some are links, how can I get these to all go into the correct field? I could only think how to do it by omitting the first column altogether.
E.g.:
<tbody>
  <tr id="ps_134922471">
    <td><a href="/114911935">184A Kent St</a></td>
    <td class="center">House</td>
    <td class="currency price">$600,000</td>
    <td>Auction</td>
    <td class="center bed">4</td>
    <td class="date">19/10/13</td>
  </tr>
  <tr id="ps_134922515">
    <td>5/189 Rockingham Beach Rd</td>
    <td class="center">Unit</td>
    <td class="currency price">$502,000</td>
    <td>Normal Sale</td>
    <td class="center bed">3</td>
    <td class="date">10/09/13</td>
  </tr>
etc... etc...
My nasty solution that omits the first column (which happens to be the street address):
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    item = SoldItem()
    types = hxs.select('//table[@id="recentSales"]/tbody/tr/td[@class="center"]/text()').extract()
    beds = hxs.select('//table[@id="recentSales"]/tbody/tr/td[@class="center bed"]/text()').extract()
    prices = hxs.select('//table[@id="recentSales"]/tbody/tr/td[@class="currency price"]/text()').extract()
    dates = hxs.select('//table[@id="recentSales"]/tbody/tr/td[@class="date"]/text()').extract()
    for i in range(len(types)):
        item['type'] = types[i]
        item['bed'] = beds[i]
        item['price'] = prices[i]
        item['saledate'] = dates[i]
        items.append(item)
    return items
    pass
Any help appreciated. Thanks
Upvotes: 1
Views: 147
Reputation: 20748
I suggest you loop over the table's tr row elements. hxs.select() returns a list of selectors, on which you can keep calling .select() with XPath expressions relative to each row.
To get the text content of the first td cell of each row, whether or not there is a nested link, you can use the .//text() pattern, which extracts all descendant text nodes and not only direct children (as ./text() does).
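To see the difference between direct and descendant text outside of Scrapy, here is a minimal sketch using the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors. The helper names direct_text and descendant_text are made up for illustration, and ElementTree only approximates what the two XPath patterns return:

```python
import xml.etree.ElementTree as ET

# Two first-column cells from the question: one wraps the address in a link, one does not.
linked = ET.fromstring('<td><a href="/114911935">184A Kent St</a></td>')
plain = ET.fromstring('<td>5/189 Rockingham Beach Rd</td>')


def direct_text(cell):
    # Approximates ./text(): only the text sitting directly inside <td>
    # before any child element (hypothetical helper, not a Scrapy API).
    return cell.text or ""


def descendant_text(cell):
    # Approximates .//text(): text of the cell and of everything nested in it.
    return "".join(cell.itertext())


print(repr(direct_text(linked)))      # ''
print(repr(descendant_text(linked)))  # '184A Kent St'
print(repr(descendant_text(plain)))   # '5/189 Rockingham Beach Rd'
```

The linked cell yields nothing as direct text, which is why a td/text() expression silently skips those rows.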
Also, you need to instantiate a SoldItem() on each iteration of the loop; otherwise every entry in items refers to the same object.
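A small sketch of why that matters, using plain dicts in place of SoldItem (an assumption made only to keep the example self-contained; the function names are hypothetical):

```python
def collect_shared(values):
    # One object created before the loop: every append stores the same dict.
    item = {}
    items = []
    for value in values:
        item['bed'] = value
        items.append(item)
    return items


def collect_fresh(values):
    # A new object per iteration: each row keeps its own data.
    items = []
    for value in values:
        item = {}
        item['bed'] = value
        items.append(item)
    return items


print(collect_shared(['4', '3']))  # [{'bed': '3'}, {'bed': '3'}]
print(collect_fresh(['4', '3']))   # [{'bed': '4'}, {'bed': '3'}]
```

With the shared object, the last row overwrites all earlier ones.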
Try something like this:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    # One selector per table row; the XPaths below are relative to each row.
    rows = hxs.select('//table[@id="recentSales"]/tbody/tr')
    for row in rows:
        # A fresh item per row, so rows do not overwrite each other.
        item = SoldItem()
        # //text() also picks up text nested inside a link.
        # Note that extract() returns a list of strings for each field.
        item['address'] = row.select('td[1]//text()').extract()
        item['saletype'] = row.select('td[4]/text()').extract()
        item['type'] = row.select('td[@class="center"]/text()').extract()
        item['bed'] = row.select('td[@class="center bed"]/text()').extract()
        item['price'] = row.select('td[@class="currency price"]/text()').extract()
        item['saledate'] = row.select('td[@class="date"]/text()').extract()
        items.append(item)
    return items
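If you want to try the row-by-row idea without a Scrapy project, here is a rough standard-library sketch of the same loop. It uses xml.etree.ElementTree instead of Scrapy selectors, assumes the rows sit inside a table with id "recentSales" (taken from the XPath in the question), and returns plain dicts; cell_text and parse_rows are hypothetical names:

```python
import xml.etree.ElementTree as ET

# The two rows from the question, wrapped in the table the XPath refers to (assumed).
SAMPLE = """
<table id="recentSales"><tbody>
  <tr id="ps_134922471">
    <td><a href="/114911935">184A Kent St</a></td>
    <td class="center">House</td>
    <td class="currency price">$600,000</td>
    <td>Auction</td>
    <td class="center bed">4</td>
    <td class="date">19/10/13</td>
  </tr>
  <tr id="ps_134922515">
    <td>5/189 Rockingham Beach Rd</td>
    <td class="center">Unit</td>
    <td class="currency price">$502,000</td>
    <td>Normal Sale</td>
    <td class="center bed">3</td>
    <td class="date">10/09/13</td>
  </tr>
</tbody></table>
"""


def cell_text(row, path):
    # Find one cell relative to the row and join all of its nested text.
    cell = row.find(path)
    if cell is None:
        return None
    return "".join(cell.itertext()).strip()


def parse_rows(markup):
    table = ET.fromstring(markup.strip())
    records = []
    for row in table.findall("./tbody/tr"):
        # A new dict per row keeps each record separate.
        records.append({
            "address": cell_text(row, "td[1]"),
            "type": cell_text(row, "td[@class='center']"),
            "price": cell_text(row, "td[@class='currency price']"),
            "saletype": cell_text(row, "td[4]"),
            "bed": cell_text(row, "td[@class='center bed']"),
            "saledate": cell_text(row, "td[@class='date']"),
        })
    return records


records = parse_rows(SAMPLE)
print([r["address"] for r in records])
# ['184A Kent St', '5/189 Rockingham Beach Rd']
```

Because every field is looked up relative to its own row, a missing or differently shaped cell cannot shift the other columns out of alignment, which is the risk with four separate page-wide lists.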
Upvotes: 1