Mark

Reputation: 195

Dealing with different HTML tags in the same item field with Scrapy

If my HTML contains a table where columns represent fields and rows represent records, but some of the cells in the first column are plain text and some are links, how can I get them all into the correct field? The only approach I could think of was to omit the first column altogether.

For example:

<tbody>
<tr id="ps_134922471">
<td><a href="/114911935">184A Kent St</a></td>
<td class="center">House</td>
<td class="currency price">$600,000</td>
<td>Auction</td>
<td class="center bed">4</td>
<td class="date">19/10/13</td>
</tr>
<tr id="ps_134922515">
<td>5/189 Rockingham Beach Rd</td>
<td class="center">Unit</td>
<td class="currency price">$502,000</td>
<td>Normal Sale</td>
<td class="center bed">3</td>
<td class="date">10/09/13</td>
</tr>

etc... etc...

My nasty solution that omits the first column (which happens to be the street address):

def parse(self, response):
       hxs = HtmlXPathSelector(response)
       items = []
       item = SoldItem()
       types = hxs.select('//table[@id="recentSales"]/tbody/tr/td[@class="center"]/text()').extract()
       beds = hxs.select('//table[@id="recentSales"]/tbody/tr/td[@class="center bed"]/text()').extract()
       prices = hxs.select('//table[@id="recentSales"]/tbody/tr/td[@class="currency price"]/text()').extract()
       dates =  hxs.select('//table[@id="recentSales"]/tbody/tr/td[@class="date"]/text()').extract()
       for i in range(len(types)):
           item['type'] = types[i]
           item['bed'] = beds[i]
           item['price'] = prices[i]
           item['saledate'] = dates[i]
           items.append(item)
       return items
       pass

Any help appreciated. Thanks

Upvotes: 1

Views: 147

Answers (1)

paul trmbrth

Reputation: 20748

I suggest you loop over the table rows (the tr elements). hxs.select() returns a list of selectors, and on each of them you can call .select() again with an XPath expression that is relative to that row.

To get the text content of the first td cell of each row, whether or not it contains a nested link, you can use the .//text() pattern. It extracts all descendant text nodes, not only the direct children (which is what ./text() gives you).
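To make the difference concrete, here is a small sketch that does not use Scrapy at all. It swaps in the standard library's xml.etree.ElementTree, where cell.text roughly plays the role of ./text() (only the text before the first child element) and itertext() plays the role of .//text(). The sample markup is a trimmed copy of the rows in the question, and the helper names direct_text and descendant_text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two rows shaped like the ones in the question: one address is wrapped
# in a link, the other is plain text.
SAMPLE = """
<table id="recentSales"><tbody>
<tr><td><a href="/114911935">184A Kent St</a></td><td class="center">House</td></tr>
<tr><td>5/189 Rockingham Beach Rd</td><td class="center">Unit</td></tr>
</tbody></table>
"""


def direct_text(cell):
    """Text sitting directly inside the cell, before any child element."""
    return (cell.text or "").strip()


def descendant_text(cell):
    """All text under the cell at any depth, joined into one string."""
    return "".join(cell.itertext()).strip()


root = ET.fromstring(SAMPLE.strip())
# First td of every row, i.e. the address column.
first_cells = [row.find("td") for row in root.iter("tr")]

direct = [direct_text(c) for c in first_cells]
nested = [descendant_text(c) for c in first_cells]

print(direct)
print(nested)
```

The direct lookup comes back empty for the linked address, which is why the first column is awkward to scrape with /text() alone, while the descendant lookup returns both addresses.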

Also, you need to instantiate a new SoldItem() on each iteration of the loop. Try something like this:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    items = []

    # One selector per table row
    rows = hxs.select('//table[@id="recentSales"]/tbody/tr')
    for row in rows:
        # New item for every row, so rows do not overwrite each other
        item = SoldItem()
        # .//text() style: picks up the address whether or not it is inside a link
        item['address'] = row.select('td[1]//text()').extract()
        item['saletype'] = row.select('td[4]/text()').extract()
        item['type'] = row.select('td[@class="center"]/text()').extract()
        item['bed'] = row.select('td[@class="center bed"]/text()').extract()
        item['price'] = row.select('td[@class="currency price"]/text()').extract()
        item['saledate'] = row.select('td[@class="date"]/text()').extract()
        items.append(item)
    return items
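
On the point about creating the item inside the loop: the sketch below shows what goes wrong otherwise. It uses plain dicts as a stand-in for SoldItem (an assumption on my part, but any mutable object appended repeatedly behaves the same way), and the function names build_shared and build_fresh are hypothetical.

```python
def build_shared(types):
    """Buggy pattern: one dict created before the loop and appended repeatedly."""
    record = {}
    records = []
    for t in types:
        record["type"] = t
        records.append(record)  # appends the same object every time
    return records


def build_fresh(types):
    """Fixed pattern: a new dict is created on every iteration."""
    records = []
    for t in types:
        record = {"type": t}
        records.append(record)
    return records


shared = build_shared(["House", "Unit"])
fresh = build_fresh(["House", "Unit"])

print(shared)
print(fresh)
```

With the shared object, every entry in the list ends up holding the values of the last row, because the list contains several references to one object rather than several separate records.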

Upvotes: 1
