Dealing with different HTML tags in the same item field with Scrapy

Question

If my HTML contains a table where columns represent fields and rows represent records, but some of the cells in the first column are plain text and some are links, how can I get all of them into the correct field? I could only think of how to do it by omitting the first column altogether.

eg:



184A Kent St
House
$600,000
Auction
4
19/10/13


5/189 Rockingham Beach Rd
Unit
$502,000
Normal Sale
3
10/09/13

etc... etc...

My nasty solution that omits the first column (which happens to be the street address):

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    item = SoldItem()
    types = hxs.select('//table[@id="recentSales"]/tbody/tr/td[@class="center"]/text()').extract()
    beds = hxs.select('//table[@id="recentSales"]/tbody/tr/td[@class="center bed"]/text()').extract()
    prices = hxs.select('//table[@id="recentSales"]/tbody/tr/td[@class="currency price"]/text()').extract()
    dates = hxs.select('//table[@id="recentSales"]/tbody/tr/td[@class="date"]/text()').extract()
    for i in range(len(types)):
        item['type'] = types[i]
        item['bed'] = beds[i]
        item['price'] = prices[i]
        item['saledate'] = dates[i]
        items.append(item)
    return items

Any help appreciated. Thanks

paul trmbrth · Accepted Answer

I suggest you loop over the table rows (the tr elements). hxs.select() returns a list of selectors, and you can call .select() on each of them with XPath expressions that are relative to that row.

To get the text content of the first td cell of each row, whether or not there's a nested link, use the .//text() pattern, which extracts all descendant text nodes, not only direct children (as ./text() does).
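The difference between direct-child text and descendant text can be seen without Scrapy. The following is only a sketch: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors (td.text playing the role of ./text() and td.itertext() the role of .//text()), and the SAMPLE markup is a hypothetical reconstruction, since the real page's HTML is not shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the question: the first cell of row one
# wraps the address in a link, the first cell of row two is plain text.
SAMPLE = """
<table id="recentSales">
  <tbody>
    <tr><td><a href="/p/1">184A Kent St</a></td><td class="center">House</td></tr>
    <tr><td>5/189 Rockingham Beach Rd</td><td class="center">Unit</td></tr>
  </tbody>
</table>
""".strip()

root = ET.fromstring(SAMPLE)
first_cells = [tr.find("td") for tr in root.iter("tr")]

# Analogue of ./text(): only text that sits directly inside the cell.
direct = [(td.text or "").strip() for td in first_cells]

# Analogue of .//text(): text from the cell and all of its descendants.
descendant = ["".join(td.itertext()).strip() for td in first_cells]

print(direct)
print(descendant)
```

With direct text only, the linked address comes back empty, which is why the descendant form is the one that fills the address field for every row.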

Also, you need to instantiate a new SoldItem() on each iteration of the loop. Try something like this:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    items = []

    # One selector per table row; the XPaths below are relative to each row.
    rows = hxs.select('//table[@id="recentSales"]/tbody/tr')
    for row in rows:
        # A fresh item per row, so rows do not overwrite each other.
        item = SoldItem()
        # //text() collects descendant text, so linked and plain cells both match.
        item['address'] = row.select('td[1]//text()').extract()
        item['saletype'] = row.select('td[4]/text()').extract()
        item['type'] = row.select('td[@class="center"]/text()').extract()
        item['bed'] = row.select('td[@class="center bed"]/text()').extract()
        item['price'] = row.select('td[@class="currency price"]/text()').extract()
        item['saledate'] = row.select('td[@class="date"]/text()').extract()
        items.append(item)
    return items
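
Note that .extract() returns a list of strings, so each field above holds a list, and td[1]//text() may yield several fragments, including whitespace-only ones. If you would rather store one clean string per field, a small helper along these lines might do. This is a sketch only; clean_text is a hypothetical name, not part of Scrapy.

```python
def clean_text(fragments):
    """Join extracted text fragments into one whitespace-normalised string."""
    # Join the fragments first, then collapse any run of whitespace
    # (newlines, tabs, repeated spaces) into a single space.
    return " ".join(" ".join(fragments).split())


print(clean_text(["\n  ", "184A Kent St", "\n"]))
print(clean_text([]))
```

It could then be applied per field, for example item['address'] = clean_text(row.select('td[1]//text()').extract()).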
