How do I reduce the number of try/catch statements here?

Question

I'm currently using Scrapy to pull company information from a website. However, the amount of data varies widely from page to page: one company lists three of its team members while another lists only two, or one company lists where it's located while another doesn't. As a result, some XPaths match nothing and extract() returns an empty list, so indexing it with [0] raises an IndexError:

try:
    item['industry'] = hxs.xpath('//*[@id="overview"]/div[2]/div[2]/p/text()[2]').extract()[0]
except IndexError:
    item['industry'] = "None provided"
try:
    item['URL'] = hxs.xpath('//*[@id="ContentPlaceHolder_lnkWebsite"]/text()').extract()[0]
except IndexError:
    item['URL'] = "None provided"
try:
    item['desc'] = hxs.xpath('//*[@id="overview"]/div[2]/div[4]/p/text()[1]').extract()[0]
except IndexError:
    item['desc'] = "None provided"
try:
    item['founded'] = hxs.xpath('//*[@id="ContentPlaceHolder_updSummary"]/div/div[2]/table/tbody/tr/td[1]/text()').extract()[0]
except IndexError:
    item['founded'] = "None provided"

My code uses many try/except blocks. Since each exception handler is specific to the field I am trying to populate, is there a cleaner way to handle this?

alecxe · Accepted Answer

Use the TakeFirst() output processor. From the Scrapy documentation:

Returns the first non-null/non-empty value from the values received, so it’s typically used as an output processor to single-valued fields.

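# Old import path; newer Scrapy releases provide TakeFirst in itemloaders.processors
from scrapy.contrib.loader.processor import TakeFirst
from scrapy.item import Item, Field

class MyItem(Item):
    # Keep only the first non-empty extracted value for this field
    industry = Field(output_processor=TakeFirst())
    ...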
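To make the quoted description concrete, here is a plain-Python sketch of the take-first behaviour. It is an illustration of the idea only, not Scrapy's source, and take_first is a hypothetical name:

```python
def take_first(values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None  # nothing usable was found, so no exception is raised


print(take_first(["", None, "Software"]))  # Software
print(take_first([]))                      # None
```

The point is that an empty result list produces None instead of an IndexError, which is what removes the need for a try/except around every field.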

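Output processors declared on a Field are applied by an item loader, so populate the item through an ItemLoader inside the spider; no try/catch is needed:

from scrapy.contrib.loader import ItemLoader

loader = ItemLoader(item=MyItem())
loader.add_value('industry', hxs.xpath('//*[@id="overview"]/div[2]/div[2]/p/text()[2]').extract())
item = loader.load_item()

A field whose XPath matches nothing is left unset rather than raising, so a default such as "None provided" can be supplied on read with item.get('industry', 'None provided').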
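For spiders that assign fields directly, as in the question, a plain-Python alternative is to fold the repeated try/except into one helper and drive it from a table of XPaths. The sketch below is only an illustration: first_or_default, FIELD_XPATHS, populate and the extract callable are hypothetical names, not part of Scrapy, and the fake lookup at the end stands in for hxs.xpath(...).extract().

```python
DEFAULT = "None provided"

# Field name -> XPath, copied from two of the fields in the question
FIELD_XPATHS = {
    "industry": '//*[@id="overview"]/div[2]/div[2]/p/text()[2]',
    "URL": '//*[@id="ContentPlaceHolder_lnkWebsite"]/text()',
}


def first_or_default(values, default=DEFAULT):
    """Return the first element of values, or default when the list is empty."""
    return values[0] if values else default


def populate(item, extract):
    """Fill item using extract, a callable mapping an XPath to a list of strings."""
    for field, xpath in FIELD_XPATHS.items():
        item[field] = first_or_default(extract(xpath))
    return item


# Stand-in for the selector: only the industry XPath matches anything
fake_results = {FIELD_XPATHS["industry"]: ["Software"]}
item = populate({}, lambda xp: fake_results.get(xp, []))
print(item["industry"])  # Software
print(item["URL"])       # None provided
```

In a real spider the callable would presumably be something like lambda xp: hxs.xpath(xp).extract(), and adding a field becomes a one-line change to the table.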
