sajiang

Reputation: 60

How do I reduce the number of try/catch statements here?

I'm currently working with Scrapy to pull company information from a website. However, the amount of data provided varies widely across pages: one company lists three of its team members while another lists only two, or one company lists where it's located while another doesn't. As a result, some XPaths match nothing, and indexing into the empty result raises an IndexError:

try: 
    item['industry'] = hxs.xpath('//*[@id="overview"]/div[2]/div[2]/p/text()[2]').extract()[0]
except IndexError:
    item['industry'] = "None provided"
try:
    item['URL'] = hxs.xpath('//*[@id="ContentPlaceHolder_lnkWebsite"]/text()').extract()[0]
except IndexError:
    item['URL'] = "None provided"
try:
    item['desc'] = hxs.xpath('//*[@id="overview"]/div[2]/div[4]/p/text()[1]').extract()[0]
except IndexError:
    item['desc'] = "None provided"
try:
    item['founded'] = hxs.xpath('//*[@id="ContentPlaceHolder_updSummary"]/div/div[2]/table/tbody/tr/td[1]/text()').extract()[0]
except IndexError:
    item['founded'] = "None provided"

My code uses many try/catch statements. Since each exception is specific to the field I am trying to populate, is there a cleaner way of working around this?

Upvotes: 1

Views: 762

Answers (2)

JBJ

Reputation: 1109

In recent versions of Scrapy, extract_first() is used for this. It returns None if the search doesn't match anything, so no IndexError is raised. You can also pass a fallback value, e.g. extract_first(default="None provided").
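
If extract_first() is not available in your version, the same "first match or fallback" idea can be sketched in plain Python. The names first_or_default, populate, FIELD_XPATHS and DEFAULT below are hypothetical helpers introduced for illustration; only the XPath expressions come from the question. This is a minimal sketch that assumes extract() returns a (possibly empty) list of strings:

```python
# Hypothetical helpers; only the XPath strings are taken from the question.
DEFAULT = "None provided"

# One entry per field, so adding a field means adding one line, not one try/except.
FIELD_XPATHS = {
    "industry": '//*[@id="overview"]/div[2]/div[2]/p/text()[2]',
    "URL": '//*[@id="ContentPlaceHolder_lnkWebsite"]/text()',
    "desc": '//*[@id="overview"]/div[2]/div[4]/p/text()[1]',
    "founded": '//*[@id="ContentPlaceHolder_updSummary"]/div/div[2]/table/tbody/tr/td[1]/text()',
}


def first_or_default(values, default=DEFAULT):
    """Return the first element of values, or default when values is empty."""
    return values[0] if values else default


def populate(item, extract, field_xpaths=FIELD_XPATHS, default=DEFAULT):
    """Fill item[field] for every field.

    extract is any callable that takes an XPath string and returns a list
    of extracted strings (empty when nothing matched).
    """
    for field, xpath in field_xpaths.items():
        item[field] = first_or_default(extract(xpath), default)
    return item
```

Inside the spider this would be called as populate(item, lambda xp: hxs.xpath(xp).extract()), assuming hxs is the selector from the question. Passing the extraction step in as a callable keeps the helper independent of the Scrapy version in use.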

Upvotes: 0

alecxe

Reputation: 474201

Use the TakeFirst() output processor. From the documentation:

Returns the first non-null/non-empty value from the values received, so it’s typically used as an output processor to single-valued fields.

from scrapy.item import Item, Field
from scrapy.contrib.loader.processor import TakeFirst

class MyItem(Item):
    industry = Field(output_processor=TakeFirst())
    ...

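Then, inside the spider, populate the item through an Item Loader, which is what applies the field's output processor (plain item assignment does not), and you would not need try/except at all:

from scrapy.contrib.loader import ItemLoader

loader = ItemLoader(item=MyItem(), selector=hxs)
loader.add_xpath('industry', '//*[@id="overview"]/div[2]/div[2]/p/text()[2]')
# 'industry' is simply left unset when the XPath matches nothing
item = loader.load_item()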

Upvotes: 7
