Turo

Reputation: 1607

Scrapy - handle exception when one of item fields is not returned

I'm trying to parse Scrapy items, each of which has several fields. Some fields cannot be captured properly because the information on the site is incomplete. When even one field cannot be extracted, the whole item extraction fails with an exception (e.g. for the code below I get "Attribute:None cannot be split"). The parser then moves on to the next request without capturing the other fields that were available.

item['prodcode'] = response.xpath('//head/title').re_first(r'.....').split(" ")[1]
# throws: Attribute:None cannot be split. Does not parse the other fields.

How should such exceptions be handled in Scrapy? I would like to retrieve every field that is available and have the unavailable ones return a blank or N/A. I could wrap each item field in try... except..., but that does not seem like the best solution. The docs mention exception handling, but I cannot find anything that covers this case.

Upvotes: 6

Views: 2467

Answers (1)

alecxe

Reputation: 473763

The simplest approach is to follow the EAFP principle ("easier to ask forgiveness than permission") and handle the exception directly in the spider. For instance:

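try:
    item['prodcode'] = response.xpath('//head/title').re_first(r'.....').split(" ")[1]
except (AttributeError, IndexError):
    # AttributeError: re_first() returned None (no match).
    # IndexError: the match contained no second space-separated token.
    item['prodcode'] = 'n/a'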
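If repeating try/except for every field feels heavy, one possible way to cut the repetition is a small wrapper that runs the extraction and falls back to a default. This is only a sketch in plain Python: `safe_field` and `second_word` are hypothetical helper names (not part of Scrapy), and the sample strings stand in for whatever `re_first()` would return.

```python
def safe_field(extract, default="n/a", exceptions=(AttributeError, IndexError)):
    """Call extract() and return default if it raises one of the given exceptions."""
    try:
        return extract()
    except exceptions:
        return default


def second_word(text):
    # Mirrors the logic from the question; text may be None when nothing matched.
    return text.split(" ")[1]


# Hypothetical inputs standing in for the result of re_first().
print(safe_field(lambda: second_word("Widget ABC123 shop")))  # ABC123
print(safe_field(lambda: second_word(None)))                  # n/a
print(safe_field(lambda: second_word("Widget")))              # n/a
```

In a spider this would read something like `item['prodcode'] = safe_field(lambda: ...)`, with the original expression inside the lambda, so each field fails independently of the others.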

A better option could be to delegate the field parsing logic to Item Loaders and their Input and Output Processors. That way the spider is only responsible for parsing the HTML and extracting the desired data, while all of the post-processing and prettifying is handled by an Item Loader. In other words, in your spider you would only have:

loader = MyItemLoader(response=response)

# ...
loader.add_xpath("prodcode", "//head/title", re=r'.....')
# ...

loader.load_item()

And the Item Loader would have something like:

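from scrapy.loader import ItemLoader
# On older Scrapy versions these live in scrapy.loader.processors
from itemloaders.processors import MapCompose, TakeFirst


def parse_title(title):
    try:
        return title.split(" ")[1]
    except Exception:  # FIXME: handle more specific exceptions
        return 'n/a'


class MyItemLoader(ItemLoader):
    # Output a single value per field instead of a list
    default_output_processor = TakeFirst()

    # Input processor for the "prodcode" field: runs on each extracted value
    prodcode_in = MapCompose(parse_title)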
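To address the FIXME, here is one way `parse_title` might be narrowed to the two exceptions this expression can plausibly raise. It is a sketch rather than a tested loader: the sample values are made up, and a list comprehension stands in for `MapCompose`, which calls the function on each extracted value in turn.

```python
def parse_title(title):
    """Return the second space-separated token of title, or 'n/a'."""
    try:
        return title.split(" ")[1]
    except (AttributeError, IndexError):
        # AttributeError: title is not a string (e.g. None).
        # IndexError: title has fewer than two tokens.
        return "n/a"


# Hypothetical extracted values, standing in for what the loader would pass in.
sample_values = ["Widget ABC123 shop", "Widget", None]

# Imitates MapCompose's per-value application without needing Scrapy installed.
processed = [parse_title(value) for value in sample_values]
print(processed)  # ['ABC123', 'n/a', 'n/a']
```

One caveat worth checking against your Scrapy version: if the XPath or regex matches nothing at all, the loader has no values to feed the input processor, so the field may simply be left unset rather than set to 'n/a'.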

Upvotes: 7
