Cong Luo

Reputation: 47

Can Scrapy skip the error from empty data and keep scraping?

I want to scrape product pages from the site's sitemap. The product pages are similar, but not all of them are the same.

For example:

Product A https://www.vitalsource.com/products/environment-the-science-behind-the-stories-jay-h-withgott-matthew-v9780134446400

Product B https://www.vitalsource.com/products/abnormal-psychology-susan-nolen-hoeksema-v9781259765667

We can see that Product A has a subtitle but Product B doesn't.

So I get errors when I try to scrape all the product pages.

My question is: is there a way to let the spider skip the error when a page returns no data?

There is a simple way to bypass it: not using strip() (sketched after the error message below). But I am wondering if there is a better way to do the job.

import scrapy
import re
from VitalSource.items import VitalsourceItem
from scrapy.selector import Selector
from scrapy.spiders import SitemapSpider


class VsSpider(SitemapSpider):
    name = 'VS'
    allowed_domains = ['vitalsource.com']
    sitemap_urls = ['https://storage.googleapis.com/vst-stargate-production/sitemap/sitemap1.xml.gz']
    sitemap_rules = [
        ('/products/', 'parse_product'),
    ]

    def parse_product(self, response):
        selector = Selector(response=response)
        item = VitalsourceItem()
        item['Ebook_Title'] = response.css('.product-overview__title-header::text').extract()[1].strip
        # extract() returns a list, so calling .strip on it raises the AttributeError below
        item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").extract().strip
        print(item)
        return item

Error message:

    item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").extract().strip
AttributeError: 'list' object has no attribute 'strip'
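
For reference, the no-strip() bypass mentioned above would just store the raw result. It never raises, because extract() always returns a list (possibly empty):

# a sketch of the bypass: extract() returns a (possibly empty) list of strings,
# so there is no None to call .strip() on and no AttributeError
item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").extract()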

Upvotes: 1

Views: 975

Answers (3)

vezunchik

Reputation: 3717

Since you need only one subtitle, you can use get() with the default value set to an empty string. This will save you from errors about applying the strip() function to a missing element.

item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").get('').strip()
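
Applied to the spider in the question, parse_product might look like this (a sketch; the original title selector took the second text node with extract()[1], so getall() with an index guard is kept there):

def parse_product(self, response):
    item = VitalsourceItem()
    # getall() keeps the original "second text node" logic for the title,
    # with a length check in case that node is missing
    titles = response.css('.product-overview__title-header::text').getall()
    item['Ebook_Title'] = titles[1].strip() if len(titles) > 1 else ''
    # get('') returns '' instead of None when there is no subtitle,
    # so strip() is always safe to call
    item['Ebook_SubTitle'] = response.css('div.subtitle.subtitle-pdp::text').get('').strip()
    return item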

Upvotes: 3

Matt Reynolds

Reputation: 477

You could check if a value is returned before extracting:

if response.css("div.subtitle.subtitle-pdp::text"):
    item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").get().strip()

That way, the subtitle line would only run if a value was returned.
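
Note that with a bare if, items from pages without a subtitle won't have the Ebook_SubTitle key at all. A sketch with an explicit default, which also runs the CSS query only once:

subtitle = response.css("div.subtitle.subtitle-pdp::text").get()
if subtitle:
    item['Ebook_SubTitle'] = subtitle.strip()
else:
    item['Ebook_SubTitle'] = ''  # explicit default when the page has no subtitle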

Upvotes: 0

Granitosaurus

Reputation: 21446

In general, Scrapy will not stop crawling if a callback raises an exception. For example:

from scrapy import Request


def start_requests(self):
    for i in range(10):
        yield Request(
            f'http://example.org/page/{i}',
            callback=self.parse,
            errback=self.errback,
        )

def parse(self, response):
    # deliberately fail on one of the pages
    if 'page/1' in response.request.url:
        raise ValueError()
    yield {'url': response.url}

def errback(self, failure):
    print(f"oh no, failed to parse {failure.request}")

In this example, 10 requests will be made and 9 items will be scraped, but 1 will fail and go to errback.

In your case you have nothing to fear: any request that does not raise an exception will scrape as it should, and for the ones that do, you'll just see an exception traceback in your terminal/logs.
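
If you'd rather skip such pages quietly instead of letting the traceback appear, you can catch the error yourself and drop the item (a sketch, reusing the field names and selectors from the question):

def parse_product(self, response):
    item = VitalsourceItem()
    try:
        item['Ebook_Title'] = response.css('.product-overview__title-header::text').extract()[1].strip()
        item['Ebook_SubTitle'] = response.css('div.subtitle.subtitle-pdp::text').get('').strip()
    except IndexError:
        # the title wasn't where we expected it; log and skip this page
        self.logger.warning('skipping %s: missing title', response.url)
        return
    return item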

Upvotes: 0
