Reputation: 47
I want to scrape product pages from its sitemap. The product pages are similar, but not all of them are the same.
For example:
Product B: https://www.vitalsource.com/products/abnormal-psychology-susan-nolen-hoeksema-v9781259765667
Product A has a subtitle, but Product B doesn't.
So I get errors when I try to scrape all the product pages.
My question is: is there a way to let the spider skip the pages that return no data instead of raising an error?
There is a simple way to bypass it, which is not using strip(), but I am wondering if there is a better way to do the job.
import scrapy
import re
from VitalSource.items import VitalsourceItem
from scrapy.selector import Selector
from scrapy.spiders import SitemapSpider

class VsSpider(SitemapSpider):
    name = 'VS'
    allowed_domains = ['vitalsource.com']
    sitemap_urls = ['https://storage.googleapis.com/vst-stargate-production/sitemap/sitemap1.xml.gz']
    sitemap_rules = [
        ('/products/', 'parse_product'),
    ]

    def parse_product(self, response):
        selector = Selector(response=response)
        item = VitalsourceItem()
        item['Ebook_Title'] = response.css('.product-overview__title-header::text').extract()[1].strip
        item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").extract().strip
        print(item)
        return item
Error message:
item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").extract().strip
AttributeError: 'list' object has no attribute 'strip'
Upvotes: 1
Views: 975
Reputation: 3717
Since you need only one subtitle, you can use get() with the default value set to an empty string. This will save you from errors caused by applying strip() to an empty result.
item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").get('').strip()
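For example, the whole callback could default both fields this way (a minimal sketch based on the spider from the question; note that get() returns only the first matching text node, while the question indexed the second one for the title):

def parse_product(self, response):
    item = VitalsourceItem()
    # get('') returns the first match, or the given default ('') when
    # nothing matches, so strip() never hits a None or a list
    item['Ebook_Title'] = response.css('.product-overview__title-header::text').get('').strip()
    item['Ebook_SubTitle'] = response.css('div.subtitle.subtitle-pdp::text').get('').strip()
    return item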
Upvotes: 3
Reputation: 477
You could check if a value is returned before extracting:
if response.css("div.subtitle.subtitle-pdp::text"):
    item['Ebook_SubTitle'] = response.css("div.subtitle.subtitle-pdp::text").get().strip()
That way, the subtitle line would only run if a value is actually returned.
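To avoid running the same selector twice, you could also store the result first (a small variation on the same idea):

subtitle = response.css("div.subtitle.subtitle-pdp::text").get()
if subtitle:
    item['Ebook_SubTitle'] = subtitle.strip()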
Upvotes: 0
Reputation: 21446
In general, Scrapy will not stop crawling if a callback raises an exception. For example:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        for i in range(10):
            yield scrapy.Request(
                f'http://example.org/page/{i}',
                callback=self.parse,
                errback=self.errback,
            )

    def parse(self, response):
        # fail on the first page
        if 'page/1' in response.request.url:
            raise ValueError()
        yield {'url': response.url}

    def errback(self, failure):
        print(f"oh no, failed to parse {failure.request}")
In this example, 10 requests will be made and 9 items will be scraped; 1 will fail and go to the errback.
In your case you have nothing to fear: any request that does not raise an exception will be scraped as it should; for the ones that do, you'll just see an exception traceback in your terminal/logs.
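If you want the errback itself to distinguish failure types (download-level errors such as HTTP errors, DNS failures, or timeouts), a sketch along the lines of the pattern in the Scrapy docs could look like this (the logger calls assume the method sits inside a spider class):

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError

def errback(self, failure):
    # failure.request is the Request that triggered the error
    if failure.check(HttpError):
        # a non-2xx response that was not allowed through
        self.logger.error('HttpError on %s', failure.value.response.url)
    elif failure.check(DNSLookupError):
        self.logger.error('DNSLookupError on %s', failure.request.url)
    elif failure.check(TimeoutError):
        self.logger.error('TimeoutError on %s', failure.request.url)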
Upvotes: 0