Reputation: 65
I'm facing a strange issue while trying to scrape this URL:
To perform the crawling, I designed this:
import logging

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.sgml import SgmlLinkExtractor


class IkeaSpider(CrawlSpider):
    name = "Ikea"
    allowed_domains = ["www.ikea.com"]
    start_urls = ["http://www.ikea.com/fr/fr/catalog/productsaz/8/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=[r'.*/catalog/products/\d+']),
             callback='parse_page',
             follow=True),
    )

    logging.basicConfig(filename='example.log', level=logging.ERROR)

    def parse_page(self, response):
        for sel in response.xpath('//div[@class="rightContent"]'):
            # Blah blah blah (item extraction goes here)
            ...
I launch the spider from the command line, and I can see URLs being crawled normally, but for some of them the callback doesn't fire (only about half of them are actually scraped).
As there are more than 150 links on this page, that might explain why the crawler is missing callbacks (too many jobs). Do any of you have an idea about this?
This is the log:

2015-12-25 09:02:55 [scrapy] INFO: Stored csv feed (107 items) in: test.csv
2015-12-25 09:02:55 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 68554,
 'downloader/request_count': 217,
 'downloader/request_method_count/GET': 217,
 'downloader/response_bytes': 4577452,
 'downloader/response_count': 217,
 'downloader/response_status_count/200': 216,
 'downloader/response_status_count/404': 1,
 'dupefilter/filtered': 107,
 'file_count': 106,
 'file_status_count/downloaded': 106,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 12, 25, 8, 2, 55, 548350),
 'item_scraped_count': 107,
 'log_count/DEBUG': 433,
 'log_count/ERROR': 2,
 'log_count/INFO': 8,
 'log_count/WARNING': 1,
 'request_depth_max': 2,
 'response_received_count': 217,
 'scheduler/dequeued': 110,
 'scheduler/dequeued/memory': 110,
 'scheduler/enqueued': 110,
 'scheduler/enqueued/memory': 110,
 'start_time': datetime.datetime(2015, 12, 25, 8, 2, 28, 656959)}
2015-12-25 09:02:55 [scrapy] INFO: Spider closed (finished)
Upvotes: 1
Views: 401
Reputation: 1035
I'm not a fan of these auto spider classes. I usually just build exactly what I need.
import logging

import scrapy


class IkeaSpider(scrapy.Spider):
    name = "Ikea"
    allowed_domains = ["www.ikea.com"]
    start_urls = ["https://www.ikea.com/fr/fr/cat/produits-products/"]
    logging.basicConfig(filename='example.log', level=logging.ERROR)

    def parse(self, response):
        # You could also use an a.vn-nav__link::attr(href) selector.
        for link in response.css('a[href*="/fr/cat/"]::attr(href)').getall():
            # response.follow resolves relative hrefs against the current page
            yield response.follow(link, callback=self.parse_category)

    def parse_category(self, response):
        # parse items or potential subcategories
        pass
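If the category pages themselves contain further subcategory links before you reach actual product listings, parse_category could follow them recursively and only yield items once products appear. A rough sketch of that idea, meant to replace the stub above (the product selector and field names are guesses, not taken from the real IKEA markup):

    def parse_category(self, response):
        # Assumed: sub-category links look like the top-level category links
        for link in response.css('a[href*="/fr/cat/"]::attr(href)').getall():
            yield response.follow(link, callback=self.parse_category)

        # Assumed: each product tile exposes a name and a price element
        for product in response.css('div.product'):
            yield {
                'name': product.css('.name::text').get(),
                'price': product.css('.price::text').get(),
            }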
Upvotes: 0
Reputation: 65
I've read a lot of things regarding my problem, and apparently the CrawlSpider class is not specific enough, which might explain why it misses some links, for reasons I can't explain.
Basically, it is advised to use the BaseSpider class with the start_requests and make_requests_from_url methods to do the job in a more controlled way.
I am still not completely sure how to do it precisely; that was just a hint.
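For what it's worth, a minimal sketch of that approach could look like the following. Note that BaseSpider has since been renamed to scrapy.Spider, the spider name here is made up, and the XPath used for the product links is only an assumption derived from the allow pattern in the original rule:

import scrapy


class IkeaAZSpider(scrapy.Spider):
    # Hypothetical rewrite of the original spider without CrawlSpider rules
    name = "IkeaAZ"
    allowed_domains = ["www.ikea.com"]

    def start_requests(self):
        # Request the A-Z listing page explicitly
        yield scrapy.Request("http://www.ikea.com/fr/fr/catalog/productsaz/8/",
                             callback=self.parse_listing)

    def parse_listing(self, response):
        # Assumed XPath: every product link on the listing page
        links = response.xpath('//a[contains(@href, "/catalog/products/")]/@href').getall()
        for href in links:
            yield response.follow(href, callback=self.parse_page)

    def parse_page(self, response):
        for sel in response.xpath('//div[@class="rightContent"]'):
            # extract item fields here
            pass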
Upvotes: 0