Bobafotz

Reputation: 65

Scrapy crawler extracts URLs but misses half of the callbacks

I'm facing a strange issue while trying to scrape this URL: http://www.ikea.com/fr/fr/catalog/productsaz/8/

To perform the crawling, I designed this:

import logging

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

logging.basicConfig(filename='example.log', level=logging.ERROR)


class IkeaSpider(CrawlSpider):

    name = "Ikea"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["ikea.com"]
    start_urls = ["http://www.ikea.com/fr/fr/catalog/productsaz/8/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=[r'.*/catalog/products/\d+']),
             callback='parse_page',
             follow=True),
    )

    def parse_page(self, response):
        for sel in response.xpath('//div[@class="rightContent"]'):
            # ... item extraction logic ...
            pass

I launch the spider from the command line and I can see the URLs being extracted normally, but for some of them the callback never runs (only about half of the pages are actually scraped).

As there are more than 150 links on this page, maybe that explains why the crawler is missing callbacks (too many jobs?). Does anyone have an idea about this?
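In case it is relevant, here is a sketch of what I could add to the spider to make dropped requests visible. DUPEFILTER_DEBUG and LOG_LEVEL are standard Scrapy settings; whether the duplicate filter is really what is eating the missing callbacks is only a guess on my part:

class IkeaSpider(CrawlSpider):

    # DUPEFILTER_DEBUG makes Scrapy log every request the duplicate filter drops,
    # instead of only the first one, so silently skipped product pages show up in the log.
    custom_settings = {
        "DUPEFILTER_DEBUG": True,
        "LOG_LEVEL": "DEBUG",
    }

    # ... name, start_urls, rules and parse_page unchanged ...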

This is the log:

2015-12-25 09:02:55 [scrapy] INFO: Stored csv feed (107 items) in: test.csv
2015-12-25 09:02:55 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 68554,
 'downloader/request_count': 217,
 'downloader/request_method_count/GET': 217,
 'downloader/response_bytes': 4577452,
 'downloader/response_count': 217,
 'downloader/response_status_count/200': 216,
 'downloader/response_status_count/404': 1,
 'dupefilter/filtered': 107,
 'file_count': 106,
 'file_status_count/downloaded': 106,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 12, 25, 8, 2, 55, 548350),
 'item_scraped_count': 107,
 'log_count/DEBUG': 433,
 'log_count/ERROR': 2,
 'log_count/INFO': 8,
 'log_count/WARNING': 1,
 'request_depth_max': 2,
 'response_received_count': 217,
 'scheduler/dequeued': 110,
 'scheduler/dequeued/memory': 110,
 'scheduler/enqueued': 110,
 'scheduler/enqueued/memory': 110,
 'start_time': datetime.datetime(2015, 12, 25, 8, 2, 28, 656959)}
2015-12-25 09:02:55 [scrapy] INFO: Spider closed (finished)

Upvotes: 1

Views: 401

Answers (2)

ThePyGuy

Reputation: 1035

I'm not a fan of these auto spider classes. I usually just build exactly what I need.

import logging

import scrapy

logging.basicConfig(filename='example.log', level=logging.ERROR)


class IkeaSpider(scrapy.Spider):

    name = "Ikea"
    allowed_domains = ["ikea.com"]
    start_urls = ["https://www.ikea.com/fr/fr/cat/produits-products/"]

    def parse(self, response):
        # You could also use an a.vn-nav__link::attr(href) selector.
        # Select links whose href contains "/fr/cat/" (the category pages).
        for link in response.css('a[href*="/fr/cat/"]::attr(href)').getall():
            # response.follow resolves relative hrefs against the current page.
            yield response.follow(link, callback=self.parse_category)

    def parse_category(self, response):
        # parse items or potential sub-categories here
        pass

Upvotes: 0

Bobafotz

Reputation: 65

I've read a lot about my problem and, apparently, the CrawlSpider class is not specific enough, which might explain why it misses some links for reasons I can't pin down. Basically, the advice is to use the BaseSpider class with start_requests and the make_requests_from_url method to do the job in a more controlled way. I am still not completely sure how to do that precisely; this is just a hint.
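For the record, here is a rough sketch of what that could look like with a plain Spider and explicit requests. The selectors and the dont_filter flag are my own assumptions, not a tested solution; dont_filter=True is only there to rule out the duplicate filter, which the 'dupefilter/filtered': 107 line in the stats above suggests may be dropping the missing pages:

import scrapy


class IkeaAZSpider(scrapy.Spider):

    name = "ikea_az"
    allowed_domains = ["ikea.com"]

    def start_requests(self):
        # Request the products A-Z page explicitly instead of relying on CrawlSpider rules.
        yield scrapy.Request("http://www.ikea.com/fr/fr/catalog/productsaz/8/",
                             callback=self.parse_az_page)

    def parse_az_page(self, response):
        for href in response.xpath('//a[contains(@href, "/catalog/products/")]/@href').extract():
            # dont_filter=True stops the dupefilter from silently skipping links
            # that resolve to an already-seen product URL.
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_page,
                                 dont_filter=True)

    def parse_page(self, response):
        for sel in response.xpath('//div[@class="rightContent"]'):
            # item extraction logic goes here
            pass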

Upvotes: 0
