Ger

Reputation: 259

Scrapy spider pauses unexpectedly, then continues after several hours

I am running a Scrapy spider on 400 websites. For the first few days it ran as expected, scraping about 500 pages every minute. After those first days had passed, however, the spider started to show some unexpected behavior: the log files revealed periods of more than an hour (and often several hours, see the terminal output below) in which no pages were crawled at all. I am a bit puzzled about the reason for this behavior. Possible reasons I have ruled out:

What other reasons could explain why the scraper pauses for hours and then continues again afterwards?

2020-11-11 05:03:38 [scrapy.extensions.logstats] INFO: Crawled 1043749 pages (at 487 pages/min), scraped 940521 items (at 427 items/min)
2020-11-11 06:27:49 [scrapy.extensions.logstats] INFO: Crawled 1043771 pages (at 22 pages/min), scraped 940592 items (at 71 items/min)
2020-11-11 06:28:49 [scrapy.extensions.logstats] INFO: Crawled 1044370 pages (at 599 pages/min), scraped 941141 items (at 549 items/min)

Upvotes: 1

Views: 508

Answers (1)

Ger

Reputation: 259

Following @Gallaecio's suggestion that my system might simply be struggling to print the INFO logs, I investigated the RAM consumption of my scraper using Task Manager. It soon turned out that after a day or so the scraper was consuming most of my RAM. Inspecting the number of queued requests in the Telnet console showed that the real problem was too many requests being kept in RAM.
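
For anyone who wants to reproduce that check, here is a rough sketch of inspecting the scheduler queues from Scrapy's Telnet console (enabled by default on localhost port 6023; recent Scrapy versions print the login credentials in the crawl log). The exact attribute paths can differ between Scrapy versions, so treat this as a sketch rather than the precise commands I used:

telnet localhost 6023
# the Telnet console is a regular Python prompt running inside the crawl process
est()                              # print an engine status report, including scheduler queue lengths
len(engine.slot.scheduler.mqs)     # requests currently queued in memory (path may vary per version)
prefs()                            # live object counts, handy for tracking down memory growth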

I have tried to address this in two ways:

  1. I added a middleware that limits the number of requests per domain (to prevent the crawl from getting stuck in a couple of very large domains). Following this post I added this to my middlewares.py:
from urllib.parse import urlparse
from threading import Lock
from scrapy.exceptions import IgnoreRequest, NotConfigured

class DomainlimitMiddleware:
    """Downloader middleware that ignores further requests to a domain once
    that domain has used up its MAX_REQUESTS_PER_DOMAIN budget."""

    def __init__(self, settings):
        self.lock = Lock()
        self.domain_data = {}  # per-domain request counters
        self.max_requests_per_domain = settings.getint('MAX_REQUESTS_PER_DOMAIN')
        if self.max_requests_per_domain < 1:
            raise NotConfigured()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_request(self, request, spider):
        parsed = urlparse(request.url)
        with self.lock:
            # read the current counter for this domain, then increment it
            num_requests = self.domain_data.get(parsed.netloc, 0)
            self.domain_data[parsed.netloc] = num_requests + 1

        if num_requests > self.max_requests_per_domain:
            raise IgnoreRequest('Domain has hit the maximum number of requests processed')

        return None

And activated it by adding this to my settings.py:

MAX_REQUESTS_PER_DOMAIN = 50000
DOWNLOADER_MIDDLEWARES = {
    '<myproject>.middlewares.DomainlimitMiddleware': 543,
}
  2. Following this post I queued the requests on disk instead of in memory by adding this to my settings.py:

SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

Subsequently I ran my scraper from the command line using:

scrapy crawl {spidername} -s JOBDIR=crawls/{spidername}

The advantage of saving the requests to disk is that it also allows the scraper to be paused and resumed afterwards.
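
For completeness, the pause/resume workflow with a JOBDIR roughly looks like this: press Ctrl-C once (or send SIGTERM) so Scrapy can finish the in-flight requests and persist its queue and dedupe state under crawls/{spidername}, then resume later by running the exact same command:

scrapy crawl {spidername} -s JOBDIR=crawls/{spidername}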

Upvotes: 1
