Reputation: 259
I am running a Scrapy spider on 400 webpages. For the first few days it ran as expected, crawling about 500 pages per minute. After that, however, the spider started to show some unexpected behavior: the log files revealed periods of more than an hour (and often several hours, see the terminal output below) in which no pages were crawled at all. I am puzzled about the reason for this behavior. Possible reasons I have ruled out:
What other reasons could explain the scraper pausing for hours and then continuing again afterwards?
2020-11-11 05:03:38 [scrapy.extensions.logstats] INFO: Crawled 1043749 pages (at 487 pages/min), scraped 940521 items (at 427 items/min)
2020-11-11 06:27:49 [scrapy.extensions.logstats] INFO: Crawled 1043771 pages (at 22 pages/min), scraped 940592 items (at 71 items/min)
2020-11-11 06:28:49 [scrapy.extensions.logstats] INFO: Crawled 1044370 pages (at 599 pages/min), scraped 941141 items (at 549 items/min)
Upvotes: 1
Views: 508
Reputation: 259
Following @Gallaecio's suggestion that my system might simply be struggling to print the INFO logs, I investigated the RAM consumption of my scraper using Task Manager. It soon turned out that after a day or so the scraper was consuming most of my RAM. Inspecting the number of queued requests in the Telnet console confirmed the problem: far more requests were being queued than could be kept in RAM.
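In case it helps others, here is a minimal sketch of how the queue can be inspected through Scrapy's telnet console (assuming the default port 6023; the port and credentials are printed in the crawl log at startup, and the exact report format may differ between Scrapy versions):

$ telnet localhost 6023        # connect to the running crawl
>>> est()                      # prints an engine status report; look at
                               #   len(engine.slot.scheduler.mqs)       -> requests queued in memory
                               #   len(engine.slot.scheduler.dqs or []) -> requests queued on disk
>>> prefs()                    # live object counts, handy for spotting what is piling up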
First, by limiting the number of requests queued per domain with a custom downloader middleware:
from urllib.parse import urlparse
from threading import Lock

from scrapy.exceptions import IgnoreRequest, NotConfigured


class DomainlimitMiddleware:
    """Drop requests for a domain once it has hit MAX_REQUESTS_PER_DOMAIN."""

    def __init__(self, settings):
        self.lock = Lock()
        self.domain_data = {}  # netloc -> number of requests seen so far
        self.max_requests_per_domain = settings.getint('MAX_REQUESTS_PER_DOMAIN')
        if self.max_requests_per_domain < 1:
            # Disable the middleware when the setting is missing or zero
            raise NotConfigured()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_request(self, request, spider):
        parsed = urlparse(request.url)
        with self.lock:
            # Read and increment the per-domain counter atomically
            num_requests = self.domain_data.get(parsed.netloc, 0)
            self.domain_data[parsed.netloc] = num_requests + 1
        if num_requests > self.max_requests_per_domain:
            raise IgnoreRequest('Domain has hit the maximum number of requests processed')
        return None
And activated it by adding this to my settings.py:
MAX_REQUESTS_PER_DOMAIN = 50000
DOWNLOADER_MIDDLEWARES = {
    '<myproject>.middlewares.DomainlimitMiddleware': 543,
}
Second, I ran my scraper from the command line using:
scrapy crawl {spidername} -s JOBDIR=crawls/{spidername}
The advantage of saving the requests to disk is that it also allows the scraper to be paused and resumed later.
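To illustrate, a minimal sketch of the pause/resume workflow as documented for Scrapy's JOBDIR setting ({spidername} is the placeholder used above):

$ scrapy crawl {spidername} -s JOBDIR=crawls/{spidername}
# ... press Ctrl-C once and wait for a graceful shutdown ...
$ scrapy crawl {spidername} -s JOBDIR=crawls/{spidername}   # resumes from the persisted queue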
Upvotes: 1