olegario

Reputation: 742

Scrapy - Spiders taking too long to shut down

Basically, I have a file named spiders.py in which I configure all my spiders and fire them all, using a single crawler. This is the source code of that file:

from scrapy import spiderloader
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from navigator import *


def main():
    settings = get_project_settings()
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    process = CrawlerProcess(settings=settings)
    # schedule every spider in the project on the same crawler process
    for spider_name in spider_loader.list():
        process.crawl(spider_name)

    # blocking call: runs until all crawlers finish
    process.start()


if __name__ == '__main__':
    main()

What I'm trying to achieve is to fire these spiders from another script using the subprocess module and, after 5 minutes of execution, shut down all spiders (using only one SIGTERM). The file responsible for this is monitor.py:

from time import sleep
import os
import signal
import subprocess

def main():
    # run spiders.py in its own process group so the whole group can be signalled
    spiders_process = subprocess.Popen(["python", "spiders.py"], stdout=subprocess.PIPE,
                                       shell=False, preexec_fn=os.setsid)
    sleep(300)
    # after 5 minutes, send SIGTERM to the spiders' process group
    os.killpg(spiders_process.pid, signal.SIGTERM)

if __name__ == '__main__':
    main()

When the main thread wakes up, the terminal says 2018-07-19 21:45:09 [scrapy.crawler] INFO: Received SIGTERM, shutting down gracefully. Send again to force. But even after this message, the spiders continue to scrape the web pages. What am I doing wrong?

OBS: Is it possible to fire all the spiders inside spiders.py without blocking the main process?

Upvotes: 2

Views: 1584

Answers (1)

John Smith

Reputation: 686

I believe that when Scrapy receives a SIGTERM it tries to shut down gracefully by first waiting for all sent/scheduled requests to finish. Your best bet is either to limit the number of concurrent requests so it finishes quicker (CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN are 16 and 8 respectively by default), or to send two SIGTERMs to instruct Scrapy to do an unclean, immediate exit.
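For the second option, a minimal sketch of how the monitor.py from your question could send the signal twice (the 10-second grace period between the two signals is an arbitrary choice):

from time import sleep
import os
import signal
import subprocess

def main():
    spiders_process = subprocess.Popen(["python", "spiders.py"], stdout=subprocess.PIPE,
                                       shell=False, preexec_fn=os.setsid)
    sleep(300)
    pgid = os.getpgid(spiders_process.pid)
    os.killpg(pgid, signal.SIGTERM)   # first SIGTERM: ask Scrapy to shut down gracefully
    sleep(10)                         # arbitrary grace period
    os.killpg(pgid, signal.SIGTERM)   # second SIGTERM: force an unclean, immediate exit
    spiders_process.wait()

if __name__ == '__main__':
    main()

The alternative is simply lowering CONCURRENT_REQUESTS (and CONCURRENT_REQUESTS_PER_DOMAIN) in your project's settings.py, at the cost of slower crawling overall.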

OBS: Is it possible to fire all the spiders inside spiders.py without blocking the main process?

process.start() starts the Twisted reactor (Twisted's main event loop), which is a blocking call. To get around that and run more code after the reactor has started, you can schedule a function to be run inside the loop. The first snippet in this manual should give you an idea: https://twistedmatrix.com/documents/current/core/howto/time.html.
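For example, here is a rough sketch (untested, assuming the default reactor and leaving out your project-specific navigator import) of your spiders.py using reactor.callLater to ask the CrawlerProcess to stop 300 seconds after the reactor starts, instead of sleeping in a separate monitoring process:

from twisted.internet import reactor

from scrapy import spiderloader
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def main():
    settings = get_project_settings()
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    process = CrawlerProcess(settings=settings)
    for spider_name in spider_loader.list():
        process.crawl(spider_name)

    # runs inside the event loop, so it does not block the reactor;
    # process.stop() asks all running crawlers to stop gracefully
    reactor.callLater(300, process.stop)

    process.start()  # blocks until all crawlers have finished or been stopped


if __name__ == '__main__':
    main()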

However, if you go that way, you must make sure that the code you schedule is also non-blocking; otherwise, if you pause the execution of the loop for too long, bad things can start happening. So things like time.sleep() must be rewritten using a Twisted equivalent.
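As a small illustration of that point, a non-blocking wait in Twisted looks roughly like this, with task.deferLater standing in for time.sleep() (in the Scrapy case you would not call reactor.run() yourself, since process.start() already runs the reactor):

from twisted.internet import reactor, task

def after_delay():
    # runs 5 seconds later; the reactor kept serving other events in the meantime
    print("5 seconds passed without blocking the event loop")
    reactor.stop()

# schedules after_delay and returns a Deferred, instead of blocking like time.sleep(5)
task.deferLater(reactor, 5, after_delay)
reactor.run()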

Upvotes: 1

Related Questions