Thorin Schiffer

Reputation: 2856

Start multiple Scrapy spiders without blocking the process

I'm trying to execute a Scrapy spider from a separate script, and when I run this script in a loop (for instance, running the same spider with different parameters), I get ReactorAlreadyRunning. My snippet:

from celery import task
from episode.skywalker.crawlers import settings
from multiprocessing.queues import Queue
from scrapy import log, project, signals
from scrapy.settings import CrawlerSettings
from scrapy.spider import BaseSpider
from scrapy.spidermanager import SpiderManager
from scrapy.xlib.pydispatch import dispatcher
import multiprocessing
from twisted.internet.error import ReactorAlreadyRunning


class CrawlerWorker(multiprocessing.Process):
    """Run a single crawl in a separate process so it gets its own reactor."""

    def __init__(self, spider, result_queue):
        from scrapy.crawler import CrawlerProcess
        multiprocessing.Process.__init__(self)
        self.result_queue = result_queue
        self.crawler = CrawlerProcess(CrawlerSettings(settings))
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        self.items = []
        self.spider = spider
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        self.crawler.crawl(self.spider)
        try:
            self.crawler.start()
        except ReactorAlreadyRunning:
            # The reactor can only be started once per process.
            pass

        self.crawler.stop()
        self.result_queue.put(self.items)


@task
def execute_spider(spider, **spider__kwargs):
    '''
    Execute spider within separate process
    @param spider: spider class to crawl or the name (check if instance)
    '''

    if not isinstance(spider, BaseSpider):
        manager = SpiderManager(settings.SPIDER_MODULES)
        spider = manager.create(spider, **spider__kwargs)
    result_queue = Queue()
    crawler = CrawlerWorker(spider, result_queue)
    crawler.start()
    items = []

    for item in result_queue.get():
        items.append(item)

My guess is that this is caused by multiple Twisted reactor runs. How can I avoid it? Is there, in general, a way to run spiders without the reactor?
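As far as I understand, Twisted's reactor is a per-process singleton and can only be started once, so the minimal standalone snippet below (unrelated to Scrapy, written just to illustrate) reproduces the same exception:

from twisted.internet import reactor
from twisted.internet.error import ReactorAlreadyRunning

def start_again():
    try:
        # A second start attempt while the reactor is already running fails.
        reactor.run()
    except ReactorAlreadyRunning:
        print("reactor is already running in this process")
    finally:
        reactor.stop()  # shut down the one real reactor

reactor.callWhenRunning(start_again)
reactor.run()  # the first, legitimate start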

Upvotes: 2

Views: 2495

Answers (1)

Thorin Schiffer

Reputation: 2856

I figured out what caused the problem: if you somehow call the execute_spider method inside the CrawlerWorker process (for instance via recursion), it leads to a second reactor being started, which is not possible.

My solution: move all statements that cause recursive calls into the execute_spider method, so that they trigger the spider execution in the same process rather than in the secondary CrawlerWorker. I also built in the following check

try:
    self.crawler.start()
except ReactorAlreadyRunning:
    # RecursiveSpiderCall is a custom exception defined in my project.
    raise RecursiveSpiderCall("Spider %s was called from another spider "
                              "recursively. Such behavior is not allowed" % (self.spider))

to catch unintentional recursive calls of spiders.
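Roughly, the restructured task looks like the sketch below (follow_up_spiders is a hypothetical helper of mine, not Scrapy or Celery API): every follow-up crawl is scheduled from execute_spider in the parent process, so each CrawlerWorker starts in a fresh process with its own reactor.

class RecursiveSpiderCall(Exception):
    """Raised when a crawl is triggered from inside a CrawlerWorker process."""


@task
def execute_spider(spider, **spider__kwargs):
    '''
    Execute spider within a separate process; any follow-up crawls are
    scheduled from here, never from inside the worker.
    '''
    if not isinstance(spider, BaseSpider):
        manager = SpiderManager(settings.SPIDER_MODULES)
        spider = manager.create(spider, **spider__kwargs)

    result_queue = Queue()
    crawler = CrawlerWorker(spider, result_queue)
    crawler.start()
    items = list(result_queue.get())

    # Hypothetical helper: inspect the scraped items here, in the parent
    # process, and decide whether another spider has to run, instead of
    # crawling from inside the worker.
    for next_spider in follow_up_spiders(items):
        execute_spider.delay(next_spider)

    return items

Since delay() goes back through the task queue, every follow-up crawl again gets its own worker process and its own reactor.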

Upvotes: 1
