netimen

Reputation: 4419

Easiest way to run scrapy crawler so it doesn't block the script

The official docs give many ways to run Scrapy crawlers from code:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished

But all of them block the script until crawling is finished. What's the easiest way in Python to run the crawler in a non-blocking, async manner?

Upvotes: 3

Views: 3002

Answers (2)

Neo

Reputation: 11

Netimen's answer is correct: process.start() calls reactor.run(), which blocks the thread. However, I don't think it is necessary to subclass billiard.Process. Although poorly documented, billiard.Process offers an API for running another function asynchronously without subclassing.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from billiard import Process

crawler = CrawlerProcess(get_project_settings())
# Forward stop_after_crawl=False to crawler.start() via the kwargs dict;
# Process itself does not take a stop_after_crawl argument.
process = Process(target=crawler.start, kwargs={'stop_after_crawl': False})


def crawl(*args, **kwargs):
    crawler.crawl(*args, **kwargs)  # schedule the spider on the crawler
    process.start()  # run crawler.start() in a separate process

Note that if you don't pass stop_after_crawl=False, you may run into a ReactorNotRestartable exception when you run the crawler more than once.
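
A minimal usage sketch (MySpider is the spider class from the question; the print and join calls are only illustrative):

crawl(MySpider)  # returns immediately; the crawl runs in the child process
print('the script keeps running while the spider crawls')
process.join()  # optionally block later, when you need the crawl to be done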

Upvotes: 1

netimen

Reputation: 4419

I tried every solution I could find, and the only one that worked for me was this. But in order to make it work with Scrapy 1.1rc1 I had to tweak it a little bit:

from scrapy.crawler import Crawler
from scrapy import signals
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from billiard import Process

class CrawlerScript(Process):
    def __init__(self, spider):
        Process.__init__(self)
        settings = get_project_settings()
        self.crawler = Crawler(spider.__class__, settings)
        # Stop the reactor once the spider closes so the child process can exit
        self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        self.spider = spider

    def run(self):
        # Runs in the child process, so blocking on the reactor here
        # doesn't block the reactor of the calling script
        self.crawler.crawl(self.spider)
        reactor.run()

def crawl_async():
    spider = MySpider()  # the spider class defined in the question
    crawler = CrawlerScript(spider)
    crawler.start()  # fork a child process that runs the crawl
    crawler.join()  # wait for the crawl to finish

So now when I call crawl_async, it starts crawling and doesn't block my current thread. I'm absolutely new to Scrapy, so maybe this isn't a very good solution, but it worked for me.
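
Because every call forks a fresh process with its own reactor, this can also be run repeatedly without hitting the ReactorNotRestartable exception. A rough sketch (the loop and interval are arbitrary):

import time

for _ in range(3):
    crawl_async()  # each call runs the crawl in a fresh child process
    time.sleep(60)  # arbitrary pause between crawls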

I used these versions of the libraries:

cffi==1.5.0
Scrapy==1.1rc1
Twisted==15.5.0
billiard==3.3.0.22

Upvotes: 5
