Reputation: 4419
The official docs describe several ways to run Scrapy crawlers from code:
    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished
But all of them block the script until crawling is finished. What's the easiest way in Python to run the crawler in a non-blocking, async manner?
Upvotes: 3
Views: 3002
Reputation: 11
Netimen's answer is correct: process.start() calls reactor.run(), which blocks the thread. However, I don't think it is necessary to subclass billiard.Process. Although poorly documented, billiard.Process does have an API for calling another function asynchronously without subclassing.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    from billiard import Process

    crawler = CrawlerProcess(get_project_settings())
    # Keyword arguments for the target go in kwargs=, not directly on Process
    process = Process(target=crawler.start, kwargs={'stop_after_crawl': False})

    def crawl(*args, **kwargs):
        crawler.crawl(*args, **kwargs)
        process.start()
Note that if you don't pass stop_after_crawl=False, you may run into a ReactorNotRestartable exception when you run the crawler more than twice.
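To illustrate how arguments reach a target function without subclassing, here is a minimal sketch that uses the standard library's multiprocessing module as a stand-in for billiard (billiard is a fork of multiprocessing, so the Process(target=..., kwargs=...) pattern is the same). It does not involve Scrapy at all: blocking_start and run_in_child are hypothetical names invented for the example, and it assumes a Unix-like system where the "fork" start method is available.

```python
import multiprocessing as mp

# Assumption: a Unix-like OS where the "fork" start method is available.
_ctx = mp.get_context("fork")


def blocking_start(queue, stop_after_crawl=True):
    """Hypothetical stand-in for crawler.start(); reports the flag it received."""
    queue.put(stop_after_crawl)


def run_in_child(**target_kwargs):
    """Run blocking_start in a child process and return the value it reported."""
    queue = _ctx.Queue()
    # Arguments meant for the target are passed through args= and kwargs=.
    child = _ctx.Process(target=blocking_start, args=(queue,), kwargs=target_kwargs)
    child.start()                      # returns as soon as the child is launched
    received = queue.get(timeout=10)   # read the result before joining
    child.join()
    return received


if __name__ == "__main__":
    print(run_in_child(stop_after_crawl=False))
```

The point of the sketch is only that the keyword argument has to travel through kwargs= to arrive at the target; whether that resolves the reactor error in a given Scrapy setup is a separate question.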
Upvotes: 1
Reputation: 4419
I tried every solution I could find, and the only one that worked for me was this. But to make it work with Scrapy 1.1rc1, I had to tweak it a little:
    from scrapy.crawler import Crawler
    from scrapy import signals
    from scrapy.utils.project import get_project_settings
    from twisted.internet import reactor
    from billiard import Process

    class CrawlerScript(Process):
        def __init__(self, spider):
            Process.__init__(self)
            settings = get_project_settings()
            self.crawler = Crawler(spider.__class__, settings)
            # Stop the reactor (and so the child process) once the spider closes
            self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
            self.spider = spider

        def run(self):
            # Runs in the child process
            self.crawler.crawl(self.spider)
            reactor.run()

    def crawl_async():
        spider = MySpider()  # MySpider is your own spider class
        crawler = CrawlerScript(spider)
        crawler.start()
        # join() waits for the child process to exit; drop it to return immediately
        crawler.join()
So now when I call crawl_async, it starts crawling and doesn't block my current thread. I'm completely new to Scrapy, so maybe this isn't a very good solution, but it worked for me.
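For anyone weighing start() against join() in a Process subclass like this, the following sketch may help. It uses the standard library's multiprocessing module rather than billiard or Scrapy, with a hypothetical SleepyWorker class whose run() simply sleeps, and it assumes a Unix-like system where the "fork" start method is available. It measures how long start() takes compared with start() followed by join().

```python
import multiprocessing as mp
import time

# Assumption: a Unix-like OS where the "fork" start method is available.
_ctx = mp.get_context("fork")


class SleepyWorker(_ctx.Process):
    """Hypothetical stand-in for a crawler process: run() blocks for `seconds`."""

    def __init__(self, seconds):
        super().__init__()
        self.seconds = seconds

    def run(self):
        # Executed in the child process, like reactor.run() would be.
        time.sleep(self.seconds)


def measure(seconds=1.0):
    """Return (time spent in start(), time until join() returned, exit code)."""
    worker = SleepyWorker(seconds)
    t0 = time.monotonic()
    worker.start()                       # launches the child and returns
    start_elapsed = time.monotonic() - t0
    worker.join()                        # waits until the child has exited
    total_elapsed = time.monotonic() - t0
    return start_elapsed, total_elapsed, worker.exitcode


if __name__ == "__main__":
    print(measure())
```

In this sketch start() returns well before the worker finishes, while join() holds the caller until the child exits, so a caller that needs to keep working would call join() later or not at all.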
I used these versions of the libraries:
cffi==1.5.0
Scrapy==1.1rc1
Twisted==15.5.0
billiard==3.3.0.22
Upvotes: 5