Reputation: 11
I am working on a Scrapy project and want to run multiple spiders at the same time. This is the code I use to run the spiders from a script, but I am getting an error. How can I do this?
from spiders.DmozSpider import DmozSpider
from spiders.CraigslistSpider import CraigslistSpider
from scrapy import signals, log
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings

TO_CRAWL = [DmozSpider, CraigslistSpider]
RUNNING_CRAWLERS = []

def spider_closing(spider):
    """Activates on spider closed signal"""
    log.msg("Spider closed: %s" % spider, level=log.INFO)
    RUNNING_CRAWLERS.remove(spider)
    if not RUNNING_CRAWLERS:
        reactor.stop()

log.start(loglevel=log.DEBUG)
for spider in TO_CRAWL:
    settings = Settings()
    # crawl responsibly
    settings.set("USER_AGENT", "Kiran Koduru (+http://kirankoduru.github.io)")
    crawler = Crawler(settings)
    crawler_obj = spider()
    RUNNING_CRAWLERS.append(crawler_obj)
    # stop reactor when spider closes
    crawler.signals.connect(spider_closing, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(crawler_obj)
    crawler.start()

reactor.run()
Upvotes: 1
Views: 1593
Reputation: 9829
You need something like the code below. It is covered in the Scrapy docs :)
The first utility you can use to run your spiders is scrapy.crawler.CrawlerProcess. This class starts a Twisted reactor for you, configures logging, and sets shutdown handlers. It is the class used by all Scrapy commands. Each process.crawl() call schedules one spider, and process.start() then runs all scheduled spiders in the same process and blocks until they finish.
# -*- coding: utf-8 -*-
import sys
import logging
import traceback
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from spiders.DmozSpider import DmozSpider
from spiders.CraigslistSpider import CraigslistSpider

SPIDER_LIST = [
    DmozSpider, CraigslistSpider
]

if __name__ == "__main__":
    try:
        # set up the crawler process with the project settings
        process = CrawlerProcess(get_project_settings())
        # schedule every spider; nothing runs until process.start()
        for spider in SPIDER_LIST:
            process.crawl(spider)
        # start the reactor and block until all spiders have finished
        process.start()
    except Exception:
        # log the failing line number and the full traceback
        logging.error('Error on line {}'.format(sys.exc_info()[-1].tb_lineno))
        logging.error("Exception: %s" % traceback.format_exc())
References: http://doc.scrapy.org/en/latest/topics/practices.html
Upvotes: 2
Reputation: 2204
Sorry for not answering the question itself, but I would like to bring scrapyd and Scrapinghub to your attention (at least for a quick test). reactor.run() (once you get it working) will run any number of Scrapy instances on a single CPU. Do you want this side effect? If you look at scrapyd's code, you will see that it does not run multiple instances in a single thread; it forks/spawns subprocesses instead.
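To illustrate the subprocess approach mentioned above, here is a minimal sketch that starts each spider in its own OS process through the `scrapy crawl` command. It is not how scrapyd itself is implemented, only the same general idea. The helper names `build_commands` and `run_in_subprocesses` are hypothetical, and it assumes the `scrapy` executable is on the PATH and that the script is started from (or given) the project directory.

```python
import subprocess


def build_commands(spider_names):
    """Return one `scrapy crawl <name>` argument list per spider name."""
    return [["scrapy", "crawl", name] for name in spider_names]


def run_in_subprocesses(spider_names, cwd=None):
    """Start every spider in its own process and wait for all of them.

    `cwd` should point at the Scrapy project directory (the one that
    contains scrapy.cfg); None means the current working directory.
    Returns the exit code of each process, in the order given.
    """
    # Launch all processes first so the spiders run in parallel.
    procs = [subprocess.Popen(cmd, cwd=cwd) for cmd in build_commands(spider_names)]
    # Popen.wait() blocks until that process exits and returns its exit code.
    return [proc.wait() for proc in procs]
```

Calling `run_in_subprocesses(["dmoz", "craigslist"])` would then launch both spiders side by side. The names "dmoz" and "craigslist" are assumptions here: use whatever each spider class defines in its `name` attribute. Because every spider gets its own process and its own reactor, the operating system can schedule them on separate CPU cores.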
Upvotes: 2