gururaj bagali

Reputation: 11

Run multiple spiders from script in scrapy

I am working on a Scrapy project and want to run multiple spiders at the same time. This is my code for running the spiders from a script, but I am getting an error. How do I do this?

from spiders.DmozSpider import DmozSpider
from spiders.CraigslistSpider import CraigslistSpider

from scrapy import signals, log
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings

TO_CRAWL = [DmozSpider, CraigslistSpider]

RUNNING_CRAWLERS = []

def spider_closing(spider):
    """Activates on spider closed signal"""
    log.msg("Spider closed: %s" % spider, level=log.INFO)
    RUNNING_CRAWLERS.remove(spider)
    if not RUNNING_CRAWLERS:
        reactor.stop()

log.start(loglevel=log.DEBUG)

for spider in TO_CRAWL:
    settings = Settings()

    # crawl responsibly
    settings.set("USER_AGENT", "Kiran Koduru (+http://kirankoduru.github.io)")
    crawler = Crawler(settings)
    crawler_obj = spider()
    RUNNING_CRAWLERS.append(crawler_obj)

    # stop reactor when spider closes
    crawler.signals.connect(spider_closing, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(crawler_obj)
    crawler.start()

# blocks the process, so always keep it as the last statement
reactor.run()

Upvotes: 1

Views: 1593

Answers (2)

hungneox

Reputation: 9829

You need something like the code below. You can easily find it in the Scrapy docs :)

First utility you can use to run your spiders is scrapy.crawler.CrawlerProcess. This class will start a Twisted reactor for you, configuring the logging and setting shutdown handlers. This class is the one used by all Scrapy commands.

# -*- coding: utf-8 -*-
import sys
import logging
import traceback
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from spiders.DmozSpider import DmozSpider
from spiders.CraigslistSpider import CraigslistSpider

SPIDER_LIST = [
    DmozSpider, CraigslistSpider
]

if __name__ == "__main__":
    try:
        # set up the crawler process, schedule every spider, then run them all
        process = CrawlerProcess(get_project_settings())
        for spider in SPIDER_LIST:
            process.crawl(spider)
        process.start()
    except Exception:
        exc_type, exc_obj, exc_tb = sys.exc_info()
        logging.info('Error on line {}'.format(exc_tb.tb_lineno))
        logging.info("Exception: %s" % traceback.format_exc())

References: http://doc.scrapy.org/en/latest/topics/practices.html
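
If other code in your script also needs the Twisted reactor, the same practices page documents scrapy.crawler.CrawlerRunner, which lets you start and stop the reactor yourself. A minimal sketch, assuming the same two spider classes imported above:

# -*- coding: utf-8 -*-
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from spiders.DmozSpider import DmozSpider
from spiders.CraigslistSpider import CraigslistSpider

if __name__ == "__main__":
    configure_logging()
    runner = CrawlerRunner(get_project_settings())

    # schedule both spiders on the (not yet running) reactor
    runner.crawl(DmozSpider)
    runner.crawl(CraigslistSpider)

    # stop the reactor once every scheduled crawl has finished
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())

    reactor.run()  # blocks until all crawls are done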

Upvotes: 2

neverlastn

Reputation: 2204

Sorry not to answer the question itself, but let me bring scrapyd and scrapinghub to your attention (at least for a quick test). reactor.run() (once you get it working) will run any number of Scrapy instances on a single CPU. Do you want that side effect? If you look at scrapyd's code, it doesn't run multiple instances in a single thread; it forks/spawns subprocesses.
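
To get that kind of isolation from a plain script, one option is to spawn each crawl as its own child process. A rough sketch, assuming the spiders are registered under the names "dmoz" and "craigslist" in your project (replace these with the actual name attributes of your spiders):

# -*- coding: utf-8 -*-
import subprocess

# assumed spider names; use the `name` attributes of your own spiders
SPIDER_NAMES = ["dmoz", "craigslist"]

# launch one `scrapy crawl` subprocess per spider, so each crawl gets its
# own Python process (and therefore its own reactor and CPU core)
procs = [subprocess.Popen(["scrapy", "crawl", name]) for name in SPIDER_NAMES]

# wait for every crawl to finish
for proc in procs:
    proc.wait()

This is only a sketch of the fork/spawn idea described above; scrapyd manages those subprocesses (scheduling, logs, job state) for you.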

Upvotes: 2
