Username

Reputation: 3663

How do I run multiple Scrapy spiders that each scrape a different URL?

I have a spiders.py in a Scrapy project with the following spiders...

import scrapy

class OneSpider(scrapy.Spider):
    name = "s1"

    def start_requests(self):
        urls = ["http://url1.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Scrape stuff, put it in a dict
        dictOfScrapedStuff = {}  # placeholder for the scraped fields
        yield dictOfScrapedStuff

class TwoSpider(scrapy.Spider):
    name = "s2"

    def start_requests(self):
        urls = ["http://url2.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Scrape stuff, put it in a dict
        dictOfScrapedStuff = {}  # placeholder for the scraped fields
        yield dictOfScrapedStuff

How do I run spiders s1 and s2, and write their scraped results to s1.json and s2.json?

Upvotes: 0

Views: 389

Answers (1)

Granitosaurus

Reputation: 21406

The scrapy crawl command runs a single spider per process, so you'd simply run two processes:

scrapy crawl s1 -o s1.json
scrapy crawl s2 -o s2.json

If you want to run them in the same terminal window you'd have to either:

  • run the 1st spider -> suspend it with ctrl+z and send it to the background with bg -> run the 2nd spider
  • use nohup, e.g.:

    nohup scrapy crawl s1 -o s1.json --logfile s1.log &
    nohup scrapy crawl s2 -o s2.json --logfile s2.log &

  • use the screen command:

    $ screen
    $ scrapy crawl s1 -o s1.json
    # press ctrl+a then d to detach from this screen session
    $ screen
    $ scrapy crawl s2 -o s2.json
    # press ctrl+a then d to detach again
    $ screen -ls           # list your detached sessions
    $ screen -r <session>  # reattach to one of them to see how the spider is doing

Personally I prefer the nohup or screen options, as they are clean and don't clutter your terminal with logging and whatnot.
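If you'd rather launch both spiders from a single Python script instead of two shell commands, Scrapy's CrawlerProcess API can also do that. Below is a minimal sketch, assuming Scrapy >= 2.1 (for the FEEDS setting; older versions use FEED_URI/FEED_FORMAT) and a placeholder import path myproject.spiders standing in for your own project's spiders module:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # "myproject.spiders" is a placeholder for your project's spiders module
    from myproject.spiders import OneSpider, TwoSpider

    # Give each spider its own output feed; custom_settings is merged into
    # the project settings when the crawler for that spider is created.
    OneSpider.custom_settings = {"FEEDS": {"s1.json": {"format": "json"}}}
    TwoSpider.custom_settings = {"FEEDS": {"s2.json": {"format": "json"}}}

    process = CrawlerProcess(get_project_settings())
    process.crawl(OneSpider)   # schedule the first spider
    process.crawl(TwoSpider)   # schedule the second spider
    process.start()            # blocks until both spiders have finished

Run the script (saved as e.g. run_spiders.py) from the project root with python run_spiders.py; both spiders share one process and write to s1.json and s2.json respectively.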

Upvotes: 1
