rodeo

Reputation: 25

How to pass data between sequential spiders

I have two spiders that run in sequential order according to https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process. Now I want to pass some information from the first spider to the second (a Selenium webdriver, or its session information).

I'm quite new to scrapy, but on another post it was proposed to save the data to a db and retrieve it from there. This seems like overkill for passing a single variable; is there no other way? (I know in this example I could just merge everything into one long spider, but later I would like to run the first spider once and the second spider multiple times.)

class Spider1(scrapy.Spider):
    # Open a webdriver and get session_id

class Spider2(scrapy.Spider):
    # Get the session_id and run Spider2's code
    def __init__(self, session_id):
        ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1)
    # TODO How to get the session_id?
    # session_id = yield runner.crawl(Spider1) returns None
    # Adding a return statement in Spider1 instead breaks the
    # sequential processing: the program sleeps before running Spider1

    time.sleep(2)

    yield runner.crawl(Spider2(session_id))
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished

I would like to pass the variable to the constructor of the second spider, but I'm unable to get the data out of the first one. If I make the first crawler return the variable, it apparently breaks the sequential structure. If I capture the result of the yield, it is None.

Am I completely blind? I can't believe that this should be such a complex task.

Upvotes: 0

Views: 215

Answers (2)

rodeo

Reputation: 25

You can also just create the webdriver beforehand and pass it to both spiders as an argument. When I first tried this it didn't work because I was passing the arguments incorrectly (see my comment on the question).

class Spider1(scrapy.Spider):
    def __init__(self, driver=None):
        self.driver = driver  # Do whatever with the driver

class Spider2(scrapy.Spider):
    def __init__(self, driver=None):
        self.driver = driver  # This is the same driver Spider1 used


configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    driver = webdriver.Chrome()

    yield runner.crawl(Spider1, driver=driver)
    yield runner.crawl(Spider2, driver=driver)

    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished
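
For reference, `CrawlerRunner.crawl(spidercls, *args, **kwargs)` forwards the extra arguments to the spider's constructor, which is why the spider class and the driver are passed separately rather than instantiating the spider yourself. A minimal sketch of that forwarding, using a plain stand-in class instead of a real spider (no Scrapy involved):

```python
class FakeSpider:
    # Stand-in for a scrapy.Spider: kwargs passed to crawl() end up here.
    def __init__(self, driver=None):
        self.driver = driver

def crawl(spidercls, *args, **kwargs):
    # Simplified view of what CrawlerRunner.crawl does with the spider
    # class: instantiate it with the forwarded arguments.
    return spidercls(*args, **kwargs)

shared = object()  # stands in for the Selenium webdriver
s1 = crawl(FakeSpider, driver=shared)
s2 = crawl(FakeSpider, driver=shared)
assert s1.driver is s2.driver  # both spiders see the same object
```

Passing `Spider2(driver)` instead of `Spider2, driver=driver` hands the runner an instance where it expects a class, which is the mistake I originally made.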

Upvotes: 0

Kamoo

Reputation: 872

You can pass a queue to both spiders and let Spider2 block on queue.get(), so there is no need for time.sleep(2).

# globals.py
from queue import Queue

queue = Queue()
# run.py

import globals


class Spider1(scrapy.Spider):
    def __init__(self):
        # put session_id to `globals.queue` somewhere in `Spider1`, so `Spider2` can start.
        ...

class Spider2(scrapy.Spider):
    def __init__(self):
        session_id = globals.queue.get()

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1)
    yield runner.crawl(Spider2)
    reactor.stop()

crawl()
reactor.run() 
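
Stripped of Scrapy, the handoff is just a standard-library queue: the producer publishes the value once it exists and the consumer blocks until then (the session id value below is made up for illustration):

```python
from queue import Queue

q = Queue()

def spider1_work():
    # Stand-in for Spider1: once the webdriver/session exists,
    # publish its id for the next spider to pick up.
    q.put("some-session-id")  # hypothetical value

def spider2_init():
    # Stand-in for Spider2.__init__: q.get() blocks until Spider1
    # has published, so no fixed sleep is needed.
    return q.get()

spider1_work()
session_id = spider2_init()
```

Because the two crawls run sequentially under inlineCallbacks, the put always happens before the get here; the blocking behaviour only matters if the spiders ever overlap.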

Upvotes: 1
