Reputation: 810
In the Scrapy docs, the example they give for running multiple spiders is something like this:
process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()
However, the problem is that I want to run Spider1, parse the data, and then use the extracted data to run Spider2. If I do something like:
process.crawl(MySpider1)
process.start()
parse_data_from_spider1()
pass_data_to_spider2_class()
process2.crawl(MySpider2)
process2.start()
It gives me the dreaded ReactorNotRestartable error. Could someone guide me on how to do what I'm trying to achieve here?
Upvotes: 0
Views: 108
Reputation: 303
The code you're using from the docs runs multiple spiders concurrently in the same process using the internal API, which is a problem if you need the first spider to finish before the second one starts.
If this is the entire scope of the issue, my suggestion would be to store the data from the first spider somewhere the second one can consume it (database, CSV, JSON Lines), and then bring that data into the second spider run, either in the spider definition (where name is defined, or, if you've got subclasses of scrapy.Spider, maybe in __init__) or in the start_requests() method.
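For example, here's a minimal sketch of the start_requests() approach, assuming (hypothetically) that the first spider exported its items to a JSON Lines file named items_spider1.jl and that each item has a url field:
import json
import scrapy

class MySpider2(scrapy.Spider):
    name = "myspider2"

    def start_requests(self):
        # items_spider1.jl is a hypothetical JSON Lines export from the first run
        # (e.g. produced via the FEEDS setting or the -o command-line option)
        with open("items_spider1.jl") as f:
            for line in f:
                item = json.loads(line)
                yield scrapy.Request(item["url"], callback=self.parse)

    def parse(self, response):
        # continue extracting whatever MySpider2 is responsible for
        ...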
Then you'll have to run the spiders sequentially; see the CrawlerRunner() example with chained deferreds in the common practices section of the docs:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # each yield waits for the previous crawl to finish before the next one starts
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until both crawls are finished
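If you'd rather not hard-code the handoff inside MySpider2, a variation (just a sketch, where load_urls_for_spider2() is a hypothetical helper that parses whatever the first spider stored) is to read the data between the two crawls and pass it through the keyword arguments of runner.crawl(), which Scrapy forwards to the spider's constructor:
@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    # read the first spider's output (database, CSV, JSON Lines) once it has finished
    start_urls = load_urls_for_spider2()
    # keyword arguments to crawl() are passed to MySpider2's constructor,
    # so the default scrapy.Spider.__init__ will set them as spider attributes
    yield runner.crawl(MySpider2, start_urls=start_urls)
    reactor.stop()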
Upvotes: 1