k_wit

Reputation: 511

ReactorNotRestartable error in while loop with scrapy

I get a twisted.internet.error.ReactorNotRestartable error when I execute the following code:

from time import sleep
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

result = None

def set_result(item):
    result = item

while True:
    process = CrawlerProcess(get_project_settings())
    dispatcher.connect(set_result, signals.item_scraped)

    process.crawl('my_spider')
    process.start()

    if result:
        break
    sleep(3)

The first time it works, but then I get the error. I create the process variable on each iteration, so what's the problem?

Upvotes: 41

Views: 31228

Answers (11)

After more than 10 days I figured out how to solve this problem. I need to run the scrape every 10 seconds, so I use this code:

import time
import schedule
from multiprocessing import Process, Queue
import scrapy.crawler as crawler
from scraper.spiders.blalblaSpider import blalblaSpider


def run_crawler(q, spider):
    # Run the crawl in a child process so a fresh Twisted reactor
    # is created and discarded on every run.
    try:
        custom_settings = {
            'MONGODB_URI': 'mongodb+srv:yourmongo/',
            'MONGODB_DATABASE': 'db',
            'USER_AGENT': 'customuseragent'
        }

        process = crawler.CrawlerProcess(custom_settings)
        process.crawl(spider)
        process.start()
        q.put(None)
    except Exception as e:
        q.put(e)


def run_spider_wrapper():
    run_spider(blalblaSpider)


def run_spider(spider):
    q = Queue()
    p = Process(target=run_crawler, args=(q, spider))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result


schedule.every(10).seconds.do(run_spider_wrapper)

if __name__ == "__main__":
    while True:
        schedule.run_pending()
        time.sleep(5)

and my spider is:

import scrapy

from ..items import ScraperblalblaSpider


class blalblaSpider(scrapy.Spider):
    name = "blalbla"
    allowed_domains = ["www.blablabla.com"]
    start_urls = ["https://www.blablalbal.com"]

    custom_settings = {
        'ITEM_PIPELINES': {'scraper.pipelines.yours': 300}
    }

    def parse(self, response):
        pass

I hope this helps someone who, like me, needs to run a scrape every X seconds.

Upvotes: 0

Vladyslav Babenko

Reputation: 51

My way is multiprocessing, using Process. First, create the spider:

import scrapy

class PricesSpider(scrapy.Spider):
    name = 'prices'
    allowed_domains = ['index.minfin.com.ua']
    start_urls = ['https://index.minfin.com.ua/ua/markets/fuel/tm/']

    def parse(self, response):
        pass

Then I create a function which runs my spider:

#run spider

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

def parser():
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()
    d = runner.crawl(PricesSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()

Then I create a new Python file, import the 'parser' function there, and create a schedule for my spider:

#create schedule for spider

import schedule
from  import parser
from multiprocessing import Process


def worker(pars):
    print('Worker starting')
    pr = Process(target=parser)
    pr.start()
    pr.join()


def main():
    schedule.every().day.at("15:00").do(worker, parser)
    # schedule.every().day.at("20:21").do(worker, parser)
    # schedule.every().day.at("20:23").do(worker, parser)
    # schedule.every(1).minutes.do(worker, parser)
    print('Spider working now')
    while True:
        schedule.run_pending()


if __name__ == '__main__':
    main()

Upvotes: 1

snarik

Reputation: 1225

If you're trying to get a Flask, Django, or FastAPI service working and running into this, and you've tried all the things people suggest about forking a new process to run the reactor and none of it seems to work:

Stop what you're doing and go read this: https://github.com/notoriousno/scrapy-flask

Crochet is your best opportunity to get this working within gunicorn without writing your own crawler from scratch.
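For anyone who wants a concrete starting point, here is a minimal sketch of that approach (it is not code from the linked repo; QuotesSpider, the /crawl route, and the 60-second timeout are illustrative assumptions). crochet starts the Twisted reactor once in a background thread, so each request can schedule a crawl without ever restarting the reactor:

import crochet
crochet.setup()  # start the reactor in a background thread, once per process

from flask import Flask, jsonify
import scrapy
from scrapy.crawler import CrawlerRunner

app = Flask(__name__)
runner = CrawlerRunner()


class QuotesSpider(scrapy.Spider):
    # Illustrative placeholder spider
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/tag/humor/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}


@crochet.wait_for(timeout=60.0)  # block the calling (Flask) thread until the crawl finishes
def run_crawl():
    # Runs in the reactor thread; returning the Deferred lets crochet wait on it.
    return runner.crawl(QuotesSpider)


@app.route("/crawl")
def crawl():
    run_crawl()
    return jsonify({"status": "done"})


if __name__ == "__main__":
    app.run(port=8000)

Under gunicorn the same app object works, provided crochet.setup() runs once in each worker process before any crawl is scheduled.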

Upvotes: 1

Kishan Patel

Reputation: 61

I faced the ReactorNotRestartable error on AWS Lambda, and eventually I came to this solution.

By default, the asynchronous nature of scrapy is not going to work well with Cloud Functions, as we'd need a way to block on the crawl to prevent the function from returning early and the instance being killed before the process terminates.

Instead, we can use scrapydo to run your existing spider in a blocking fashion:

import scrapy
import scrapy.crawler as crawler
from scrapy.spiders import CrawlSpider
import scrapydo

scrapydo.setup()

# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())

scrapydo.run_spider(QuotesSpider)


Upvotes: 3

Gihan Gamage

Reputation: 3354

For a particular process, once you call reactor.run() or process.start() you cannot rerun those commands, because the reactor cannot be restarted. The reactor stops once the script completes its execution.

So the best option is to use a different subprocess for each run if you need to run the reactor multiple times.

You can move the content of the while loop into a function (say execute_crawling) and then simply run it in separate subprocesses; Python's multiprocessing Process can be used for this. Code is given below.

from multiprocessing import Process

def execute_crawling():
    process = CrawlerProcess(get_project_settings())  # the same can be done with CrawlerRunner
    dispatcher.connect(set_result, signals.item_scraped)
    process.crawl('my_spider')
    process.start()

if __name__ == '__main__':
    for k in range(Number_of_times_you_want):
        p = Process(target=execute_crawling)
        p.start()
        p.join()  # this blocks until the process terminates

Upvotes: 7

Mikhail Kravets

Reputation: 617

I would advise you to run the scrapers using the subprocess module:

from subprocess import Popen, PIPE

spider = Popen(["scrapy", "crawl", "spider_name", "-a", "argument=value"], stdout=PIPE)

spider.wait()

Upvotes: 0

Daniel Wyatt

Reputation: 1151

I had a similar issue using Spyder. Running the file from the command line instead fixed it for me.

Spyder seems to work the first time but after that it doesn't. Maybe the reactor stays open and doesn't close?

Upvotes: 0

DovaX

Reputation: 1026

I was able to mitigate this problem using the crochet package, via this simple code based on Christian Aichinger's answer to the duplicate of this question, Scrapy - Reactor not Restartable. The initialization of the spiders is done in the main thread, whereas the actual crawling is done in a different thread. I'm using Anaconda (Windows).

import time
import scrapy
from scrapy.crawler import CrawlerRunner
from crochet import setup

class MySpider(scrapy.Spider):
    name = "MySpider"
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        print(response.text)
        for i in range(1,6):
            time.sleep(1)
            print("Spider "+str(self.name)+" waited "+str(i)+" seconds.")

def run_spider(number):
    crawler = CrawlerRunner()
    crawler.crawl(MySpider,name=str(number))

setup()
for i in range(1,6):
    time.sleep(1)
    print("Initialization of Spider #"+str(i))
    run_spider(i)

Upvotes: 0

Alexis Mejía

Reputation: 51

Ref http://crawl.blog/scrapy-loop/

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from twisted.internet.task import deferLater

def sleep(self, *args, seconds):
    """Non blocking sleep callback"""
    return deferLater(reactor, seconds, lambda: None)

process = CrawlerProcess(get_project_settings())

def _crawl(result, spider):
    deferred = process.crawl(spider)
    deferred.addCallback(lambda results: print('waiting 100 seconds before restart...'))
    deferred.addCallback(sleep, seconds=100)
    deferred.addCallback(_crawl, spider)
    return deferred


_crawl(None, MySpider)
process.start()

Upvotes: 5

Sagun Shrestha

Reputation: 1198

I was able to solve this problem like this. process.start() should be called only once.

from time import sleep
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

result = None

def set_result(item):
    result = item

while True:
    process = CrawlerProcess(get_project_settings())
    dispatcher.connect(set_result, signals.item_scraped)

    process.crawl('my_spider')

process.start()

Upvotes: 6

paul trmbrth

Reputation: 20748

By default, CrawlerProcess's .start() will stop the Twisted reactor it creates when all crawlers have finished.

You should call process.start(stop_after_crawl=False) if you create process in each iteration.

Another option is to handle the Twisted reactor yourself and use CrawlerRunner. The docs have an example of doing that; a minimal version is sketched below.
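For reference, a minimal sketch of the CrawlerRunner route, closely following the example in Scrapy's documentation (MySpider is a placeholder for your own spider class):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)           # schedule the crawl; returns a Deferred
d.addBoth(lambda _: reactor.stop())  # stop the reactor when the crawl finishes (or fails)
reactor.run()                        # blocks here until reactor.stop() is called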

Upvotes: 21
