Sovon

Reputation: 1824

Scrapy 'twisted.internet.error.ReactorNotRestartable' error after first run

I'm running Scrapy (version 1.4.0) from a script using CrawlerProcess, with the URLs coming from user input. The first run works fine, but the second run raises twisted.internet.error.ReactorNotRestartable, and the program gets stuck there.

Crawler process section:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

# GeneralSpider is defined elsewhere in the project
process.crawl(GeneralSpider)

print('~~~~~~~~~~~~ Processing is going to be started ~~~~~~~~~~')
process.start()
print('~~~~~~~~~~~~ Processing ended ~~~~~~~~~~')
process.stop()

Here is the first run output:

~~~~~~~~~~~~ Processing is going to be started ~~~~~~~~~~
2017-07-17 05:58:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.some-url.com/content.php> (referer: None)
2017-07-17 05:58:46 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'HtmlResponse' in <GET http://www.some-url.com/content.php>
2017-07-17 05:58:46 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-17 05:58:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 261,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 14223,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 7, 17, 5, 58, 46, 760661),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'memusage/max': 49983488,
 'memusage/startup': 49983488,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 7, 17, 5, 58, 45, 162155)}
2017-07-17 05:58:46 [scrapy.core.engine] INFO: Spider closed (finished)
~~~~~~~~~~~~ Processing ended ~~~~~~~~~~

When I try to run it a second time, it raises this error:

~~~~~~~~~~~~ Processing is going to be started ~~~~~~~~~~
[2017-07-17 06:03:18,075] ERROR in app: Exception on /scripts/1/process [GET]
Traceback (most recent call last):
  File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/flask/app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/flask/app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "api.py", line 13, in process_crawler
    processor.process()
  File "/var/www/python/crawlerapp/application/scripts/general_spider.py", line 124, in process
    process.start()
  File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/scrapy/crawler.py", line 285, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/twisted/internet/base.py", line 1242, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/twisted/internet/base.py", line 1222, in startRunning
    ReactorBase.startRunning(self)
  File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/twisted/internet/base.py", line 730, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

How can I restart the reactor, or stop it after each process finishes?

There are some similar questions on Stack Overflow, but their solutions are for old versions of Scrapy and didn't work for me.

Upvotes: 3

Views: 6020

Answers (2)

Georgy K

Reputation: 245

Try starting the crawl in a separate process, so each run gets a fresh reactor:

from multiprocessing import Process
from scrapy.crawler import CrawlerProcess

def crawl():
    # The reactor is created and runs entirely inside the child process,
    # so it can be started cleanly on every run
    crawler = CrawlerProcess(settings)  # your project settings
    crawler.crawl(MySpider)
    crawler.start()

process = Process(target=crawl)
process.start()
process.join()
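
For the Flask app shown in the question's traceback, a minimal sketch of this approach might look like the following. The route and return value are assumptions based on the traceback, and GeneralSpider is the spider from the question; this is a sketch, not a definitive implementation:

from multiprocessing import Process
from flask import Flask
from scrapy.crawler import CrawlerProcess

app = Flask(__name__)

def crawl():
    # Each child process creates and runs its own Twisted reactor,
    # so repeated requests never try to restart one
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(GeneralSpider)  # assumed to be imported from your project
    process.start()

@app.route('/scripts/<int:script_id>/process')
def process_crawler(script_id):
    p = Process(target=crawl)
    p.start()
    p.join()  # drop this line to run the crawl in the background instead
    return 'Crawl finished'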

Upvotes: 8

Shariful Islam

Reputation: 576

You can pass stop_after_crawl=False when starting the process:

process.start(stop_after_crawl=False)
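
For context, stop_after_crawl=False tells Scrapy not to stop the reactor once the queued crawls finish, so the reactor is never shut down in the first place. A minimal sketch of the call in context (GeneralSpider is the spider from the question):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(GeneralSpider)  # queue the crawl(s) before starting
process.start(stop_after_crawl=False)  # reactor keeps running afterwards

Note that process.start() then blocks until the reactor is stopped some other way, so this fits a long-running worker better than a per-request handler.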

Hope this solves your problem.

Upvotes: 1
