Reputation: 1824
I'm running Scrapy (version 1.4.0) from a script using CrawlerProcess. The URLs come from user input. The first time it runs fine, but the second time it raises a twisted.internet.error.ReactorNotRestartable
error, and the program gets stuck there.
Here is the crawler process section:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(GeneralSpider)
print('~~~~~~~~~~~~ Processing is going to be started ~~~~~~~~~~')
process.start()  # blocking call; returns after the crawl finishes
print('~~~~~~~~~~~~ Processing ended ~~~~~~~~~~')
process.stop()
Here is the first run output:
~~~~~~~~~~~~ Processing is going to be started ~~~~~~~~~~
2017-07-17 05:58:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.some-url.com/content.php> (referer: None)
2017-07-17 05:58:46 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'HtmlResponse' in <GET http://www.some-url.com/content.php>
2017-07-17 05:58:46 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-17 05:58:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 261,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 14223,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 7, 17, 5, 58, 46, 760661),
'log_count/DEBUG': 2,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'memusage/max': 49983488,
'memusage/startup': 49983488,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 7, 17, 5, 58, 45, 162155)}
2017-07-17 05:58:46 [scrapy.core.engine] INFO: Spider closed (finished)
~~~~~~~~~~~~ Processing ended ~~~~~~~~~~
When I try to run it a second time, it raises this error:
~~~~~~~~~~~~ Processing is going to be started ~~~~~~~~~~
[2017-07-17 06:03:18,075] ERROR in app: Exception on /scripts/1/process [GET]
Traceback (most recent call last):
File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/flask/app.py", line 1982, in wsgi_app
response = self.full_dispatch_request()
File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/flask/app.py", line 1614, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/flask/app.py", line 1517, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/flask/_compat.py", line 33, in reraise
raise value
File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/flask/app.py", line 1612, in full_dispatch_request
rv = self.dispatch_request()
File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/flask/app.py", line 1598, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "api.py", line 13, in process_crawler
processor.process()
File "/var/www/python/crawlerapp/application/scripts/general_spider.py", line 124, in process
process.start()
File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/scrapy/crawler.py", line 285, in start
reactor.run(installSignalHandlers=False) # blocking call
File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/twisted/internet/base.py", line 1242, in run
self.startRunning(installSignalHandlers=installSignalHandlers)
File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/twisted/internet/base.py", line 1222, in startRunning
ReactorBase.startRunning(self)
File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/twisted/internet/base.py", line 730, in startRunning
raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
How can I restart the reactor, or stop it after each process finishes?
There are similar questions on Stack Overflow, but their solutions are for older versions of Scrapy, so I couldn't use them.
Upvotes: 3
Views: 6020
Reputation: 245
Try starting your function in a separate process:
from multiprocessing import Process

from scrapy.crawler import CrawlerProcess

def crawl():
    crawler = CrawlerProcess(settings)
    crawler.crawl(MySpider)
    crawler.start()

# A fresh process means a fresh Twisted reactor, so this can run repeatedly
process = Process(target=crawl)
process.start()
process.join()
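Since the URLs in the question come from user input, the same idea extends by forwarding arguments into the subprocess. A minimal sketch, assuming MySpider accepts a start_url keyword (the run_crawl helper and the start_url parameter are mine, not part of the answer):

from multiprocessing import Process

from scrapy.crawler import CrawlerProcess

def crawl(url):
    crawler = CrawlerProcess(settings)
    # Extra keyword arguments to crawl() are passed to the spider's constructor
    crawler.crawl(MySpider, start_url=url)
    crawler.start()

def run_crawl(url):
    # One process per crawl: each call gets its own reactor, so it can be
    # invoked repeatedly from a long-lived app such as the Flask server above
    p = Process(target=crawl, args=(url,))
    p.start()
    p.join()

run_crawl('http://www.some-url.com/content.php')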
Upvotes: 8
Reputation: 576
You can add stop_after_crawl=False to the start() call:
process.start(stop_after_crawl=False)
Hope this solves your problem.
Thanks
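For context, here is where the argument goes in the question's snippet (a minimal sketch reusing the question's GeneralSpider). My reading of the Scrapy 1.4 API, not something this answer states: with stop_after_crawl=False the reactor is not stopped when the crawl finishes, so ReactorNotRestartable cannot occur, but start() then blocks until the reactor is stopped some other way:

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(GeneralSpider)
process.start(stop_after_crawl=False)  # reactor keeps running after the crawl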
Upvotes: 1