Reputation: 9024
I have a spider which takes URLs from a Redis list.
I want to close the spider nicely when there are no more URLs to crawl. I tried raising the CloseSpider
exception, but it seems execution never reaches that point:
def start_requests(self):
    while True:
        item = json.loads(self.__pop_queue())
        if not item:
            raise CloseSpider("Closing spider because no more urls to crawl")
        try:
            yield scrapy.http.Request(item['product_url'], meta={'item': item})
        except ValueError:
            continue
Even though I am raising the CloseSpider exception, I am still getting the error below:
root@355e42916706:/scrapper# scrapy crawl general -a country=my -a log=file
2017-07-17 12:05:13 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
request = next(slot.start_requests)
File "/scrapper/scrapper/spiders/GeneralSpider.py", line 20, in start_requests
item = json.loads(self.__pop_queue())
File "/usr/local/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
TypeError: expected string or buffer
Moreover, I also tried to catch the TypeError in the same function, but that doesn't work either.
Is there a recommended way to handle this?
Thanks
Upvotes: 2
Views: 1051
Reputation: 2061
I had the same problem and found a little trick. When the spider is idle (i.e. when it is doing nothing), I check whether there is still something left in the Redis queue. If not, I close the spider with close_spider()
. The following code is located in the spider
class:
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    from_crawler = super(SerpSpider, cls).from_crawler
    spider = from_crawler(crawler, *args, **kwargs)
    # call self.idle() every time the spider goes idle
    crawler.signals.connect(spider.idle, signal=scrapy.signals.spider_idle)
    return spider

def idle(self):
    # nothing left in the Redis list -> shut the spider down cleanly
    if self.q.llen(self.redis_key) <= 0:
        self.crawler.engine.close_spider(self, reason='finished')
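For reference, self.q and self.redis_key above are just a redis-py client and the name of the list being polled; they are not defined in the snippet, so here is a minimal sketch of how they might be set up (the class name, list name, and connection details are assumptions):

import redis
import scrapy


class SerpSpider(scrapy.Spider):
    name = 'serp'
    # assumed name of the Redis list holding the queued items
    redis_key = 'serp:start_urls'

    def __init__(self, *args, **kwargs):
        super(SerpSpider, self).__init__(*args, **kwargs)
        # plain redis-py client; llen() returns 0 for an empty or missing list,
        # which is what the idle() check above relies on
        self.q = redis.StrictRedis(host='localhost', port=6379, db=0)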
Upvotes: 1
Reputation: 25829
You need to check whether self.__pop_queue()
returns something before you pass it to json.loads()
(or catch the TypeError
when calling it), something like:
def start_requests(self):
    while True:
        item = self.__pop_queue()
        if not item:
            raise CloseSpider("Closing spider because no more urls to crawl")
        try:
            item = json.loads(item)
            yield scrapy.http.Request(item['product_url'], meta={'item': item})
        except (ValueError, TypeError):  # just in case 'item' is not a string or buffer
            continue
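The TypeError in your traceback appears because json.loads() is being handed None: with redis-py, popping from an empty list returns None instead of a string. Your __pop_queue() isn't shown, but it presumably looks something like this sketch (self.redis and self.queue_key are assumed names):

def __pop_queue(self):
    # lpop() returns None when the list is empty, which is why
    # json.loads(self.__pop_queue()) raised TypeError in the question
    return self.redis.lpop(self.queue_key)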
Upvotes: 4