Reputation: 9024
I have a spider which takes URLs from a Redis list.
I want to close the spider nicely when there are no more URLs to crawl. I tried raising the CloseSpider
exception, but it seems execution never reaches that point:
def start_requests(self):
    while True:
        item = json.loads(self.__pop_queue())
        if not item:
            raise CloseSpider("Closing spider because no more urls to crawl")
        try:
            yield scrapy.http.Request(item['product_url'], meta={'item': item})
        except ValueError:
            continue
Even though I am raising the CloseSpider exception, I am still getting the error below:
root@355e42916706:/scrapper# scrapy crawl general -a country=my -a log=file
2017-07-17 12:05:13 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 127, in _next_request
request = next(slot.start_requests)
File "/scrapper/scrapper/spiders/GeneralSpider.py", line 20, in start_requests
item = json.loads(self.__pop_queue())
File "/usr/local/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
TypeError: expected string or buffer
Moreover, I also tried to catch the TypeError in the same function, but that doesn't work either.
Is there a recommended way to handle this?
Thanks
Upvotes: 2
Views: 1051
Reputation: 2061
I had the same problem and found a little trick. When the spider is idle (i.e. when it is doing nothing), I check whether there is still something left in the Redis queue. If not, I close the spider with close_spider()
. The following code is located in the spider
class:
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    from_crawler = super(SerpSpider, cls).from_crawler
    spider = from_crawler(crawler, *args, **kwargs)
    # call self.idle() every time the spider goes idle
    crawler.signals.connect(spider.idle, signal=scrapy.signals.spider_idle)
    return spider

def idle(self):
    # nothing left in the Redis list -> shut the spider down cleanly
    if self.q.llen(self.redis_key) <= 0:
        self.crawler.engine.close_spider(self, reason='finished')
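For reference, self.q and self.redis_key above are just a redis-py client and the name of the list being polled; they are not defined in the snippet, so here is a minimal sketch of how they might be set up (the class name, list name, and connection details are assumptions):

import redis
import scrapy


class SerpSpider(scrapy.Spider):
    name = 'serp'
    # assumed name of the Redis list holding the queued items
    redis_key = 'serp:start_urls'

    def __init__(self, *args, **kwargs):
        super(SerpSpider, self).__init__(*args, **kwargs)
        # plain redis-py client; llen() returns 0 for an empty or missing list,
        # which is what the idle() check above relies on
        self.q = redis.StrictRedis(host='localhost', port=6379, db=0)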
Upvotes: 1
Reputation: 25829
You need to check whether self.__pop_queue()
returns something before you pass it to json.loads()
(or catch the TypeError
when calling it), something like:
def start_requests(self):
    while True:
        item = self.__pop_queue()
        if not item:
            raise CloseSpider("Closing spider because no more urls to crawl")
        try:
            item = json.loads(item)
            yield scrapy.http.Request(item['product_url'], meta={'item': item})
        except (ValueError, TypeError):  # just in case 'item' is not a string or buffer
            continue
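The TypeError in your traceback appears because json.loads() is being handed None: with redis-py, popping from an empty list returns None instead of a string. Your __pop_queue() isn't shown, but it presumably looks something like this sketch (self.redis and self.queue_key are assumed names):

def __pop_queue(self):
    # lpop() returns None when the list is empty, which is why
    # json.loads(self.__pop_queue()) raised TypeError in the question
    return self.redis.lpop(self.queue_key)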
Upvotes: 4