Reputation: 2327
I want to issue N scrapy requests from start_requests. N is dynamic because I want to loop through all the pages of an API, and I do not know the number of pages beforehand. I only know that once I go past the last page, the API responds with an empty JSON. I want to do something like:
url = "https://example.com?page={}"
def start_requests(self):
page = 0
while True:
page += 1
yield scrapy.Request(url=url.format(page), callback=self.parse)
def parse(self, response, **kwargs):
data = json.loads(response.body)
if 'key' in data:
# parse and yield an item
pass
else:
# do not yield an item and break while loop in start_requests
I do not know how to achieve this. Can I return a value from the callback (instead of yield) when the condition is met?
Upvotes: 1
Views: 174
Reputation: 17355
No, but you can set a class attribute as a flag that tells start_requests to stop. Scrapy consumes the start_requests generator lazily, so the loop condition is re-checked each time the scheduler pulls a new request. For example:
import json

import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"  # every spider needs a unique name
    url = "https://example.com?page={}"
    keep_crawling = True

    def start_requests(self):
        page = 0
        while self.keep_crawling:
            page += 1
            # url is a class attribute, so access it via self
            yield scrapy.Request(url=self.url.format(page), callback=self.parse)

    def parse(self, response, **kwargs):
        data = json.loads(response.body)
        if 'key' in data:
            # parse and yield an item
            pass
        else:
            # empty page reached: stop generating new requests
            self.keep_crawling = False
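Worth noting: because Scrapy schedules requests concurrently, a few extra pages may already be queued or in flight by the time the empty response flips the flag; those will simply reach parse with empty JSON and yield nothing. If even those extra requests are unwanted, a minimal alternative sketch (not part of the original answer; it reuses the question's placeholder URL and 'key' check, and assumes Scrapy 1.7+ for cb_kwargs) is to request only page 1 up front and let each callback schedule the next page:

import json

import scrapy


class ChainedPaginationSpider(scrapy.Spider):
    name = "chained_pagination"  # hypothetical name for illustration
    url = "https://example.com?page={}"

    def start_requests(self):
        # Only the first page is requested eagerly.
        yield scrapy.Request(
            url=self.url.format(1),
            callback=self.parse,
            cb_kwargs={"page": 1},
        )

    def parse(self, response, page, **kwargs):
        data = json.loads(response.body)
        if 'key' in data:
            # parse and yield an item, then chain the next page
            yield scrapy.Request(
                url=self.url.format(page + 1),
                callback=self.parse,
                cb_kwargs={"page": page + 1},
            )
        # else: empty page reached; yielding nothing ends the crawl

The trade-off is that pages are fetched one after another instead of in parallel, but the crawl stops exactly at the first empty page.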
Upvotes: 2