Guru

Reputation: 25

Count scraped items from scrapy during execution and pause or sleep after a certain number of pages

I have more than 210000 links which I want to scrape. Is there any way we can print how many links it has completed scraping during the execution and sleep or pause execution for 10 mins after every 10000 page count?

Upvotes: 0

Views: 502

Answers (1)

Felix Eklöf

Reputation: 3720

If you just want to print the number of each page as it finishes scraping it, you can do something like this. Keep in mind, though, that Scrapy is not guaranteed to parse the pages in the order they were yielded.

def start_requests(self):
    # pages 1 through 210000
    for i in range(1, 210001):
        yield scrapy.Request(
            url=f'https://someurl.com?page={i}',
            meta={'page': i}
        )

def parse(self, response):
    page = response.meta.get('page')
    print('Parsed page #' + str(page))

If you want to see the "progress", you can do something like this:

def __init__(self, *args, **kwargs):
    self.parsed_pages = 0
    self.total_pages = 210000
    super().__init__(*args, **kwargs)


def start_requests(self):
    for i in range(1, self.total_pages + 1):
        yield scrapy.Request(
            url=f'https://someurl.com?page={i}',
            meta={'page': i}
        )

def parse(self, response):
    page = response.meta.get('page')
    self.parsed_pages += 1
    print(f'Parsed {self.parsed_pages} of {self.total_pages}')

If you want to pause for 10 minutes, I would not recommend using a sleep function, since that blocks the whole process. (If you have no other spiders running then I guess you could do it, but it's not good practice.)

Instead, I would schedule the spider to run at an interval and limit each run to 10,000 pages, as sketched below.
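A minimal sketch of the per-run limit, assuming the scheduling itself happens outside Scrapy (e.g. with cron) and that a hypothetical start_page argument tells each run where to resume; CLOSESPIDER_PAGECOUNT is the built-in setting that closes the spider after that many responses:

import scrapy

class PagedSpider(scrapy.Spider):
    # Hypothetical spider name and URL, for illustration only.
    name = 'paged'

    # Close the spider after 10,000 responses have been downloaded
    # (handled by Scrapy's CloseSpider extension).
    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 10000,
    }

    def __init__(self, start_page=1, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_page = int(start_page)

    def start_requests(self):
        for i in range(self.start_page, 210001):
            yield scrapy.Request(
                url=f'https://someurl.com?page={i}',
                meta={'page': i}
            )

    def parse(self, response):
        pass  # your parsing logic here

Each scheduled run can then be started with something like scrapy crawl paged -a start_page=10001 so it continues where the previous run stopped.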

Or, if you just need to throttle how many requests you're sending out, you can set DOWNLOAD_DELAY in settings.py to a number of seconds (floats work too), which will reduce the load on the target server.
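For example, in settings.py (2 seconds is just an illustrative value, tune it to the target server):

# settings.py
# Seconds to wait between consecutive requests to the same site; floats are fine.
DOWNLOAD_DELAY = 2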

Upvotes: 2
