Reputation: 3
I've found two answers to this question, but neither is working for me. Basically, I want to restrict the number of pages crawled per domain. Here's the code in the actual crawler:
def parse_page(self, response):
    visited_count.append(response.url.split('/')[2])
    if visited_count.count(response.url.split('/')[2]) > 49:
        print '!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'
        denied.append(response.url)
and the custom middleware:
class IgnoreDomain(object):
    def process_requests(request, spider):
        if request in spider.denied:
            return IgnoreRequest()
        else:
            return None
The middleware is, of course, enabled in the settings. I would really appreciate it if you could point out what I'm doing wrong.
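For reference, the settings entry looks roughly like the snippet below (the module path and the priority value here are just placeholders, not my exact project layout):

# settings.py -- module path and priority are placeholders, adjust to your project
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.IgnoreDomain': 543,
}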
Upvotes: 0
Views: 317
Reputation: 21351
You said: "I want to restrict the amount of pages crawled per domain"
...
To do this, create a counter in your spider:
import scrapy

class YourSpider(scrapy.Spider):
    counter = {}
    # counter will hold values like {'google': 4, 'website': 2} etc.
At the top of your middleware file, write this:
from scrapy.exceptions import IgnoreRequest
import tldextract
import logging

class YourMiddleware(object):
    def process_request(self, request, spider):
        domain = tldextract.extract(request.url)[1]
        logging.info(spider.counter)
        if domain not in spider.counter:
            pass  # first request for this domain, keep scraping this link
        else:
            if spider.counter[domain] > 5:
                raise IgnoreRequest()
            else:
                pass  # still under the limit, keep processing this request

    def process_response(self, request, response, spider):
        domain = tldextract.extract(request.url)[1]
        if domain not in spider.counter:
            spider.counter[domain] = 1
        else:
            spider.counter[domain] = spider.counter[domain] + 1
        return response
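If you want to sanity-check what the counter keys will look like, tldextract splits a URL into subdomain, domain and suffix, and index 1 is the bare domain name (the URL below is just an example):

import tldextract

# index 1 (the domain field) is what the middleware above uses as the counter key
parts = tldextract.extract('http://www.google.com/search?q=scrapy')
print(parts[1])  # -> 'google'

Also make sure YourMiddleware is listed in DOWNLOADER_MIDDLEWARES in your settings, otherwise neither hook will run.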
Upvotes: 1