Reputation: 3
I've found two answers to this question, but neither is working for me. Basically, I want to restrict the number of pages crawled per domain. Here's the code in the actual crawler:
def parse_page(self, response):
    visited_count.append(response.url.split('/')[2])
    if visited_count.count(response.url.split('/')[2]) > 49:
        print '!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'
        denied.append(response.url)
and the custom middleware:
class IgnoreDomain(object):
    def process_requests(request, spider):
        if request in spider.denied:
            return IgnoreRequest()
        else:
            return None
The middleware is, of course, enabled in the settings. I would really appreciate it if you could point out what I'm doing wrong.
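For reference, the settings entry looks roughly like the snippet below (the module path and the priority value here are just placeholders, not my exact project layout):

# settings.py -- module path and priority are placeholders, adjust to your project
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.IgnoreDomain': 543,
}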
Upvotes: 0
Views: 317
Reputation: 21351
You said: "I want to restrict the amount of pages crawled per domain"
...
To do this, create a counter in your spider:
import scrapy

class YourSpider(scrapy.Spider):
    counter = {}
    # counter will hold values like {'google': 4, 'website': 2} etc.
At the top of your middleware file, write this:
from scrapy.exceptions import IgnoreRequest
import tldextract
import logging

class YourMiddleware(object):
    def process_request(self, request, spider):
        domain = tldextract.extract(request.url)[1]
        logging.info(spider.counter)
        if domain not in spider.counter:
            pass  # first request for this domain, keep scraping this link
        else:
            if spider.counter[domain] > 5:
                raise IgnoreRequest()
            else:
                pass  # still under the limit, keep processing this request

    def process_response(self, request, response, spider):
        domain = tldextract.extract(request.url)[1]
        if domain not in spider.counter:
            spider.counter[domain] = 1
        else:
            spider.counter[domain] = spider.counter[domain] + 1
        return response
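If you want to sanity-check what the counter keys will look like, tldextract splits a URL into subdomain, domain and suffix, and index 1 is the bare domain name (the URL below is just an example):

import tldextract

# index 1 (the domain field) is what the middleware above uses as the counter key
parts = tldextract.extract('http://www.google.com/search?q=scrapy')
print(parts[1])  # -> 'google'

Also make sure YourMiddleware is listed in DOWNLOADER_MIDDLEWARES in your settings, otherwise neither hook will run.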
Upvotes: 1