zenCoder

Reputation: 740

How to ignore already crawled URLs in Scrapy

I have a crawler that looks something like this:

from scrapy.http import Request

def parse(self, response):
    # ...
    yield Request(url=nextUrl, callback=self.parse2)

def parse2(self, response):
    # ...
    yield Request(url=nextUrl, callback=self.parse3)

def parse3(self, response):
    # ...

I want to add a rule so that a URL is ignored if it has already been crawled when parse2 is invoked, but keep the normal rule for parse3. I am still exploring the requests.seen file to see if I can manipulate it.

Upvotes: 1

Views: 2885

Answers (2)

flyer

Reputation: 9816

You can set the rule in settings.py. Refer to the documentation for the DUPEFILTER_CLASS setting:

Default: 'scrapy.dupefilter.RFPDupeFilter'

The class used to detect and filter duplicate requests.

The default (RFPDupeFilter) filters based on request fingerprint using the scrapy.utils.request.request_fingerprint function.
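
If the selective behaviour needs to live in one place, a custom dupe filter can subclass RFPDupeFilter. Below is a minimal sketch, not part of this answer: the module path myproject/dupefilters.py and the meta key skip_dupefilter are made up for illustration, and in newer Scrapy versions the import path is scrapy.dupefilters rather than scrapy.dupefilter.

# myproject/dupefilters.py (hypothetical module)
from scrapy.dupefilter import RFPDupeFilter  # scrapy.dupefilters in newer Scrapy versions

class SelectiveDupeFilter(RFPDupeFilter):
    """Skip duplicate filtering for requests that opt out via meta."""

    def request_seen(self, request):
        # Requests marked with meta['skip_dupefilter'] are never treated as duplicates;
        # everything else falls through to the standard fingerprint-based check.
        if request.meta.get('skip_dupefilter'):
            return False
        return super(SelectiveDupeFilter, self).request_seen(request)

Then point Scrapy at it in settings.py:

DUPEFILTER_CLASS = 'myproject.dupefilters.SelectiveDupeFilter'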

Upvotes: 2

Guy Gavriely

Reputation: 11396

Check out the dont_filter request parameter at http://doc.scrapy.org/en/latest/topics/request-response.html:

dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.
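
Applied to the spider from the question, that could look roughly like the sketch below; which request gets dont_filter=True depends on which callback should still receive already-crawled URLs, so treat the placement here as an assumption.

from scrapy.http import Request

def parse2(self, response):
    # ...
    # dont_filter=True bypasses the scheduler's duplicate filter for this
    # request only; requests yielded without it keep the default filtering.
    yield Request(url=nextUrl, callback=self.parse3, dont_filter=True)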

Upvotes: 1
