vlad

Reputation: 855

How can I make callbacks in two sequential Requests with Scrapy?

If the spider gets a redirect, it should make the request again, but with different parameters. The callback of the second Request is never called.

If I use different URLs in the start and checker methods, it works fine. I think the requests are being loaded lazily, and that is why my code isn't working, but I'm not sure.

from scrapy.http import Request
from scrapy.spider import BaseSpider

class TestSpider(BaseSpider):

    def start(self, response):
        return Request(url='http://localhost/', callback=self.checker, meta={'dont_redirect': True})

    def checker(self, response):
        if response.status == 301:
            return Request(url="http://localhost/", callback=self.results, meta={'dont_merge_cookies': True})
        else:
            return self.results(response)

    def results(self, response):
        # here I work with the response
        pass

Upvotes: 1

Views: 2184

Answers (1)

IamnotBatman

Reputation: 382

Not sure if you still need this, but I have put together an example. If you have a specific website in mind, we can definitely take a look at it.

from scrapy.http import Request
from scrapy.spider import BaseSpider

class TestSpider(BaseSpider):

    name = "TEST"
    allowed_domains = ["example.com", "example.iana.org"]

    def __init__(self, **kwargs):
        super(TestSpider, self).__init__(**kwargs)
        self.url      = "http://www.example.com"
        self.max_loop = 3
        self.loop     = 0  # We want it to loop 3 times, so keep a counter on the spider

    def start_requests(self):
        # I'll write it out more explicitly here
        print "OPEN"                       
        checkRequest = Request( 
            url      = self.url, 
            meta     = {"test":"first"},
            callback = self.checker 
        )
        return [ checkRequest ]

    def checker(self, response):
        # I wasn't sure about a specific website that gives 302 
        # so I just used 200. We need the loop counter or it will keep going

        if self.loop < self.max_loop and response.status == 200:
            print "RELOOPING", response.status, self.loop, response.meta['test']
            self.loop += 1

            checkRequest = Request(
                url=self.url,
                callback=self.checker
            ).replace(meta={"test": "not first"})
            return [checkRequest]
        else:
            print "END LOOPING"
            self.results(response) # No need to return, just call method

    def results(self, response):
        print "DONE"  # Do stuff here

In settings.py, set this option

DUPEFILTER_CLASS = 'scrapy.dupefilter.BaseDupeFilter'

This is what actually turns off filtering of duplicate requests. It's confusing, because BaseDupeFilter is not the default even though it doesn't really filter anything; the default, RFPDupeFilter, silently drops repeated requests to the same URL. With it disabled, all three requests to the same URL are submitted and loop through the checker method. Also, I am using Scrapy 0.16:

>scrapy crawl TEST
>OPEN
>RELOOPING 200 0 first
>RELOOPING 200 1 not first
>RELOOPING 200 2 not first
>END LOOPING
>DONE
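If you would rather not swap the dupefilter globally in settings.py, Request also accepts a dont_filter argument, so a per-request alternative is possible. A minimal sketch of the looping request from checker, with everything else unchanged:

checkRequest = Request(
    url=self.url,
    meta={"test": "not first"},
    dont_filter=True,  # exempt just this request from the duplicate filter
    callback=self.checker
)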

Upvotes: 3
