Reputation: 855
If the spider gets redirect, then it should do request again, but with different parameters. The callback in second Request is not performed.
If I use different urls
in start
and checker
methods, it's works fine. I think requests are using lazy loads
and this is why my code isn't working, but not sure.
from scrapy.http import Request
from scrapy.spider import BaseSpider
class TestSpider(BaseSpider):
def start(self, response):
return Request(url = 'http://localhost/', callback=self.checker, meta={'dont_redirect': True})
def checker(self, response):
if response.status == 301:
return Request(url = "http://localhost/", callback=self.results, meta={'dont_merge_cookies': True})
else:
return self.results(response)
def results(self, response):
# here I work with response
Upvotes: 1
Views: 2184
Reputation: 382
Not sure if you still need this but I have put together an example. If you have a specific website in mind, we can all definitely take a look at it.
from scrapy.http import Request
from scrapy.spider import BaseSpider
class TestSpider(BaseSpider):
name = "TEST"
allowed_domains = ["example.com", "example.iana.org"]
def __init__(self, **kwargs):
super( TestSpider, self ).__init__(**kwargs)\
self.url = "http://www.example.com"
self.max_loop = 3
self.loop = 0 # We want it to loop 3 times so keep a class var
def start_requests(self):
# I'll write it out more explicitly here
print "OPEN"
checkRequest = Request(
url = self.url,
meta = {"test":"first"},
callback = self.checker
)
return [ checkRequest ]
def checker(self, response):
# I wasn't sure about a specific website that gives 302
# so I just used 200. We need the loop counter or it will keep going
if(self.loop<self.max_loop and response.status==200):
print "RELOOPING", response.status, self.loop, response.meta['test']
self.loop += 1
checkRequest = Request(
url = self.url,
callback = self.checker
).replace(meta = {"test":"not first"})
return [checkRequest]
else:
print "END LOOPING"
self.results(response) # No need to return, just call method
def results(self, response):
print "DONE" # Do stuff here
In settings.py, set this option
DUPEFILTER_CLASS = 'scrapy.dupefilter.BaseDupeFilter'
This is actually what turns off the filter for duplicate site requests. It's confusing because the BaseDupeFilter is not actually the default since it doesn't really filter anything. This means we will submit 3 different requests that will loop through the checker method. Also, I am using scrapy 0.16:
>scrapy crawl TEST
>OPEN
>RELOOPING 200 0 first
>RELOOPING 200 1 not first
>RELOOPING 200 2 not first
>END LOOPING
>DONE
Upvotes: 3