xie

Reputation: 91

How can I get the first request URL after a 301/302 redirect?

I am using Scrapy (version 1.1.1) to scrape some data from the internet. This is what I am facing:

import codecs

import scrapy


class Link_Spider(scrapy.Spider):
    name = 'GetLink'
    allowed_domains = ['example_0.com']
    # read the start URLs from a file, one per line
    with codecs.open('link.txt', 'r', 'utf-8') as f:
        start_urls = [url.strip() for url in f.readlines()]

    def parse(self, response):
        print(response.url)

In the above code, 'start_urls' is a list:

start_urls = [
              'http://example_0.com/?id=0',
              'http://example_0.com/?id=1',
              'http://example_0.com/?id=2',
             ] # and so on

When Scrapy runs, the debug log shows:

[scrapy] DEBUG: Redirecting (302) to <GET https://example_1.com/?subid=poison_apple> from <GET http://example_0.com/?id=0>
[scrapy] DEBUG: Redirecting (301) to <GET https://example_1/ture_a.html> from <GET https://example_1.com/?subid=poison_apple>
[scrapy] DEBUG: Crawled (200) <GET https://example_1/ture_a.html> (referer: None)

Now, how can I know which URL of the form 'http://example_0.com/?id=***' in 'start_urls' is paired with the final URL 'https://example_1/ture_a.html'? Can anyone help me?

Upvotes: 0

Views: 517

Answers (2)

eLRuLL

Reputation: 18799

Extending the other answer: if you want to control every request without it being redirected automatically (because a redirect is an extra request), you can disable the RedirectMiddleware or just pass the dont_redirect meta parameter to the request. So, in this case:

import codecs

import scrapy
from scrapy import Request


class Link_Spider(scrapy.Spider):
    name = 'GetLink'
    allowed_domains = ['example_0.com']

    # you'll have to control the initial requests with `start_requests`
    # instead of declaring start_urls

    def start_requests(self):
        with codecs.open('link.txt', 'r', 'utf-8') as f:
            start_urls = [url.strip() for url in f.readlines()]
        for start_url in start_urls:
            yield Request(
                start_url,
                callback=self.parse_handle1,
                meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
            )

    def parse_handle1(self, response):
        # here you'll have to handle the first redirect yourself
        # remember that the redirected url is in the `Location` header
        # do something with the response.body, response.headers, etc.
        ...
        yield Request(
            # the Location header is bytes and may be relative,
            # so decode it and resolve it against the current URL
            response.urljoin(response.headers['Location'].decode('utf-8')),
            callback=self.parse_handle2,
            meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
        )

    def parse_handle2(self, response):
        # here you'll have to handle the second redirect yourself
        # do something with the response.body, response.headers, etc.
        ...
        yield Request(
            response.urljoin(response.headers['Location'].decode('utf-8')),
            callback=self.parse,
        )

    def parse(self, response):
        # actual last response
        print(response.url)
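
The other option mentioned above, disabling the RedirectMiddleware, can also be done per spider. A minimal sketch of that (assuming Scrapy's standard REDIRECT_ENABLED and HTTPERROR_ALLOWED_CODES settings) would be:

class Link_Spider(scrapy.Spider):
    name = 'GetLink'
    allowed_domains = ['example_0.com']

    # disable redirect handling for this spider only; 301/302 responses
    # are then delivered to your callbacks instead of being followed
    custom_settings = {
        'REDIRECT_ENABLED': False,
        'HTTPERROR_ALLOWED_CODES': [301, 302],
    }

With this in place, the dont_redirect / handle_httpstatus_list meta keys on every request are no longer needed.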

Upvotes: 1

Granitosaurus

Reputation: 21436

Every response has a request attached to it, so you can retrieve the original url from it:

def parse(self, response):
    print('original url:')
    print(response.request.url)
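
If the RedirectMiddleware has already followed the redirects, the chain of URLs it went through is also recorded in the redirect_urls meta key, so a sketch like this (assuming the default middleware is active, as in the question) recovers the first URL in the chain:

def parse(self, response):
    # redirect_urls holds the URLs the request passed through, in order;
    # fall back to the response URL when no redirect happened
    original_url = response.meta.get('redirect_urls', [response.url])[0]
    print('original url:')
    print(original_url)
    print('final url:')
    print(response.url)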

Upvotes: 0
