Reputation: 91
I'm using Scrapy (version 1.1.1) to scrape some data from the internet. This is what I'm facing:
import codecs
import scrapy

class Link_Spider(scrapy.Spider):
    name = 'GetLink'
    allowed_domains = ['example_0.com']

    with codecs.open('link.txt', 'r', 'utf-8') as f:
        start_urls = [url.strip() for url in f.readlines()]

    def parse(self, response):
        print(response.url)
In the above code, 'start_urls' ends up as a list:

start_urls = [
    'http://example_0.com/?id=0',
    'http://example_0.com/?id=1',
    'http://example_0.com/?id=2',
]  # and so on
When Scrapy runs, the debug log tells me:
[scrapy] DEBUG: Redirecting (302) to (GET https://example_1.com/?subid=poison_apple) from (GET http://example_0.com/?id=0)
[scrapy] DEBUG: Redirecting (301) to (GET https://example_1/ture_a.html) from (GET https://example_1.com/?subid=poison_apple)
[scrapy] DEBUG: Crawled (200) (GET https://example_1/ture_a.html) (referer: None)
Now, how can I know which 'http://example_0.com/?id=***' URL in 'start_urls' maps to the final URL 'https://example_1/ture_a.html'? Can anyone help me?
Upvotes: 0
Views: 517
Reputation: 18799
Extending the other answer: if you want to control every request yourself without being redirected automatically (each redirect is an extra request), you can disable the RedirectMiddleware (for example by setting REDIRECT_ENABLED = False), or just pass the dont_redirect meta parameter on each request. So in this case:
import codecs
import scrapy
from scrapy import Request

class Link_Spider(scrapy.Spider):
    name = 'GetLink'
    allowed_domains = ['example_0.com']

    # you'll have to control the initial requests with `start_requests`
    # instead of declaring start_urls
    def start_requests(self):
        with codecs.open('link.txt', 'r', 'utf-8') as f:
            start_urls = [url.strip() for url in f.readlines()]
        for start_url in start_urls:
            yield Request(
                start_url,
                callback=self.parse_handle1,
                meta={'dont_redirect': True,
                      'handle_httpstatus_list': [301, 302]},
            )

    def parse_handle1(self, response):
        # here you'll have to handle the first redirect yourself;
        # the redirected URL is in the `Location` header (as bytes)
        # do something with response.body, response.headers, etc.
        ...
        yield Request(
            response.headers['Location'].decode('utf-8'),
            callback=self.parse_handle2,
            meta={'dont_redirect': True,
                  'handle_httpstatus_list': [301, 302]},
        )

    def parse_handle2(self, response):
        # here you'll have to handle the second redirect yourself
        # do something with response.body, response.headers, etc.
        ...
        yield Request(response.headers['Location'].decode('utf-8'),
                      callback=self.parse)

    def parse(self, response):
        # actual last response, after following both redirects manually
        print(response.url)
Upvotes: 1
Reputation: 21436
Every response has its request attached to it, so you can read the request's URL back:

def parse(self, response):
    print('request url:')
    print(response.request.url)

Note that once RedirectMiddleware has followed redirects, response.request.url is the final URL, not the start_urls entry; the middleware keeps the earlier URLs in response.meta['redirect_urls'].
Upvotes: 0