Reputation: 45
I have a huge list of URLs which redirect to different URLs. I am supplying them in a for loop from a list and trying to print the redirected URLs.
The first redirected URL prints fine, but from the second one onward, requests just stops giving me redirected URLs and prints the given URL instead.
I also tried implementing this with urllib, urllib2, and mechanize. They give the first redirected URL fine, then throw an error on the second one and stop.
Can anyone please let me know why this is happening?
Below is the pseudocode/implementation:
import requests

s = requests.Session()  # create the session once and reuse it
for given_url in url_list:
    print("Given URL: " + given_url)
    r = s.get(given_url, allow_redirects=True)  # use the session, not a bare requests.get
    redirected_url = r.url
    print("Redirected URL: " + redirected_url)
Output:
Given URL: www.xyz.com
Redirected URL: www.123456789.com
Given URL: www.abc.com
Redirected URL: www.abc.com
Given URL: www.pqr.com
Redirected URL: www.pqr.com
Upvotes: 1
Views: 1026
Reputation: 54984
Try a HEAD request; it won't follow redirects or download the entire body:
import requests

r = requests.head('http://www.google.com/')
print(r.headers['Location'])
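Note that the Location header is only present when the response is an actual redirect (3xx). A minimal sketch that guards for that, using one of the URLs from the question:
import requests

r = requests.head('http://www.xyz.com/')  # example URL from the question
if r.is_redirect:  # True only for 3xx responses that carry a Location header
    print(r.headers['Location'])
else:
    print('No redirect; status', r.status_code)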
Upvotes: 1
Reputation: 13458
There is nothing wrong with the code snippet you provided, but as you mentioned in the comments you are getting HTTP 400 and 401 responses. HTTP 401 means Unauthorized, which means the site is blocking you. HTTP 400 means Bad Request, which typically means the site doesn't understand your request, but it can also be returned when you are being blocked, which I suspect is the case here too.
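To confirm which responses you are getting, you can log the status code and redirect history for each URL; a rough sketch, reusing the url_list from the question:
import requests

s = requests.Session()
for given_url in url_list:
    r = s.get(given_url, allow_redirects=True)
    # r.history holds the intermediate redirect responses, if any
    print(given_url, '->', r.url,
          '(status {}, {} redirects)'.format(r.status_code, len(r.history)))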
When I run your code for the ABC website I get redirected properly, which leads me to believe they are blocking your IP address for sending too many requests in a short period of time and/or for having no User-Agent set.
Since you mentioned you can open the links correctly in a browser, you can try setting your User-Agent string to match that of a browser, but this is not guaranteed to work since it is only one of many parameters a site may use to detect whether you are a bot.
For example:
headers = {'User-Agent': 'Mozilla/5.0'}  # minimal browser-like User-Agent string
r = requests.get(url, headers=headers)
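Since rate limiting is also a suspect, pausing between requests may help as well. A rough sketch combining both ideas (the one-second delay is an arbitrary guess, not a known threshold):
import time
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
s = requests.Session()
for given_url in url_list:
    r = s.get(given_url, headers=headers, allow_redirects=True)
    print("Given URL: " + given_url)
    print("Redirected URL: " + r.url)
    time.sleep(1)  # arbitrary pause to avoid tripping rate limits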
Upvotes: 0