Reputation: 45
I have a huge list of URLs which redirect to different URLs. I am supplying them in a for loop from a list and trying to print the redirected URLs.
The first redirected URL prints fine, but from the second one onward, requests just stops giving me redirected URLs and prints the given URL instead.
I also tried implementing this with urllib, urllib2, and mechanize. They give the first redirected URL fine, then throw an error on the second one and stop.
Can anyone please let me know why this is happening?
Below is the pseudocode/implementation:
import requests

s = requests.Session()  # create the session once and reuse it
for given_url in url_list:
    print("Given URL: " + given_url)
    r = s.get(given_url, allow_redirects=True)  # use the session, not a bare requests.get
    redirected_url = r.url
    print("Redirected URL: " + redirected_url)
Output:
Given URL: www.xyz.com
Redirected URL: www.123456789.com
Given URL: www.abc.com
Redirected URL: www.abc.com
Given URL: www.pqr.com
Redirected URL: www.pqr.com
Upvotes: 1
Views: 1026
Reputation: 54984
Try a HEAD request; it won't follow redirects or download the entire body:
import requests

r = requests.head('http://www.google.com/')
print(r.headers['Location'])
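Note that the Location header is only present when the response is an actual redirect (3xx). A minimal sketch that guards for that, using one of the URLs from the question:
import requests

r = requests.head('http://www.xyz.com/')  # example URL from the question
if r.is_redirect:  # True only for 3xx responses that carry a Location header
    print(r.headers['Location'])
else:
    print('No redirect; status', r.status_code)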
Upvotes: 1
Reputation: 13458
There is nothing wrong with the code snippet you provided, but as you mentioned in the comments you are getting HTTP 400 and 401 responses. HTTP 401 means Unauthorized, which means the site is blocking you. HTTP 400 means Bad Request, which typically means the site doesn't understand your request, but it can also be returned when you are being blocked, which I suspect is the case here too.
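To confirm which responses you are getting, you can log the status code and redirect history for each URL; a rough sketch, reusing the url_list from the question:
import requests

s = requests.Session()
for given_url in url_list:
    r = s.get(given_url, allow_redirects=True)
    # r.history holds the intermediate redirect responses, if any
    print(given_url, '->', r.url,
          '(status {}, {} redirects)'.format(r.status_code, len(r.history)))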
When I run your code for the ABC website I get redirected properly, which leads me to believe they are blocking your IP address for sending too many requests in a short period of time and/or for having no User-Agent set.
Since you mentioned you can open the links correctly in a browser, you can try setting your User-Agent string to match that of a browser, but this is not guaranteed to work since it is only one of many parameters a site may use to detect whether you are a bot.
For example:
headers = {'User-Agent': 'Mozilla/5.0'}  # minimal browser-like User-Agent string
r = requests.get(url, headers=headers)
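Since rate limiting is also a suspect, pausing between requests may help as well. A rough sketch combining both ideas (the one-second delay is an arbitrary guess, not a known threshold):
import time
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
s = requests.Session()
for given_url in url_list:
    r = s.get(given_url, headers=headers, allow_redirects=True)
    print("Given URL: " + given_url)
    print("Redirected URL: " + r.url)
    time.sleep(1)  # arbitrary pause to avoid tripping rate limits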
Upvotes: 0