Sergio P.

Reputation: 68

Python requests head, just get redirected url and don't follow redirections

I'm stuck on this. I want to get just the URL that a link redirects to. I don't need to follow the redirections and visit each link, I just need the final URL.

I have this code

import requests

try:
    r = requests.head(link, headers={"User-Agent": "Mozilla/5.0"}, timeout=20, allow_redirects=True)
except requests.exceptions.RequestException as error:
    print(error)
else:
    # Only reached when the request succeeded, so r is defined here
    print(r.url)

With some URLs I get this message:

requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='www.parafarmaciaweb.com', port=443): Max retries exceeded with url: /isdin-capsulas-solares-sun-defense-duplo-2x30-capsulas.html?gclid=EAIaIQobChMImL2c8Yzm6gIVSLTtCh0DIwyGEAkYCiABEgLNI_D_BwE (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7fc758ce9550>, 'Connection to www.parafarmaciaweb.com timed out. (connect timeout=20)'))

I just need the URL, and it is in the error: https://www.parafarmaciaweb.com/isdin-capsulas-solares-sun-defense-duplo-2x30-capsulas.html?gclid=EAIaIQobChMImL2c8Yzm6gIVSLTtCh0DIwyGEAkYCiABEgLNI_D_BwE

I could scrape the URL from the error, but there must be a way to just get the URL and skip this error, isn't there?

Any ideas on how to get this URL without scraping it?

Thanks in advance.

Upvotes: 1

Views: 857

Answers (2)

Sergio P.

Reputation: 68

Rakesh Nair's answer didn't fully work in my case, because each URL went through several redirections and that code only gave me the first one. I solved this with a two-level strategy: first I request with allow_redirects=True and let requests resolve the whole chain; only if that raises an exception do I fall back to Rakesh's version. The code below works like a charm for me.

Thanks Rakesh!

import re
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}

def resolveLink(link):
    """Return the resolved URL without its query string, or '' on failure."""
    try:
        # First try: let requests follow the whole redirect chain
        r = requests.head(link, timeout=20, headers=HEADERS, allow_redirects=True)
        reply = r.url
    except requests.exceptions.RequestException:
        try:
            # Fallback: read only the first redirect target
            r = requests.head(link, timeout=20, headers=HEADERS, allow_redirects=False)
            # No Location header means there was no redirect; keep the original link
            reply = r.headers.get('Location', link)
        except requests.exceptions.RequestException:
            return ''

    # Strip the query string (everything from "?" onwards)
    return re.sub(r"\?.*$", "", reply)

The line:

final_link = re.sub(r"\?.*$", "", reply)

removes the query string from the URL, and with it tracking tokens such as:

gclid=EAIaIQobChMImL2c8Yzm6gIVSLTtCh0DIwyGEAkYCiABEgLNI_D_BwE
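As an alternative to the regex, here is a minimal sketch that strips the query string with the standard library's urllib.parse. The helper name strip_query is just my own choice for illustration. Note that, unlike the regex, it also drops a fragment (#...) when the URL has no query string.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return url without its query string and fragment (hypothetical helper)."""
    parts = urlsplit(url)
    # Rebuild the URL from scheme, host and path only
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


cleaned = strip_query(
    "https://www.parafarmaciaweb.com/isdin-capsulas-solares-sun-defense-duplo-2x30-capsulas.html"
    "?gclid=EAIaIQobChMImL2c8Yzm6gIVSLTtCh0DIwyGEAkYCiABEgLNI_D_BwE"
)
print(cleaned)
```

This avoids hand-written escaping and behaves the same for URLs that have no query string at all.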

Upvotes: 0

javaDeveloper

Reputation: 1439

Set allow_redirects to False and read the redirect target from headers['Location']:

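import requests

# Do not follow the redirect, so the 3xx response itself is returned
r = requests.head(link, allow_redirects=False)
# Location is only present on redirect responses; .get() avoids a KeyError
print(r.status_code, r.headers.get('Location'))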
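If the link goes through several redirections, the same idea can be repeated one hop at a time, keeping the last URL you were told about. The sketch below is only an illustration under some assumptions: the names head_with_requests and last_known_url, the max_hops limit and the set of redirect status codes are my own choices, not something requests provides. When a host in the chain times out, the function returns the URL it was about to visit instead of raising.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def head_with_requests(url):
    """Send one HEAD request without following redirects.

    Returns (status_code, Location header or None).
    """
    import requests  # imported here so the hop logic can be tried offline

    r = requests.head(
        url,
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=20,
        allow_redirects=False,
    )
    return r.status_code, r.headers.get("Location")


def last_known_url(link, head=head_with_requests, max_hops=10):
    """Follow redirects hop by hop and return the last URL learned."""
    url = link
    for _ in range(max_hops):
        try:
            status, location = head(url)
        except OSError:
            # requests' RequestException derives from OSError. The host did
            # not answer, but url is still the latest redirect target.
            return url
        if status not in REDIRECT_CODES or not location:
            return url  # not a redirect, so this is the final URL
        url = urljoin(url, location)  # Location may be a relative path
    return url
```

Usage would be something like print(last_known_url(link)). The head parameter exists only so that the hop logic can be exercised with a fake function instead of real network calls.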

Upvotes: 1
