Reputation: 68
I'm stuck on this. I want to get only the final URL of a redirected link. I don't need to visit each link in the redirect chain, I just need the last URL.
I have this code:
import requests

try:
    r = requests.head(link, headers={"User-Agent": "Mozilla/5.0"}, timeout=20, allow_redirects=True)
except requests.exceptions.RequestException as error:
    print(error)
else:
    # r is only defined when the request succeeded
    print(r.url)
With some URLs I get this message:
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='www.parafarmaciaweb.com', port=443): Max retries exceeded with url: /isdin-capsulas-solares-sun-defense-duplo-2x30-capsulas.html?gclid=EAIaIQobChMImL2c8Yzm6gIVSLTtCh0DIwyGEAkYCiABEgLNI_D_BwE (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7fc758ce9550>, 'Connection to www.parafarmaciaweb.com timed out. (connect timeout=20)'))
I just need the URL, and it is right there in the error: https://www.parafarmaciaweb.com/isdin-capsulas-solares-sun-defense-duplo-2x30-capsulas.html?gclid=EAIaIQobChMImL2c8Yzm6gIVSLTtCh0DIwyGEAkYCiABEgLNI_D_BwE
I could scrape the URL out of the error message, but there must be a way to get the URL directly and skip this error, mustn't there?
Any ideas how to get this URL without scraping it?
Thanks in advance.
Upvotes: 1
Views: 857
Reputation: 68
Rakesh Nair's answer didn't fully work in my case: each URL goes through several redirections, and that code only gave me the first one. I solved it with a two-level strategy. First I call requests.head with allow_redirects=True and let it resolve the whole chain. Only if that raises an exception do I fall back to Rakesh's version. The code below works like a charm for me.
Thanks Rakesh!
import re
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}

def resolveLink(link):
    """Return the resolved URL without its query string, or '' on failure."""
    try:
        # First attempt: let requests follow the whole redirect chain.
        r = requests.head(link, timeout=20, headers=HEADERS, allow_redirects=True)
        reply = r.url
    except requests.exceptions.RequestException:
        try:
            # Fallback: read only the first redirect target.
            r = requests.head(link, timeout=20, headers=HEADERS, allow_redirects=False)
            # The Location header is missing when the response is not a redirect.
            reply = r.headers.get('Location', '')
        except requests.exceptions.RequestException:
            return ''
    final_link = re.sub(r"\?.*$", "", reply)
    return final_link
The line:
final_link = re.sub(r"\?.*$", "", reply)
removes the entire query string (everything from the first "?") from the URL, which gets rid of tracking tokens such as:
gclid=EAIaIQobChMImL2c8Yzm6gIVSLTtCh0DIwyGEAkYCiABEgLNI_D_BwE
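Note that the regex throws away every query parameter, not only the tracking token. If some URLs need their other parameters, a more selective version could look like the sketch below. The function name strip_tracking and the TRACKING_PARAMS set are made up for this illustration, and the list of parameter names is an assumption to adjust for your own data.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters to drop; extend it as needed.
TRACKING_PARAMS = {"gclid"}


def strip_tracking(url, drop=TRACKING_PARAMS):
    """Return url with the parameters named in `drop` removed from its query."""
    parts = urlsplit(url)
    # Keep every (key, value) pair whose key is not a tracking parameter.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in drop
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Calling strip_tracking(reply) instead of re.sub keeps parameters such as a product id while still dropping gclid.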
Upvotes: 0
Reputation: 1439
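Set allow_redirects to False and read the redirect target from headers['Location']. This gives you only the first hop, and the header is present only when the response is actually a redirect.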
import requests
r = requests.head(link, allow_redirects=False)
print(r.status_code, r.headers['Location'])
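If the link goes through several redirects, the same idea can be repeated one hop at a time. The sketch below is only an illustration, not a tested drop-in: the function name last_known_url and the max_hops limit are my own choices, and it assumes every hop answers HEAD requests. Because each target URL is known before it is requested, a timeout on the last host still leaves you with its URL.

```python
from urllib.parse import urljoin

import requests


def last_known_url(link, session=None, max_hops=10, timeout=20):
    """Follow redirects hop by hop and return the last URL that was reached.

    If a hop fails (for example with a connect timeout), the URL that was
    being requested is returned instead of raising.
    """
    if session is None:
        session = requests.Session()
    url = link
    for _ in range(max_hops):
        try:
            r = session.head(
                url,
                headers={"User-Agent": "Mozilla/5.0"},
                timeout=timeout,
                allow_redirects=False,
            )
        except requests.exceptions.RequestException:
            # The host did not answer, but its URL is already known.
            return url
        location = r.headers.get("Location")
        if not r.is_redirect or not location:
            # Not a redirect: this is the end of the chain.
            return url
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return url
```

Usage would be print(last_known_url(link)). The max_hops limit is there so that a redirect loop cannot run forever.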
Upvotes: 1