Reputation: 2766
I'm writing a script to find out which full URLs a large number of shortened URLs lead to. I'm using the requests module to follow redirects and get the URL one would end up at after entering the shortened URL in a browser. This works for almost all link shorteners, but fails for URLs from disq.us for reasons I can't figure out: for disq.us URLs I get back the same URL I entered, whereas a browser gets redirected.
Below is a snippet which correctly resolves a bit.ly-shortened link but fails with a disq.us-link. I run it with Python 3.6.4 and version 2.18.4 of the requests module. SO will not allow me to include shortened URLs in the question, so I'll leave those in a comment.
import requests
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
url1 = "SOME BITLY URL"
url2 = "SOME DISQ.US URL"
for url in [url1, url2]:
    s = requests.Session()
    s.headers['User-Agent'] = user_agent
    r = s.get(url, allow_redirects=True, timeout=10)
    print(r.url)
Upvotes: 4
Views: 8553
Reputation: 2634
Your first URL is a 404 for me. Interestingly, I just tried this with the second url and it worked, but I used a different user agent. Then I tried it with your user agent, and it isn't redirecting.
This suggests that the web server is doing something strange in response to that user agent string, and that the problem isn't with requests.
>>> import requests
>>> user_agent = 'foo'
>>> url = 'THE_DISCUS_URL'
>>> s = requests.Session()
>>> s.headers['User-Agent'] = user_agent
>>> r = s.get(url, allow_redirects=True, timeout=10)
>>> r.url
'https://www.elsevier.com/connect/could-dissolvable-microneedles-replace-injected-vaccines'
vs.
>>> import requests
>>> user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
>>> url = 'THE_DISCUS_URL'
>>> s = requests.Session()
>>> s.headers['User-Agent'] = user_agent
>>> r = s.get(url, allow_redirects=True, timeout=10)
>>> r.url
'THE_DISCUS_URL'
I got curious, so I investigated a little more. With the browser user agent, the actual content of the response is a noscript tag containing the link, plus some JavaScript that performs the redirect.
What's probably going on here is that if Disqus sees a real web browser user agent, it tries to redirect via JavaScript (and probably does a bunch of tracking in the process). requests doesn't execute JavaScript, so it never follows that redirect. On the other hand, if the user agent isn't familiar, the site assumes the visitor is a script, which probably can't run JavaScript, and just issues a normal HTTP redirect.
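If that explanation is right, one possible workaround is to send a non-browser user agent and, when no HTTP redirect happens anyway, fall back to pulling the link out of the noscript tag. The sketch below is only an illustration: the names NoscriptLinkParser, extract_noscript_link and resolve are made up for this example, and it assumes the fallback link is an ordinary <a href="..."> inside <noscript>, which I haven't verified against the real disq.us markup.

```python
from html.parser import HTMLParser


class NoscriptLinkParser(HTMLParser):
    """Collect href values of <a> tags that appear inside <noscript>."""

    def __init__(self):
        super().__init__()
        self._depth = 0  # greater than 0 while inside a <noscript> element
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self._depth += 1
        elif tag == "a" and self._depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "noscript" and self._depth > 0:
            self._depth -= 1


def extract_noscript_link(html):
    """Return the first link found inside <noscript>, or None."""
    parser = NoscriptLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links[0] if parser.links else None


def resolve(url, user_agent="foo", timeout=10):
    """Follow HTTP redirects; fall back to the <noscript> link if there were none."""
    # Imported here so the parsing helper above works without requests installed.
    import requests

    with requests.Session() as s:
        s.headers["User-Agent"] = user_agent
        r = s.get(url, allow_redirects=True, timeout=timeout)
        if r.history:  # at least one HTTP redirect was followed
            return r.url
        # Assumption: the page carries its target as <noscript><a href=...>.
        return extract_noscript_link(r.text) or r.url
```

Calling resolve('THE_DISCUS_URL') with the default user agent should take the plain-redirect path described above; the noscript fallback only matters if the server still answers with the JavaScript page.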
Upvotes: 6