Minions

Reputation: 5477

Expand short urls in python using requests library

I have a large number of short URLs that I want to expand. I found the following code somewhere online (I've lost the source):

import requests

short_url = "https://t.co/NHBbLlfCaa"  # requests needs an explicit scheme
r = requests.get(short_url)
if r.status_code == 200:
    print("Actual url:%s" % r.url)

It works perfectly. But I get this error when I hit the same server many times:

urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='www.fatlossadvice.pw', port=80): Max retries exceeded with url: /TIPS/KILLED-THAT-TREADMILL-WORKOUT-WORD-TO-TIMMY-GACQUIN.ASP (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))

I tried many solutions, like the ones in Max retries exceeded with URL in requests, but nothing worked.
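For reference, the retry-based approach amounts to mounting a urllib3 `Retry` policy on a `requests.Session` (a sketch; the retry count, backoff factor, and status codes below are my own choices, not from any particular answer). Note that retries cannot fix a permanently dead DNS name like the one in the traceback, so the `ConnectionError` still has to be caught:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(retries=3, backoff=0.5):
    """Build a Session that retries transient failures with backoff."""
    session = requests.Session()
    retry = Retry(total=retries, backoff_factor=backoff,
                  status_forcelist=[429, 500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

session = make_session()
# A dead host still fails after all retries, so skip it instead of crashing:
# try:
#     r = session.get("https://t.co/NHBbLlfCaa", timeout=10)
#     print(r.url)
# except requests.exceptions.ConnectionError as e:
#     print("skipping:", e)
```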

I was thinking about another solution: passing a User-Agent header with each request, chosen randomly from a large list of user agents:

user_agent_list = [
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:25.0) Gecko/20100101 Firefox/25.0',
        'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0',
        'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36',
    ]

import random

r = requests.get(short_url, headers={'User-Agent': random.choice(user_agent_list)})
if r.status_code == 200:
    print("Actual url:%s" % r.url)

My problem is that r.url always returns the short URL instead of the long (expanded) one.

What am I missing?

Upvotes: 0

Views: 1104

Answers (1)

Andrej Kesely

Reputation: 195438

You can prevent the error by passing allow_redirects=False to requests.get(), so requests doesn't follow the redirect to a page that no longer exists (which is what raises the error). You then read the Location header the server sends yourself (replace XXXX with https and remove the spaces):

import requests

short_url = ["XXXX t.co /namDL4YHYu",
             "XXXX t.co /MjvmV",
             "XXXX t.co /JSjtxfaxRJ",
             "XXXX t.co /xxGSANSE8K",
             "XXXX t.co /ZRhf5gWNQg"]

for url in short_url:
    # don't follow the redirect; just look at the Location header it returns
    r = requests.get(url, allow_redirects=False)
    try:
        print(url, r.headers['location'])
    except KeyError:
        print(url, "Page doesn't exist!")

Prints:

XXXX t.co/namDL4YHYu http://gottimechillinaround.tumblr.com/post/133931725110/tip-672
XXXX t.co/MjvmV Page doesn't exist!
XXXX t.co/JSjtxfaxRJ http://www.youtube.com/watch?v=rE693eNyyss
XXXX t.co/xxGSANSE8K http://www.losefattips.pw/Tips/My-stretch-before-and-after-my-workout-is-just-as-important-to-me-as-my-workout.asp
XXXX t.co/ZRhf5gWNQg http://www.youtube.com/watch?v=3OK1P9GzDPM

Upvotes: 1
