Reputation: 3999
I want to be able to take a shortened or non-shortened URL and return its un-shortened form. How can I write a Python program to do this?
Additional clarification:
e.g. bit.ly/silly in the input array should be google.com in the output array
e.g. google.com in the input array should be google.com in the output array
Upvotes: 23
Views: 31449
Reputation: 9
This is a very easy task; you just need four lines of code :)
import requests
url = input('Enter url : ')
site = requests.get(url)
print(site.url)
Just run this code and it will print the unshortened URL.
Upvotes: 0
Reputation: 15642
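You can use geturl():
from urllib.request import urlopen
# urlopen needs an explicit scheme; a bare "bit.ly/silly" raises ValueError
url = "http://bit.ly/silly"
unshortened_url = urlopen(url).geturl()
print(unshortened_url)
# prints the final URL (google.com in the question's example)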
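Note that urlopen raises ValueError for a URL with no scheme, and the question's inputs (such as bit.ly/silly) have none. Here is a minimal sketch of handling that for a whole list, assuming http:// is an acceptable default. The names ensure_scheme and unshorten_all are hypothetical helpers for illustration, not part of urllib:

```python
from urllib.request import urlopen


def ensure_scheme(url, default="http"):
    """Prefix scheme-less input such as 'bit.ly/silly' with a default scheme."""
    return url if "://" in url else f"{default}://{url}"


def unshorten_all(urls, timeout=10):
    """Return the final URL for each input, following redirects via urlopen."""
    results = []
    for url in urls:
        full = ensure_scheme(url)
        try:
            with urlopen(full, timeout=timeout) as response:
                results.append(response.geturl())
        except OSError:
            # URLError and HTTPError are OSError subclasses;
            # on failure, keep the (scheme-prefixed) input as-is.
            results.append(full)
    return results
```

Calling unshorten_all(["bit.ly/silly", "google.com"]) would return one resolved URL per input, in the same order. Falling back to the input on errors is a design choice for this sketch; you may prefer to raise or return None instead.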
Upvotes: 1
Reputation: 4284
If you are using Python 3.5+ you can use the Unshortenit module that makes this very easy:
from unshortenit import UnshortenIt
unshortener = UnshortenIt()
uri = unshortener.unshorten('https://href.li/?https://example.com')
Upvotes: 5
Reputation: 247
To unshorten a URL, you can use requests. This is a simple solution that works for me.
import requests
url = "http://foo.com"
site = requests.get(url)
print(site.url)
Upvotes: 4
Reputation: 6877
Unshorten.me has an API that lets you send a JSON or XML request and get the full URL returned.
Upvotes: 5
Reputation: 61
Here is source code that takes most of the useful corner cases into account.
The source is on GitHub at https://github.com/amirkrifa/UnShortenUrl
Comments are welcome.
# Written for Python 2; on Python 3 use urllib.parse and http.client instead.
import logging
import traceback
import urlparse
import httplib
logging.basicConfig(level=logging.DEBUG)
TIMEOUT = 10
class UnShortenUrl:
    def process(self, url, previous_url=None):
        logging.info('Init url: %s' % url)
        try:
            parsed = urlparse.urlparse(url)
            if parsed.scheme == 'https':
                h = httplib.HTTPSConnection(parsed.netloc, timeout=TIMEOUT)
            else:
                h = httplib.HTTPConnection(parsed.netloc, timeout=TIMEOUT)
            # An empty path is not a valid request target, so fall back to '/'.
            resource = parsed.path or '/'
            if parsed.query != "":
                resource += "?" + parsed.query
            try:
                h.request('HEAD', resource,
                          headers={'User-Agent': 'curl/7.38.0'})
                response = h.getresponse()
            except Exception:
                # The request failed: print the traceback and return the URL unchanged.
                traceback.print_exc()
                return url
            logging.info('Response status: %d' % response.status)
            if response.status // 100 == 3 and response.getheader('Location'):
                red_url = response.getheader('Location')
                logging.info('Red, previous: %s, %s' % (red_url, previous_url))
                # Stop if the redirect points straight back at the previous URL.
                if red_url == previous_url:
                    return red_url
                return self.process(red_url, previous_url=url)
            else:
                return url
        except Exception:
            traceback.print_exc()
            return None
Upvotes: 1
Reputation: 1943
Using requests:
import requests
url = "http://bit.ly/cXEInp"  # any shortened or regular URL
session = requests.Session() # so connections are recycled
resp = session.head(url, allow_redirects=True)
print(resp.url)
Upvotes: 34
Reputation: 4109
http://github.com/stef/urlclean
sudo pip install urlclean
import urlclean
urlclean.unshorten(url)
Upvotes: 1
Reputation: 400692
Send an HTTP HEAD request to the URL and look at the response code. If the code is 30x, look at the Location header to get the unshortened URL. Otherwise, if the code is 20x, then the URL is not redirected; you probably also want to handle error codes (4xx and 5xx) in some fashion. For example:
# This is for Py2k. For Py3k, use http.client and urllib.parse instead, and
# use // instead of / for the division
import httplib
import urlparse
def unshorten_url(url):
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path)
    response = h.getresponse()
    if response.status/100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    else:
        return url
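For Python 3, the same HEAD-and-Location idea could be sketched roughly as follows. This is an illustrative variant rather than the answer's original code: it also follows chains of redirects, resolves relative Location values, and uses HTTPS when the URL asks for it. The helper name head and the max_hops and fetch parameters are assumptions introduced here (fetch exists only so the redirect logic can be exercised without a network):

```python
import http.client
from urllib.parse import urljoin, urlsplit


def head(url, timeout=10):
    """Send one HEAD request; return (status, Location header or None)."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def unshorten_url(url, max_hops=10, fetch=head):
    """Follow 3xx Location headers until a non-redirect response or max_hops."""
    for _ in range(max_hops):
        status, location = fetch(url)
        if status // 100 != 3 or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return url
```

The max_hops cap guards against redirect loops. Error codes (4xx and 5xx) are simply treated as "not a redirect" here, so the last URL reached is returned; connection errors propagate to the caller.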
Upvotes: 40
Reputation: 49063
Open the URL and see what it resolves to (Python 2 session shown):
>>> import urllib2
>>> a = urllib2.urlopen('http://bit.ly/cXEInp')
>>> print a.url
http://www.flickr.com/photos/26432908@N00/346615997/sizes/l/
>>> a = urllib2.urlopen('http://google.com')
>>> print a.url
http://www.google.com/
Upvotes: 4