diegoaguilar

Reputation: 8376

How to crawl shortened URLs and get the actual domain in Python?

I'm catching tweets from the Twitter API, and many of them contain shortened URLs, so it's really important to get the actual URL they lead to.

For example, http://t.co/3hwXTqmktt redirects to http://www.animalpolitico.com/2014/04/304037/#axzz2yETrXxui, and from that I need to obtain animalpolitico.com.

The most important thing is to get the domain, so if I have for example:

http://news.example.com 

http://blog.example.com/eeaWdada5das

http://example.com/ewdaD585Jz

I should obtain example.com for each.

I guess some curl equivalent for Python would help. How can I achieve this?

Upvotes: 1

Views: 670

Answers (3)

alecxe

Reputation: 474281

To extract the domain name from a URL, besides urlparse, you can use the tldextract module:

>>> import tldextract
>>> urls = ['http://news.example.com', 
            'http://blog.example.com/eeaWdada5das', 
            'http://example.com/ewdaD585Jz']
>>> for url in urls:
...     data = tldextract.extract(url)
...     print '{0}.{1}'.format(data.domain, data.suffix)
... 
example.com
example.com
example.com

UPD (example for com.mx):

>>> data = tldextract.extract('http://example.com.mx')
>>> print '{0}.{1}'.format(data.domain, data.suffix)
example.com.mx

Upvotes: 2

James Scholes

Reputation: 7926

This applies to Twitter and t.co links specifically, but tweet objects retrieved through the API have what are called entities attached to them. You'll find the original, expanded version of all URLs contained in a tweet in these entities. For more info, see: https://dev.twitter.com/docs/entities
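
As a rough sketch of what that looks like, the tweet dict below stands in for the parsed JSON of a single status, trimmed down to the relevant fields:

tweet = {
    'text': 'Some tweet text http://t.co/3hwXTqmktt',
    'entities': {
        'urls': [{
            'url': 'http://t.co/3hwXTqmktt',
            'expanded_url': 'http://www.animalpolitico.com/2014/04/304037/#axzz2yETrXxui',
        }],
    },
}

# Each entry in entities -> urls pairs the t.co wrapper with the original URL,
# so no extra HTTP request is needed to expand it.
for url_entity in tweet['entities']['urls']:
    print url_entity['expanded_url']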

Upvotes: 1

metatoaster

Reputation: 18978

You might want to look into the requests library.

>>> import requests
>>> r = requests.get('http://t.co/3hwXTqmktt')
>>> r.url
u'http://www.animalpolitico.com/2014/04/304037/#axzz2yETrXxui'

Now that you have the final URL, you can use urlparse to get the components you need.
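
Continuing the session above, a quick sketch (Python 2 style to match the output above; on Python 3 the import would come from urllib.parse instead):

>>> from urlparse import urlparse
>>> urlparse(r.url).netloc
u'www.animalpolitico.com'

Note that netloc still carries the www. subdomain; to reduce it to the bare registered domain (animalpolitico.com), something like the tldextract approach from the answer above can be applied on top.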

Upvotes: 4
