Reputation: 8376
I'm collecting tweets from the Twitter API. Tweets often contain shortened URLs, so it's really important to get the actual URL they point to.
For example, for http://t.co/3hwXTqmktt, which redirects to http://www.animalpolitico.com/2014/04/304037/#axzz2yETrXxui, I need to obtain animalpolitico.com
The most important thing is to get the domain, so if I have for example:
http://news.example.com
http://blog.example.com/eeaWdada5das
http://example.com/ewdaD585Jz
I want to obtain example.com for each.
I guess something like curl for Python would help. How can I achieve this?
Upvotes: 1
Views: 670
Reputation: 474281
To extract the domain name from a URL, besides urlparse, you can use the tldextract module:
>>> import tldextract
>>> urls = ['http://news.example.com',
...         'http://blog.example.com/eeaWdada5das',
...         'http://example.com/ewdaD585Jz']
>>> for url in urls:
...     data = tldextract.extract(url)
...     print('{0}.{1}'.format(data.domain, data.suffix))
...
example.com
example.com
example.com
UPD (example for com.mx):
>>> data = tldextract.extract('http://example.com.mx')
>>> print('{0}.{1}'.format(data.domain, data.suffix))
example.com.mx
Upvotes: 2
Reputation: 7926
This applies to Twitter and t.co links specifically, but tweet objects retrieved through the API have what are called entities attached to them. You'll find the original, expanded version of all URLs contained in a tweet in these entities. For more info, see: https://dev.twitter.com/docs/entities
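As a rough sketch of reading those entities: in the tweet JSON, entities.urls is a list of objects that each carry the shortened url and its expanded_url. The helper name expanded_urls and the sample tweet below are made up for illustration, not real API output. Also note that expanded_url may itself point at another shortener, so following redirects can still be needed.

```python
def expanded_urls(tweet):
    """Return the expanded URLs listed in a tweet's entities.

    `tweet` is assumed to be the tweet JSON already parsed into a dict.
    """
    urls = tweet.get("entities", {}).get("urls", [])
    return [u["expanded_url"] for u in urls if u.get("expanded_url")]


# Hand-written sample for illustration only, not real API output.
sample_tweet = {
    "text": "Nota: http://t.co/3hwXTqmktt",
    "entities": {
        "urls": [
            {
                "url": "http://t.co/3hwXTqmktt",
                "expanded_url": "http://www.animalpolitico.com/2014/04/304037/",
            }
        ]
    },
}

print(expanded_urls(sample_tweet))
```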
Upvotes: 1
Reputation: 18978
You might want to look into the requests library.
>>> import requests
>>> r = requests.get('http://t.co/3hwXTqmktt')
>>> r.url
u'http://www.animalpolitico.com/2014/04/304037/#axzz2yETrXxui'
Now that you have the final URL, you can use urlparse to get the components you need.
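The urlparse step might look something like this. It is a minimal sketch for Python 3, where the module lives at urllib.parse (in Python 2 it is the urlparse module). The helper name naive_domain is hypothetical, and it only strips a leading "www.", so subdomains such as news.example.com are left untouched; tldextract handles those cases.

```python
from urllib.parse import urlparse


def naive_domain(url):
    """Return the host part of `url`, minus a leading 'www.' if present.

    This does not know about public suffixes, so it cannot reduce
    'news.example.com' to 'example.com'.
    """
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


print(naive_domain("http://www.animalpolitico.com/2014/04/304037/#axzz2yETrXxui"))
print(naive_domain("http://news.example.com"))
```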
Upvotes: 4