alvas

Reputation: 122052

How else can I strip the domain name from a URL with no additional library in Python?

The post "Get domain name from URL" suggested multiple libraries for getting the top-level domain, but how else can I strip the domain name from a URL with no additional library?

I tried a regex and it seems to work, but I am sure there are better ways of doing it, and there are plenty of URLs that will break the regex:

>>> import re
>>> url = "https://stackoverflow.com/questions/22143342/how-else-can-i-strip-a-domain-name-from-webpage-with-no-additional-library-pyt"
>>> domain = re.sub(r"^(https?://)?(www\.)?", "", url).split('/')[0]
>>> domain
'stackoverflow.com'
>>> url = "www.apple.com/itune"
>>> re.sub(r"^(https?://)?(www\.)?", "", url).split('/')[0]
'apple.com'

I've also tried urlparse, but it returns None when the URL has no scheme:

>>> from urlparse import urlparse
>>> url ='https://stackoverflow.com/questions/22143342/how-else-can-i-strip-a-domain-name-from-webpage-with-no-additional-library-pyt'
>>> urlparse(url).hostname
'stackoverflow.com'
>>> url = 'www.apple.com/itune'
>>> urlparse(url).hostname
>>> 

Upvotes: 0

Views: 114

Answers (2)

falsetru

Reputation: 369084

How about writing a function that wraps urlparse?

>>> from urlparse import urlparse
>>>
>>> def extract_hostname(url):
...     components = urlparse(url)
...     if not components.scheme:
...         components = urlparse('http://' + url)
...     return components.netloc
...
>>> extract_hostname('http://stackoverflow.com/questions/22143342')
'stackoverflow.com'
>>> extract_hostname('www.apple.com/itune')
'www.apple.com'
>>> extract_hostname('file:///usr/bin/python')
''
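For Python 3, the same idea could be sketched roughly as follows. This is only an illustration under a few assumptions: the name extract_domain and the strip_www parameter are made up for this example, the check for "://" is a simplification of the scheme test above, and dropping a leading "www." mirrors what the regex in the question does.

```python
from urllib.parse import urlparse


def extract_domain(url, strip_www=True):
    """Return the host part of url, tolerating a missing scheme.

    extract_domain and strip_www are hypothetical names for this sketch.
    """
    # Without a scheme and the '//' after it, urlparse treats the whole
    # string as a path, so prepend a default scheme first.
    if "://" not in url:
        url = "http://" + url
    # hostname is None when there is no network location (e.g. file:///...).
    host = urlparse(url).hostname or ""
    if strip_www and host.startswith("www."):
        host = host[len("www."):]
    return host
```

With this sketch, 'www.apple.com/itune' yields 'apple.com', and passing strip_www=False keeps the 'www.' prefix as in the wrapper above.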

Upvotes: 2

pbacterio

Reputation: 1152

Use urllib.parse from the standard library (in Python 3, the urlparse module was renamed to urllib.parse):

>>> from urllib.parse import urlparse
>>> url = 'http://stackoverflow.com/questions/22143342/how-else-can-i-strip-a-domain-name-from-webpage-with-no-additional-library-pyt'
>>> urlparse(url).hostname
'stackoverflow.com'
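On its own, urlparse still returns None as the hostname of a scheme-less string such as 'www.apple.com/itune', because it only recognises a network location that is introduced by '//'. One possible workaround, sketched here with a hypothetical helper named hostname_of, is to retry with '//' prepended when no scheme was found:

```python
from urllib.parse import urlparse


def hostname_of(url):
    """Return the hostname of url, or None if it has none.

    hostname_of is a hypothetical helper, not part of urllib.
    """
    parsed = urlparse(url)
    if not parsed.scheme:
        # No scheme found: prefix '//' so the leading part is parsed
        # as the network location instead of as a path.
        parsed = urlparse("//" + url)
    return parsed.hostname
```

Unlike the regex in the question, this keeps a leading 'www.', and it still returns None for URLs such as 'file:///usr/bin/python' that have a scheme but no host.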

Upvotes: 0
