Reputation: 122052
The post "Get domain name from URL" suggested several libraries for getting the top-level domain, but
how else can I extract the domain name from a URL without any additional library?
I tried it with a regex. It seems to work, but I am sure there are better ways of doing it, and there are lots of URLs that will break the regex:
>>> import re
>>> url = "https://stackoverflow.com/questions/22143342/how-else-can-i-strip-a-domain-name-from-webpage-with-no-additional-library-pyt"
>>> domain = re.sub("(https?://|www\\.)","",url).split('/')[0]
>>> domain
'stackoverflow.com'
>>> url = "www.apple.com/itune"
>>> re.sub("(https?://|www\\.)","",url).split('/')[0]
'apple.com'
I've also tried urlparse, but it returns None when the URL has no scheme:
>>> from urlparse import urlparse
>>> url ='https://stackoverflow.com/questions/22143342/how-else-can-i-strip-a-domain-name-from-webpage-with-no-additional-library-pyt'
>>> urlparse(url).hostname
'stackoverflow.com'
>>> url = 'www.apple.com/itune'
>>> urlparse(url).hostname
>>>
Upvotes: 0
Views: 114
Reputation: 369084
How about writing a function that wraps urlparse?
>>> from urlparse import urlparse
>>>
>>> def extract_hostname(url):
...     components = urlparse(url)
...     if not components.scheme:
...         components = urlparse('http://' + url)
...     return components.netloc
...
>>> extract_hostname('http://stackoverflow.com/questions/22143342')
'stackoverflow.com'
>>> extract_hostname('www.apple.com/itune')
'www.apple.com'
>>> extract_hostname('file:///usr/bin/python')
''
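The reason the retry is needed: when the string has no scheme and no leading '//', urlparse treats the whole thing as a path, so netloc is empty and hostname is None. Below is a minimal Python 3 sketch of the same wrapper. It assumes 'http://' is an acceptable default scheme and returns .hostname instead of .netloc, which is my own variation so that a port or user info is dropped and the host is lowercased.

```python
from urllib.parse import urlparse


def extract_hostname(url):
    """Return the lowercased host of `url`, or '' if it has none."""
    components = urlparse(url)
    if not components.scheme:
        # With no scheme, urlparse puts everything into .path,
        # so retry with an assumed default scheme.
        components = urlparse('http://' + url)
    # .hostname is None when there is no network location (e.g. file:///...)
    return components.hostname or ''


if __name__ == '__main__':
    for candidate in ('http://stackoverflow.com/questions/22143342',
                      'www.apple.com/itune',
                      'file:///usr/bin/python'):
        print(repr(candidate), '->', repr(extract_hostname(candidate)))
```

Note that this sketch does not try to handle scheme-less input that contains a port, such as host:port/path.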
Upvotes: 2
Reputation: 1152
Use the urllib.parse module from the standard library (in Python 3, urlparse lives there instead of in the urlparse module).
>>> from urllib.parse import urlparse
>>> url = 'http://stackoverflow.com/questions/22143342/how-else-can-i-strip-a-domain-name-from-webpage-with-no-additional-library-pyt'
>>> urlparse(url).hostname
'stackoverflow.com'
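On its own this still returns None for a scheme-less string like 'www.apple.com/itune'. A rough sketch of one way around that, assuming you also want the leading 'www.' removed as in the question's regex: prefix '//' so that urlparse treats what follows as the network location. The helper name bare_domain is made up for this example.

```python
from urllib.parse import urlparse


def bare_domain(url):
    """Hypothetical helper: host of `url` without a leading 'www.'."""
    if '://' not in url:
        # A leading '//' tells urlparse that a network location follows.
        url = '//' + url
    host = urlparse(url).hostname or ''
    if host.startswith('www.'):
        host = host[len('www.'):]
    return host


if __name__ == '__main__':
    print(bare_domain('www.apple.com/itune'))
    print(bare_domain('http://stackoverflow.com/questions/22143342'))
```

This only strips 'www.' and does not work out the registered domain for hosts like 'news.bbc.co.uk'. That needs a list of public suffixes, which is what the libraries in the linked post provide.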
Upvotes: 0