Reputation: 728
I can be given a string in any of these formats:
url: e.g. http://www.acme.com:456
string: e.g. www.acme.com:456, www.acme.com 456, or www.acme.com
I would like to extract the host and, if present, the port. If the port is not present, I would like it to default to 80.
I have tried urlparse, which works fine for the url, but not for the other formats. When I use urlparse on hostname:port, for example, it puts the hostname in the scheme rather than in netloc.
I would be happy with a solution that uses urlparse and a regex, or a single regex that could handle both formats.
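The behaviour described above can be reproduced with a short sketch, assuming Python 3 (where urlparse lives in urllib.parse). The variable names are only illustrative. Exactly where the scheme-less string ends up (scheme or path) has differed between Python versions, so the sketch only checks that netloc stays empty:

```python
from urllib.parse import urlparse

# With a scheme, the host and port land in netloc as expected.
with_scheme = urlparse("http://www.acme.com:456")
print(with_scheme.netloc)    # www.acme.com:456
print(with_scheme.hostname)  # www.acme.com
print(with_scheme.port)      # 456

# Without a scheme (and without a leading "//"), netloc is left empty,
# so hostname and port are not available.
without_scheme = urlparse("www.acme.com:456")
print(repr(without_scheme.netloc))  # ''
print(without_scheme.hostname)      # None
```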
Upvotes: 32
Views: 52480
Reputation: 1340
Method using urllib -
from urllib.parse import urlparse
url = 'https://stackoverflow.com/questions'
print(urlparse(url))
Output -
ParseResult(scheme='https', netloc='stackoverflow.com', path='/questions', params='', query='', fragment='')
Reference - https://www.tutorialspoint.com/urllib-parse-parse-urls-into-components-in-python
Upvotes: 5
Reputation: 1736
>>> from urllib.parse import urlparse  # Python 3; on Python 2 use: from urlparse import urlparse
>>> aaa = urlparse('http://www.acme.com:456')
>>> aaa.hostname
'www.acme.com'
>>> aaa.port
456
>>>
Upvotes: 20
Reputation: 10363
You can use urlparse to get hostname from URL string:
from urllib.parse import urlparse  # Python 3; on Python 2 use: from urlparse import urlparse
print(urlparse("http://www.website.com/abc/xyz.html").hostname)  # prints www.website.com
Upvotes: 57
Reputation: 2113
I'm not that familiar with urlparse, but using regex you'd do something like:
import re
p = r'(?:http.*://)?(?P<host>[^:/ ]+).?(?P<port>[0-9]*).*'
m = re.search(p, 'http://www.abc.com:123/test')
m.group('host') # 'www.abc.com'
m.group('port') # '123'
Or, without port:
m = re.search(p,'http://www.abc.com/test')
m.group('host') # 'www.abc.com'
m.group('port') # '' i.e. you'll have to treat this as '80'
EDIT: fixed regex to also match 'www.abc.com 123'
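To get the default of 80 without checking for an empty string each time, the regex can be wrapped in a small helper. What follows is a minimal sketch rather than a definitive solution: the names HOST_PORT and parse_host_port are hypothetical, and the pattern is a stricter variant of the one above, not the same pattern. It only accepts a port that directly follows ":" or a single space, because the looser pattern would read the 123 in http://www.abc.com/123 as a port.

```python
import re

# Hypothetical, stricter variant of the pattern above: an optional scheme,
# then the host, then an optional port introduced by ":" or a single space.
HOST_PORT = re.compile(
    r'^(?:[A-Za-z][A-Za-z0-9+.-]*://)?'
    r'(?P<host>[^:/ ]+)'
    r'(?:[: ](?P<port>[0-9]+))?'
)

def parse_host_port(text, default_port=80):
    """Return (host, port), falling back to default_port when no port is given."""
    m = HOST_PORT.match(text.strip())
    if m is None:
        raise ValueError("no host found in %r" % text)
    port = m.group('port')
    return (m.group('host'), int(port) if port else default_port)

print(parse_host_port('http://www.acme.com:456'))  # ('www.acme.com', 456)
print(parse_host_port('www.acme.com 456'))         # ('www.acme.com', 456)
print(parse_host_port('www.acme.com'))             # ('www.acme.com', 80)
```

This still does not validate the host name or handle IPv6 literals such as [::1]:80, so it is only suitable for the simple inputs listed in the question.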
Upvotes: 8
Reputation: 10221
The reason it fails for:
www.acme.com 456
is that it is not a valid URI. Why don't you just replace the space with a : and then parse the resulting string with the standard urlparse method?
Try to make use of default functionality as much as possible, especially when it comes to parsing well-known formats like URIs.
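A minimal sketch of relying on the standard parser for all three input forms, assuming Python 3. It joins whitespace-separated parts with a colon and, when no scheme is present, prepends // so that urlsplit treats the text as a network location. The function name host_and_port and the default_port parameter are illustrative and not part of urllib.

```python
from urllib.parse import urlsplit

def host_and_port(text, default_port=80):
    """Return (host, port) for a URL, 'host:port', 'host port', or bare host."""
    # "www.acme.com 456" -> "www.acme.com:456"; other inputs are unchanged.
    candidate = ":".join(text.split())
    # urlsplit only fills netloc after "//", so add it when there is no scheme.
    if "://" not in candidate:
        candidate = "//" + candidate
    parts = urlsplit(candidate)
    port = parts.port if parts.port is not None else default_port
    return parts.hostname, port

print(host_and_port("http://www.acme.com:456"))  # ('www.acme.com', 456)
print(host_and_port("www.acme.com 456"))         # ('www.acme.com', 456)
print(host_and_port("www.acme.com"))             # ('www.acme.com', 80)
```

Note that the hostname attribute is returned in lower case, and that a non-numeric port raises ValueError when the port attribute is read, so callers may want to catch that for untrusted input.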
Upvotes: 6