Reputation: 2764
In Python, there is a standard library module, urllib.parse, that deals with parsing URLs:
>>> import urllib.parse
>>> p = urllib.parse.urlparse("https://127.0.0.1:6443")
>>> p
ParseResult(scheme='https', netloc='127.0.0.1:6443', path='', params='', query='', fragment='')
There are also properties on urllib.parse.ParseResult that return the hostname and the port:
>>> p.hostname
'127.0.0.1'
>>> p.port
6443
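For reference, here is a small sketch of how these properties behave on a URL that also carries credentials and a bracketed IPv6 host. The URL and the variable name q are made up for illustration.

```python
import urllib.parse

# A URL with credentials and a bracketed IPv6 host (made up for illustration).
q = urllib.parse.urlparse("https://user:pw@[::1]:8443/path")

print(q.netloc)    # the raw authority part: credentials, brackets and port included
print(q.username)  # 'user'
print(q.password)  # 'pw'
print(q.hostname)  # '::1' (the square brackets are stripped)
print(q.port)      # 8443 (an int, or None when the URL has no port)
```

Note that hostname drops the brackets, so it cannot be pasted back into a netloc unchanged.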
And, by virtue of ParseResult being a namedtuple, it has a _replace()
method that returns a new ParseResult with the given field(s) replaced:
>>> p._replace(netloc="foobar.tld")
ParseResult(scheme='https', netloc='foobar.tld', path='', params='', query='', fragment='')
However, it cannot replace hostname or port because they are dynamic properties rather than fields of the tuple:
>>> p._replace(hostname="foobar.tld")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.11/collections/__init__.py", line 455, in _replace
    raise ValueError(f'Got unexpected field names: {list(kwds)!r}')
ValueError: Got unexpected field names: ['hostname']
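The difference between fields and properties can be checked directly. This is only an illustrative sketch; it inspects the namedtuple's _fields attribute and the class attributes of ParseResult.

```python
import urllib.parse

p = urllib.parse.urlparse("https://127.0.0.1:6443")

# Only these six names are real tuple fields, so only they work with _replace().
print(p._fields)

# hostname and port are properties, computed from netloc on every access.
print(isinstance(urllib.parse.ParseResult.hostname, property))
print(isinstance(urllib.parse.ParseResult.port, property))
```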
It might be tempting to simply concatenate the new hostname with the existing port and pass it as the new netloc:
>>> p._replace(netloc='{}:{}'.format("foobar.tld", p.port))
ParseResult(scheme='https', netloc='foobar.tld:6443', path='', params='', query='', fragment='')
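However, this quickly turns into a mess once we consider that the netloc may also contain a username and password (as in https://user:[email protected]) and that IPv6 addresses must be wrapped in square brackets (https://::1 isn't valid but https://[::1] is).
What is the cleanest, correct way to replace the hostname in a URL in Python?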
The solution must handle IPv6 (both as a part of the original URL and as the replacement value), URLs containing username/password, and in general all well-formed URLs.
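To make the failure modes concrete, here is a minimal sketch of the naive concatenation approach wrapped in a function. The name naive_replace and the example.com URLs are hypothetical, chosen only for illustration.

```python
import urllib.parse

def naive_replace(url, new_host):
    """The tempting approach: rebuild netloc from the new host and the old port."""
    p = urllib.parse.urlparse(url)
    return p._replace(netloc="{}:{}".format(new_host, p.port)).geturl()

# No port in the original URL: the literal string "None" ends up in the result.
print(naive_replace("https://example.com/path", "foobar.tld"))
# -> https://foobar.tld:None/path

# Credentials in the original URL are silently dropped.
print(naive_replace("https://user:pw@example.com:6443", "foobar.tld"))
# -> https://foobar.tld:6443

# An unbracketed IPv6 replacement produces a netloc that is not well-formed.
print(naive_replace("https://example.com:6443", "::1"))
# -> https://::1:6443
```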
(There is a wide assortment of existing posts that try to ask the same question, but none of them ask for (or provide) a solution that fits all of the criteria above.)
Upvotes: 1
Views: 256
Reputation: 106936
You can first use rpartition('@') on the netloc part of the ParseResult to safely extract the hostinfo portion from netloc.
Since the hostname may contain colons if it is an IPv6 address enclosed in square brackets, and a colon also separates an optional port from the hostname, it is easier and safer to use a regex alternation pattern to extract the hostname portion from the hostinfo for replacement.
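To see what such an alternation picks out, here is a small sketch that applies the pattern to a few hostinfo strings on its own. HOST_RE is just an illustrative name and the sample strings are made up.

```python
import re

# Either a bracketed IPv6 literal, or a run of characters up to the first colon.
HOST_RE = re.compile(r'\[[^]]*]|[^:]+')

for hostinfo in ("127.0.0.1:6443", "[::1]:6443", "[::1]", "example.com"):
    # match() anchors at the start, which is where the host sits in hostinfo.
    print(hostinfo, "->", HOST_RE.match(hostinfo).group())
```

The bracketed alternative comes first, so the colons inside an IPv6 literal are never mistaken for the port separator.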
To handle a replacement hostname that is an IPv6 address, wrap it in square brackets if it contains a colon:
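import re
import urllib.parse

def url_replace_hostname(url_parse_result, hostname):
    # Split off any "user:password@" prefix; hostinfo is "host" or "host:port".
    userinfo, sep, hostinfo = url_parse_result.netloc.rpartition('@')
    if ':' in hostname:
        # A colon means an IPv6 literal, which must be bracketed in a URL.
        hostname = f'[{hostname}]'
    # Replace only the first match: a bracketed host, or everything up to the first colon.
    return url_parse_result._replace(netloc=userinfo + sep +
        re.sub(r'\[[^]]*]|[^:]+', hostname, hostinfo, count=1))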
so that:
for url, hostname in (
    ('https://127.0.0.1:6443', 'foo'),
    ('http://user.pass@bar', '1.2.3.4'),
    ('http://user:pass@80:80', '1111:1111:1111:1111:1111'),
    ('http://[fe80::822a:a8ff:fe49:470c%tESt]:1234', 'baz.com')
):
    print(url_replace_hostname(urllib.parse.urlparse(url), hostname).geturl())
outputs:
https://foo:6443
http://user.pass@1.2.3.4
http://user:pass@[1111:1111:1111:1111:1111]:80
http://baz.com:1234
Demo: https://ideone.com/naRt1C
Upvotes: 0
Reputation: 363163
Nice nerd snipe. Quite difficult to get right.
import urllib.parse
import socket

def is_ipv6(s):
    try:
        socket.inet_pton(socket.AF_INET6, s)
    except Exception:
        return False
    else:
        return True

def host_replace(url, new_host):
    parsed = urllib.parse.urlparse(url)
    _, _, host = parsed.netloc.rpartition("@")
    _, sep, bracketed = host.partition("[")
    if sep:
        host, _, _ = bracketed.partition("]")
        ipv6 = True
    else:
        # ipv4 - might have port suffix
        host, _, _ = host.partition(':')
        ipv6 = False
    new_ipv6 = is_ipv6(new_host)
    if ipv6 and not new_ipv6:
        host = f"[{host}]"
    elif not ipv6 and new_ipv6:
        new_host = f"[{new_host}]"
    port = parsed.port
    netloc = parsed.netloc
    if port is not None:
        netloc = netloc.removesuffix(f":{port}")
    left, sep, right = netloc.rpartition(host)
    new_netloc = left + new_host + right
    if port is not None:
        new_netloc += f":{port}"
    new_url = parsed._replace(netloc=new_netloc).geturl()
    return new_url
I also include my test-cases:
tests = [
    ("https://x.com", "example.org", "https://example.org"),
    ("https://X.com", "example.org", "https://example.org"),
    ("https://x.com/", "example.org", "https://example.org/"),
    ("https://x.com/i.html", "example.org", "https://example.org/i.html"),
    ("https://x.com:8888", "example.org", "https://example.org:8888"),
    ("https://u@x.com:8888", "example.org", "https://u@example.org:8888"),
    ("https://u:p@x.com:8888", "example.org", "https://u:p@example.org:8888"),
    ("https://[::1]:1234", "example.org", "https://example.org:1234"),
    ("https://[::1]:1234", "::2", "https://[::2]:1234"),
    ("https://x.com", "::2", "https://[::2]"),
    ("http://u:p@80:80", "foo", "http://u:p@foo:80"),
]
for url, new_host, expect in tests:
    actual = host_replace(url, new_host)
    assert actual == expect, f"\n{actual=}\n{expect=}"
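As a possible variation (not part of the code above), the is_ipv6 check could be written with the standard library's ipaddress module instead of socket.inet_pton, which avoids relying on the platform's socket support. This is only a sketch under that assumption; the two checks may disagree on edge cases such as zone identifiers, so it is worth re-running the tests after swapping it in.

```python
import ipaddress

def is_ipv6(s):
    """Return True if s parses as an IPv6 address literal (without brackets)."""
    try:
        ipaddress.IPv6Address(s)
    except ValueError:
        # AddressValueError, raised for malformed input, is a ValueError subclass.
        return False
    return True

print(is_ipv6("::2"))          # True
print(is_ipv6("example.org"))  # False
print(is_ipv6("1.2.3.4"))      # False
```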
Upvotes: 1