intelfx

Reputation: 2764

What is the cleanest way to replace a hostname in a URL with Python?

In Python, there is a standard library module urllib.parse that deals with parsing URLs:

>>> import urllib.parse
>>> p = urllib.parse.urlparse("https://127.0.0.1:6443")
>>> p
ParseResult(scheme='https', netloc='127.0.0.1:6443', path='', params='', query='', fragment='')

There are also properties on urllib.parse.ParseResult that return the hostname and the port:

>>> p.hostname
'127.0.0.1'
>>> p.port
6443

And, by virtue of ParseResult being a namedtuple, it has a _replace() method that returns a new ParseResult with the given field(s) replaced:

>>> p._replace(netloc="foobar.tld")
ParseResult(scheme='https', netloc='foobar.tld', path='', params='', query='', fragment='')

However, it cannot replace hostname or port because they are dynamic properties rather than fields of the tuple:

>>> p._replace(hostname="foobar.tld")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.11/collections/__init__.py", line 455, in _replace
    raise ValueError(f'Got unexpected field names: {list(kwds)!r}')
ValueError: Got unexpected field names: ['hostname']
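
To make the distinction concrete, here is a small sketch (the example URL is made up for illustration) showing which names are real tuple fields and which are properties derived from netloc:

```python
import urllib.parse

# Hypothetical URL with userinfo, a bracketed IPv6 host and a port.
p = urllib.parse.urlparse("https://user:pw@[::1]:6443/path")

# Only these six names are tuple fields and therefore valid for _replace().
print(p._fields)

# These are read-only properties computed from netloc on each access.
print(p.username, p.password, p.hostname, p.port)
```

Note that hostname is returned without the square brackets, so it cannot be substituted back into netloc verbatim.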

It might be tempting to simply concatenate the new hostname with the existing port and pass it as the new netloc:

>>> p._replace(netloc='{}:{}'.format("foobar.tld", p.port))
ParseResult(scheme='https', netloc='foobar.tld:6443', path='', params='', query='', fragment='')

However, this quickly turns into a mess once we consider IPv6 addresses and URLs containing a username/password.
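
A rough sketch of how the naive approach goes wrong, assuming a hypothetical helper naive_replace() that wraps the concatenation shown earlier (the URLs are made up for illustration):

```python
import urllib.parse

def naive_replace(url, new_host):
    # Hypothetical helper: glue the new hostname onto the old port.
    p = urllib.parse.urlparse(url)
    return p._replace(netloc="{}:{}".format(new_host, p.port)).geturl()

# No port in the original: p.port is None and gets formatted into the netloc.
print(naive_replace("https://example.com/path", "foobar.tld"))

# The username/password are silently dropped.
print(naive_replace("https://user:pw@example.com:6443", "foobar.tld"))

# An IPv6 replacement ends up without the required square brackets.
print(naive_replace("https://127.0.0.1:6443", "::1"))
```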


What is the cleanest, correct way to replace the hostname in a URL in Python?

The solution must handle IPv6 (both as a part of the original URL and as the replacement value), URLs containing username/password, and in general all well-formed URLs.

(There is a wide assortment of existing posts that try to ask the same question, but none of them ask for (or provide) a solution that fits all of the criteria above.)

Upvotes: 1

Views: 256

Answers (2)

blhsing

Reputation: 106936

You can first use rpartition('@') on the netloc part of the ParseResult to safely extract the hostinfo portion from netloc.

Since the hostname may contain colons if it is an IPv6 address enclosed in square brackets, and a colon also separates the hostname from an optional port, it is easier and safer to use a regex alternation pattern to extract the hostname portion of the hostinfo for replacement.
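
As a quick illustration of what that alternation matches, here is a minimal sketch; the name HOST_PATTERN and the sample hostinfo strings are assumptions made for this example only:

```python
import re

# First alternative: a bracketed IPv6 literal, e.g. "[::1]".
# Second alternative: everything up to (not including) the first colon.
HOST_PATTERN = re.compile(r'\[[^]]*]|[^:]+')

for hostinfo in ("127.0.0.1:6443", "[::1]:1234", "bar"):
    # The first match is the host part; any ":port" suffix is left alone.
    print(hostinfo, "->", HOST_PATTERN.search(hostinfo).group())
```

The bracketed alternative is listed first so that the colons inside an IPv6 literal are never treated as the port separator.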

To handle a replacement hostname that is an IPv6 address, wrap it in square brackets if it contains a colon:

import re
import urllib.parse

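def url_replace_hostname(url_parse_result, hostname):
    # Split off an optional "user:pass@" prefix so that colons in the
    # userinfo are not mistaken for the port separator.
    userinfo, sep, hostinfo = url_parse_result.netloc.rpartition('@')
    # An IPv6 literal must be enclosed in square brackets inside a URL.
    if ':' in hostname:
        hostname = f'[{hostname}]'
    # Replace only the first match: a bracketed IPv6 literal, or everything
    # up to the first colon. The lambda keeps the replacement literal.
    return url_parse_result._replace(netloc=userinfo + sep +
        re.sub(r'\[[^]]*]|[^:]+', lambda m: hostname, hostinfo, count=1))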

so that:

for url, hostname in (
    ('https://127.0.0.1:6443', 'foo'),
    ('http://user.pass@bar', '1.2.3.4'),
    ('http://user:pass@80:80', '1111:1111:1111:1111:1111'),
    ('http://[fe80::822a:a8ff:fe49:470c%tESt]:1234', 'baz.com')
):
    print(url_replace_hostname(urllib.parse.urlparse(url), hostname).geturl())

outputs:

https://foo:6443
http://user.pass@1.2.3.4
http://user:pass@[1111:1111:1111:1111:1111]:80
http://baz.com:1234

Demo: https://ideone.com/naRt1C

Upvotes: 0

wim

Reputation: 363163

Nice nerd snipe. Quite difficult to get right.

import urllib.parse
import socket

def is_ipv6(s):
    try:
        socket.inet_pton(socket.AF_INET6, s)
    except Exception:
        return False
    else:
        return True

def host_replace(url, new_host):
    parsed = urllib.parse.urlparse(url)
    _, _, host = parsed.netloc.rpartition("@")
    _, sep, bracketed = host.partition("[")
    if sep:
        host, _, _ = bracketed.partition("]")
        ipv6 = True
    else:
        # ipv4 - might have port suffix
        host, _, _ = host.partition(':')
        ipv6 = False
    new_ipv6 = is_ipv6(new_host)
    if ipv6 and not new_ipv6:
        host = f"[{host}]"
    elif not ipv6 and new_ipv6:
        new_host = f"[{new_host}]"
    port = parsed.port
    netloc = parsed.netloc
    if port is not None:
        netloc = netloc.removesuffix(f":{port}")
    left, sep, right = netloc.rpartition(host)
    new_netloc = left + new_host + right
    if port is not None:
        new_netloc += f":{port}"
    new_url = parsed._replace(netloc=new_netloc).geturl()
    return new_url

I also include my test-cases:

tests = [
    ("https://x.com", "example.org", "https://example.org"),
    ("https://X.com", "example.org", "https://example.org"),
    ("https://x.com/", "example.org", "https://example.org/"),
    ("https://x.com/i.html", "example.org", "https://example.org/i.html"),
    ("https://x.com:8888", "example.org", "https://example.org:8888"),
    ("https://u@x.com:8888", "example.org", "https://u@example.org:8888"),
    ("https://u:p@x.com:8888", "example.org", "https://u:p@example.org:8888"),
    ("https://[::1]:1234", "example.org", "https://example.org:1234"),
    ("https://[::1]:1234", "::2", "https://[::2]:1234"),
    ("https://x.com", "::2", "https://[::2]"),
    ("http://u:p@80:80", "foo", "http://u:p@foo:80"),
]
for url, new_host, expect in tests:
    actual = host_replace(url, new_host)
    assert actual == expect, f"\n{actual=}\n{expect=}"
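
As a side note, the IPv6 check could probably also be written with the standard library ipaddress module instead of socket.inet_pton. This is only a sketch of an alternative, not part of the code above, and the name is_ipv6_literal is made up for illustration:

```python
import ipaddress

def is_ipv6_literal(s):
    # Hypothetical alternative to is_ipv6(): let ipaddress do the parsing.
    # AddressValueError is a subclass of ValueError.
    try:
        ipaddress.IPv6Address(s)
    except ValueError:
        return False
    return True

for candidate in ("::1", "example.org", "1.2.3.4"):
    print(candidate, is_ipv6_literal(candidate))
```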

Upvotes: 1
