Reputation: 724
Excuse me, what the heck?
>>> import urllib.parse
>>> base = 'http://example.com'
>>> urllib.parse.urljoin(base, 'abc:123')
'http://example.com/abc:123'
>>> urllib.parse.urljoin(base, '123:abc')
'123:abc'
>>> urllib.parse.urljoin(base + '/', './123:abc')
'http://example.com/123:abc'
python3.7 documentation says:
Changed in version 3.5: Behaviour updated to match the semantics defined in RFC 3986.
Which part of that RFC enforces such madness, and whether it should be considered a bug?
Upvotes: 5
Views: 2309
Reputation: 6789
This behavior is correct and consistent with other implementations, as indicated by RFC3986:
A path segment that contains a colon character (e.g., "this:that") cannot be used as the first segment of a relative-path reference, as it would be mistaken for a scheme name. Such a segment must be preceded by a dot-segment (e.g., "./this:that") to make a relative-path reference.
It's been already discussed in another post:
Colons are allowed in the URI path. But you need to be careful when writing relative URI paths with a colon since it is not allowed when used like this:
<a href="tag:sample">
In this case tag would be interpreted as the URI’s scheme. Instead you need to write it like this:
<a href="./tag:sample">
urljoin
The function urljoin
simply treats both arguments as URL (without any presumption). It requires that their schemes to be identical or the second one to represent a relative URI path. Otherwise, it only returns the second argument (although, IMHO, it should raise an error). You can better understand the logic by looking into the source of urljoin.
def urljoin(base, url, allow_fragments=True):
"""Join a base URL and a possibly relative URL to form an absolute
interpretation of the latter."""
...
bscheme, bnetloc, bpath, bparams, bquery, bfragment = \
urlparse(base, '', allow_fragments)
scheme, netloc, path, params, query, fragment = \
urlparse(url, bscheme, allow_fragments)
if scheme != bscheme or scheme not in uses_relative:
return _coerce_result(url)
The results of the parser routine urlparse
are as follow:
>>> from urllib.parse import urlparse
>>> urlparse('123:abc')
ParseResult(scheme='123', netloc='', path='abc', params='', query='', fragment='')
>>> urlparse('abc:123')
ParseResult(scheme='', netloc='', path='abc:123', params='', query='', fragment='')
>>> urlparse('abc:a123')
ParseResult(scheme='abc', netloc='', path='a123', params='', query='', fragment='')
Upvotes: 4