ratijas

Reputation: 724

python urllib.parse.urljoin on path starting with numbers and colon

Excuse me, what the heck?

>>> import urllib.parse
>>> base = 'http://example.com'
>>> urllib.parse.urljoin(base, 'abc:123')
'http://example.com/abc:123'
>>> urllib.parse.urljoin(base, '123:abc')
'123:abc'
>>> urllib.parse.urljoin(base + '/', './123:abc')
'http://example.com/123:abc'

The Python 3.7 documentation says:

Changed in version 3.5: Behaviour updated to match the semantics defined in RFC 3986.

Which part of that RFC enforces such madness, and should it be considered a bug?

Upvotes: 5

Views: 2309

Answers (1)

gdlmx

Reputation: 6789

Which part of that RFC enforces such madness?

This behavior is correct and consistent with other implementations, as stated in RFC 3986 (section 4.2, on relative references):

A path segment that contains a colon character (e.g., "this:that") cannot be used as the first segment of a relative-path reference, as it would be mistaken for a scheme name. Such a segment must be preceded by a dot-segment (e.g., "./this:that") to make a relative-path reference.
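To make the quoted rule concrete, here is a small sketch that uses the generic URI-splitting regular expression from RFC 3986, Appendix B. The helper name split_scheme_and_path is made up for this illustration; it is not part of urllib, and urllib's own parser does not necessarily work this way.

```python
import re

# Generic URI-splitting regex as given in RFC 3986, Appendix B.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

def split_scheme_and_path(ref):
    """Hypothetical helper: return (scheme, path) as the Appendix B regex sees them."""
    m = URI_RE.match(ref)  # every group is optional, so this always matches
    return m.group(2), m.group(5)

# The colon in the first segment makes "this" look like a scheme.
print(split_scheme_and_path('this:that'))    # ('this', 'that')
# A leading dot-segment keeps the whole reference a relative path.
print(split_scheme_and_path('./this:that'))  # (None, './this:that')
```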

This has already been discussed in another post:

Colons are allowed in the URI path. But you need to be careful when writing relative URI paths with a colon since it is not allowed when used like this:

<a href="tag:sample">

In this case tag would be interpreted as the URI’s scheme. Instead you need to write it like this:

<a href="./tag:sample">
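The same "./" trick can be applied in Python before calling urljoin. The following is only a sketch: as_relative_path is a hypothetical helper, and it assumes the caller already knows the string is meant to be a relative path (an absolute URL such as 'http://...' would be prefixed too, which is not what you want).

```python
import re
from urllib.parse import urljoin

# Matches when a colon appears before any '/', '?' or '#',
# i.e. when the first path segment could be mistaken for a scheme.
_COLON_IN_FIRST_SEGMENT = re.compile(r'[^/?#:]*:')

def as_relative_path(ref):
    """Hypothetical helper: prefix './' if the first segment contains a colon."""
    if _COLON_IN_FIRST_SEGMENT.match(ref):
        return './' + ref
    return ref

base = 'http://example.com/'
print(urljoin(base, as_relative_path('123:abc')))     # http://example.com/123:abc
print(urljoin(base, as_relative_path('abc:123')))     # http://example.com/abc:123
print(urljoin(base, as_relative_path('plain/path')))  # http://example.com/plain/path
```

Because the prefixed form never looks like a scheme, the result should not depend on how a particular Python release parses strings like '123:abc'.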

Usage of urljoin

The function urljoin simply treats both arguments as URLs (without any presumption). It requires that their schemes be identical and allow relative references; a second argument without a scheme inherits the base's scheme, so a relative URI path qualifies. Otherwise, it returns only the second argument (although, IMHO, it should raise an error). You can better understand the logic by looking at the source of urljoin:

def urljoin(base, url, allow_fragments=True):
    """Join a base URL and a possibly relative URL to form an absolute
    interpretation of the latter."""
    ...
    bscheme, bnetloc, bpath, bparams, bquery, bfragment = \
            urlparse(base, '', allow_fragments)
    scheme, netloc, path, params, query, fragment = \
            urlparse(url, bscheme, allow_fragments)

    if scheme != bscheme or scheme not in uses_relative:
        return _coerce_result(url)

The results of the parser routine urlparse are as follows:

>>> from urllib.parse import urlparse
>>> urlparse('123:abc')
ParseResult(scheme='123', netloc='', path='abc', params='', query='', fragment='')
>>> urlparse('abc:123')
ParseResult(scheme='', netloc='', path='abc:123', params='', query='', fragment='')
>>> urlparse('abc:a123')
ParseResult(scheme='abc', netloc='', path='a123', params='', query='', fragment='')
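Putting the two pieces together, the scheme check from the urljoin excerpt can be sketched as a standalone predicate. The name is_resolved_against_base is invented for this example, and it mirrors only that one check, not the rest of urljoin. How urlparse splits inputs such as '123:abc' may differ between Python releases, so the sketch sticks to inputs whose outcome should not depend on the version.

```python
from urllib.parse import urlparse, uses_relative

def is_resolved_against_base(base, url):
    """Hypothetical predicate: True if urljoin's scheme check would let
    `url` be joined to `base` instead of being returned unchanged."""
    bscheme = urlparse(base).scheme
    # As in urljoin, the base scheme is the default when `url` has none.
    scheme = urlparse(url, bscheme).scheme
    return scheme == bscheme and scheme in uses_relative

base = 'http://example.com/'
print(is_resolved_against_base(base, './123:abc'))  # True: no scheme, inherits 'http'
print(is_resolved_against_base(base, 'page.html'))  # True
print(is_resolved_against_base(base, 'abc:a123'))   # False: parsed as scheme 'abc'
```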

Upvotes: 4
