user9013730

Python 3: `netloc` value in `urllib.parse` is empty if URL doesn't have `//`

I notice that netloc is empty if the URL doesn't have //.

Without //, netloc is empty:

>>> from urllib.parse import urlparse
>>> urlparse('google.com')
ParseResult(scheme='', netloc='', path='google.com', params='', query='', fragment='')
>>>
>>> urlparse('www.google.com')
ParseResult(scheme='', netloc='', path='www.google.com', params='', query='', fragment='')
>>>
>>> urlparse('google.com/search?q=python')
ParseResult(scheme='', netloc='', path='google.com/search', params='', query='q=python', fragment='')
>>>

With //, netloc is identified correctly:

>>> urlparse('http://google.com')
ParseResult(scheme='http', netloc='google.com', path='', params='', query='', fragment='')
>>>
>>> urlparse('//google.com')
ParseResult(scheme='', netloc='google.com', path='', params='', query='', fragment='')
>>>
>>> urlparse('http://google.com/search?q=python')
ParseResult(scheme='http', netloc='google.com', path='/search', params='', query='q=python', fragment='')
>>>

Would it be possible to identify netloc correctly even if // is not provided in the URL?

Upvotes: 5

Views: 4363

Answers (3)

Um Cara Qualquer

Reputation: 61

I usually do something like this:

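from urllib.parse import urlparse, ParseResult

def createParser(url: str, default_scheme: str = 'https') -> ParseResult:
    # Trim whitespace and surrounding slashes, so '//host' is handled like 'host'
    url = url.strip().strip('/')
    parser = urlparse(url)

    # An empty netloc means urlparse treated the input as a relative path:
    # prepend the default scheme and parse once more.
    if not parser.netloc:
        parser = urlparse(f'{default_scheme}://{url}')

    return parser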

parser: ParseResult = createParser('stackoverflow.com/questions/53816559')
print(parser) # ParseResult(scheme='https', netloc='stackoverflow.com', path='/questions/53816559', params='', query='', fragment='')

parser2: ParseResult = createParser('http://stackoverflow.com/questions/53816559')
print(parser2) # ParseResult(scheme='http', netloc='stackoverflow.com', path='/questions/53816559', params='', query='', fragment='')

Importing ParseResult is optional, since it is only used for the type hints.

If you give it a URL that has no scheme (even one that starts with //, since leading slashes are stripped), it adds the default scheme to the URL and parses it again.

Upvotes: 0

rpdelaney

Reputation: 194

I'm working on an application that needs to parse out the scheme and netloc from a URL that might not have any scheme set. I've settled on this approach, although it is smelly and I doubt it will handle every corner case either.

Python 3.8.0 (default, Dec  3 2019, 17:33:19)
[GCC 9.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.parse
>>> url="google.com"
>>> o = urllib.parse.urlsplit(url)
>>> u = urllib.parse.SplitResult(
...     scheme=o.scheme if o.scheme else "https",
...     netloc=o.netloc if o.netloc else o.path,
...     path="",
...     query="",
...     fragment=""
... )
>>> urllib.parse.urlunsplit(u)
'https://google.com'
>>>
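
The session above discards the path and query. A minimal sketch of one way to keep them, assuming that when no netloc is found the first path segment is the host. The helper name `with_default_scheme` is hypothetical and not part of urllib; inputs such as `mailto:` URLs or a bare `host:port` are not covered and may parse differently across Python versions.

```python
from urllib.parse import urlsplit, urlunsplit, SplitResult

def with_default_scheme(url, default_scheme="https"):
    """Hypothetical helper: return url with scheme and netloc filled in."""
    o = urlsplit(url.strip())
    if o.netloc:
        # '//host/path' or 'scheme://host/path': only the scheme can be missing.
        return urlunsplit(o._replace(scheme=o.scheme or default_scheme))
    # No netloc was recognised: assume the first path segment is the host.
    host, sep, rest = o.path.partition("/")
    return urlunsplit(SplitResult(
        scheme=o.scheme or default_scheme,
        netloc=host,
        path=sep + rest,  # empty when the input had no '/'
        query=o.query,
        fragment=o.fragment,
    ))

print(with_default_scheme("google.com/search?q=python"))
```

URLs that already have a scheme and netloc pass through unchanged.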

Upvotes: 0

DeepSpace

Reputation: 81604

Would it be possible to identify netloc correctly even if // is not provided in the URL?

Not by using urlparse. This is explicitly explained in the documentation:

Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by //. Otherwise the input is presumed to be a relative URL and thus to start with a path component.


If you don't want to rewrite urlparse's logic (which I would not suggest), make sure url starts with //:

if not url.startswith('//'):
    url = '//' + url

EDIT

The above is actually a bad solution, as @alexis noted (it also prepends // to URLs that already start with http://). Perhaps:

if not url.startswith(('//', 'http://', 'https://')):
    url = '//' + url

But your mileage may vary with that solution as well. If you have to support a wide variety of inconsistent formats you may have to resort to regex.
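
As a rough sketch of the regex route, assuming the inputs are web-style URLs with or without a leading `scheme://`: the pattern below follows the scheme grammar from RFC 3986 (a letter followed by letters, digits, `+`, `-` or `.`), so it is not limited to http and https. The name `parse_with_netloc` is hypothetical, and schemes without `//` such as `mailto:` would still be misread as a host.

```python
import re
from urllib.parse import urlparse

# A leading "scheme://", using the RFC 3986 scheme characters.
_SCHEME_RE = re.compile(r'^[A-Za-z][A-Za-z0-9+.-]*://')

def parse_with_netloc(url):
    """Hypothetical helper: parse url so a bare host lands in netloc."""
    url = url.strip()
    # Prepend '//' only when neither '//' nor 'scheme://' is already there.
    if not url.startswith('//') and not _SCHEME_RE.match(url):
        url = '//' + url
    return urlparse(url)

print(parse_with_netloc('google.com/search?q=python'))
print(parse_with_netloc('ftp://example.com/file'))
```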

Upvotes: 7
