Evandro Coan

Reputation: 9485

How to build URL paths from a list of parts?

Doing:

from urllib.parse import urljoin
urljoin('https://site/folder', 'page')

This returns https://site/page, dropping folder. That is fine, since I can append a trailing / to the base myself. But when my variable already ends with / and I append another one, I get double slashes:

urljoin('https://site/folder//', 'page')
>>> 'https://site/folder//page'

Isn't it wrong for urljoin to allow these double slashes (//) when joining URLs?

How can I join a list of URL parts so that calls like these give the results shown?

urljoin('https://site/folder', 'page', 'otherpage' )
> https://site/folder/page/otherpage

urljoin('https://site/folder', 'page', 'otherpage.jsf' )
> https://site/folder/page/otherpage.jsf

urljoin('https://site/folder/' , 'page.htm', )
> https://site/folder/page.htm

urljoin('https://site/folder//', '/page', '///otherpage' )
> https://site/folder/page/otherpage

urljoin('https://site/folder//', '//page/',  '//otherpage.php'  )
> https://site/folder/page/otherpage.php

urljoin('https://site/folder//', 'page', '/otherpage////' )
> https://site/folder/page/otherpage

Upvotes: 3

Views: 2617

Answers (4)

Julio Daniel Reyes

Reputation: 6365

I'm sure there are different ways to do it, but here is one that folds urljoin over the cleaned parts with reduce:

from urllib.parse import urljoin
from functools import reduce # python3

def clean_url(url):
    return url.strip('/') + '/'

def joinurllist(urls):
    return reduce(urljoin, map(clean_url, urls))

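joinurllist(['https://site/folder//', 'page', '///otherpage/'])
# 'https://site/folder/page/otherpage/' (clean_url ends every part with '/', so the result keeps a trailing slash)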
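The expected outputs in the question have no trailing slash, while clean_url above ends every part with one. A minimal sketch of a variant that trims the final slash, assuming a hypothetical helper named join_url_parts (it is not part of urllib):

```python
from functools import reduce
from urllib.parse import urljoin


def join_url_parts(*parts):
    # Hypothetical helper, not part of urllib.
    # Strip slashes from both ends of every part, then end each part with
    # exactly one '/' so urljoin treats it as a directory and appends to it.
    cleaned = (part.strip('/') + '/' for part in parts)
    # Fold urljoin over the parts and drop the final trailing slash.
    return reduce(urljoin, cleaned).rstrip('/')


print(join_url_parts('https://site/folder', 'page', 'otherpage'))
print(join_url_parts('https://site/folder//', '//page/', '//otherpage.php'))
```

Because this still goes through urljoin, a part such as '..' or a full URL is resolved rather than appended literally, and an empty part resets the path to the root, so it is probably only suitable for plain path segments.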

Upvotes: 1

cowbert

Reputation: 3463

Consecutive slashes (//...) are legal in a URI path.

urljoin checks whether the base ends with a trailing /. If it does, the last segment is kept as a branch (a directory); otherwise it is treated as a leaf and replaced.

So:

>>> urljoin('/foo/bar/','page')
'/foo/bar/page'

>>> urljoin('/foo/bar', 'page')
'/foo/page'

If you really want to avoid the extra /, rstrip() the base and append a single / (rstrip() alone turns the last segment into a leaf, as the first example shows):

>>> urljoin('/foo/bar/'.rstrip('/'), 'page')
'/foo/page'

>>> urljoin('/foo/bar///'.rstrip('/') + '/', 'page')
'/foo/bar/page'

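What you might want to do is something like this (urljoin only takes a base and one URL, so fold it over the list with reduce, and strip both ends so that no part is treated as an absolute path):

from functools import reduce
L = ['root', 'part1','/part2/','//part3//']
reduce(urljoin, [p.strip('/') + '/' for p in L])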

Upvotes: 2

Evandro Coan

Reputation: 9485

I wrote this URL join function, which does what I asked for:

def _clean_urljoin(url):
    # Strip any run of slashes and spaces from both ends of the part
    return url.strip('/ ')


def clean_urljoin(*urls):
    # Clean each part, then join the parts with exactly one slash
    return "/".join(_clean_urljoin(url) for url in urls)


print( clean_urljoin( 'https://site/folder'   , 'page'     , 'otherpage'       ) )
print( clean_urljoin( 'https://site/folder'   , 'page'     , 'otherpage.jsf'   ) )
print( clean_urljoin( 'https://site/folder/'  , 'page.htm' ,                   ) )
print( clean_urljoin( 'https://site/folder//' , '/page'    , '///otherpage'    ) )
print( clean_urljoin( 'https://site/folder//' , '//page/'  , '//otherpage.php' ) )
print( clean_urljoin( 'https://site/folder//' , 'page'     , '/otherpage////'  ) )

Running this returns:

$ python3 test.py
https://site/folder/page/otherpage
https://site/folder/page/otherpage.jsf
https://site/folder/page.htm
https://site/folder/page/otherpage
https://site/folder/page/otherpage.php
https://site/folder/page/otherpage

Upvotes: 2

Jason Fuerstenberg

Reputation: 1351

This behavior is mentioned in the Python docs.

Leaving a trailing slash on the base is a reasonable way to make urljoin append the path component instead of replacing the last one.
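As a small illustration of that point, here is a sketch assuming a hypothetical helper named ensure_trailing_slash (not something urllib provides) that marks the base as a directory before joining:

```python
from urllib.parse import urljoin


def ensure_trailing_slash(base):
    # Hypothetical helper: add a single '/' only when the base lacks one,
    # so urljoin treats the last segment as a directory.
    return base if base.endswith('/') else base + '/'


# Without the trailing slash the last segment is replaced.
print(urljoin('https://site/folder', 'page'))
# -> https://site/page

# With it, the new component is appended.
print(urljoin(ensure_trailing_slash('https://site/folder'), 'page'))
# -> https://site/folder/page
```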

Upvotes: 2
