Reputation: 9485
Doing:
from urllib.parse import urljoin
urljoin('https://site/folder', 'page')
returns 'https://site/page'. That is fine, since I can append a / to the base myself. But when my variable already ends with / and I append another, I get a double slash:
urljoin('https://site/folder//', 'page')
>>> 'https://site/folder//page'
Isn't it wrong for urljoin to allow this double slash // when joining URLs?
How can I join a list of URL parts like this:
urljoin('https://site/folder', 'page', 'otherpage' )
> https://site/folder/page/otherpage
urljoin('https://site/folder', 'page', 'otherpage.jsf' )
> https://site/folder/page/otherpage.jsf
urljoin('https://site/folder/' , 'page.htm', )
> https://site/folder/page.htm
urljoin('https://site/folder//', '/page', '///otherpage' )
> https://site/folder/page/otherpage
urljoin('https://site/folder//', '//page/', '//otherpage.php' )
> https://site/folder/page/otherpage.php
urljoin('https://site/folder//', 'page', '/otherpage////' )
> https://site/folder/page/otherpage
Upvotes: 3
Views: 2617
Reputation: 6365
I'm sure there are different ways to do it
from urllib.parse import urljoin
from functools import reduce # python3
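def clean_url(url):
    # Remove slashes from both ends, then add exactly one trailing slash
    return url.strip('/') + '/'

def joinurllist(urls):
    # Fold urljoin over the cleaned parts, left to right
    return reduce(urljoin, map(clean_url, urls))
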
joinurllist(['https://site/folder//', 'page', '///otherpage/'])
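Because clean_url re-appends a slash to every part, the result of the call above ends with a trailing slash. If that is not wanted, one possible variant is sketched below. The name join_url_parts is made up for this illustration, and it assumes a non-empty list of parts that do not themselves look like absolute URLs:

```python
from functools import reduce
from urllib.parse import urljoin


def join_url_parts(parts):
    """Join URL parts left to right with urljoin (hypothetical helper)."""
    parts = list(parts)
    # Every part except the last gets exactly one trailing slash, so
    # urljoin treats it as a directory and keeps it.
    dirs = [p.strip('/') + '/' for p in parts[:-1]]
    # The last part loses its surrounding slashes and becomes the leaf.
    last = parts[-1].strip('/')
    return reduce(urljoin, dirs + [last])


print(join_url_parts(['https://site/folder//', 'page', '///otherpage/']))
# https://site/folder/page/otherpage
```

An empty list raises IndexError here; handle that case separately if it can occur.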
Upvotes: 1
Reputation: 3463
// is a legal URI path.
urljoin checks whether the base has a trailing /. If it does, it treats the last segment as a branch (a directory) and keeps it; if not, it treats it as a leaf and replaces it.
So:
>>> urljoin('/foo/bar/','page')
'/foo/bar/page'
>>> urljoin('/foo/bar', 'page')
'/foo/page'
If you really want to avoid the extra /, then rstrip() it and append a single one:
>>> urljoin('/foo/bar/'.rstrip('/'), 'page')
'/foo/page'
>>> urljoin('/foo/bar///'.rstrip('/') + '/', 'page')
'/foo/bar/page'
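What you might want to do is something like:
from functools import reduce
L = ['root', 'part1', '/part2/', '//part3//']
# urljoin takes exactly two arguments, so fold it over the list.
# Leading slashes are stripped too; otherwise urljoin would treat
# '/part2/' as an absolute path and discard what came before it.
reduce(urljoin, [p.strip('/') + '/' for p in L])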
Upvotes: 2
Reputation: 9485
I wrote this URL join function which does it:
def _clean_urljoin(url):
    # Recursively drop leading slashes and spaces
    if url.startswith( '/' ) or url.startswith( ' ' ):
        url = url[1:]
        url = _clean_urljoin( url )
    # Recursively drop trailing slashes and spaces
    if url.endswith( '/' ) or url.endswith( ' ' ):
        url = url[0:-1]
        url = _clean_urljoin( url )
    return url

def clean_urljoin(*urls):
    fixed_urls = []
    for url in urls:
        fixed_urls.append( _clean_urljoin(url) )
    return "/".join( fixed_urls )

print( clean_urljoin( 'https://site/folder' , 'page' , 'otherpage' ) )
print( clean_urljoin( 'https://site/folder' , 'page' , 'otherpage.jsf' ) )
print( clean_urljoin( 'https://site/folder/' , 'page.htm' , ) )
print( clean_urljoin( 'https://site/folder//' , '/page' , '///otherpage' ) )
print( clean_urljoin( 'https://site/folder//' , '//page/' , '//otherpage.php' ) )
print( clean_urljoin( 'https://site/folder//' , 'page' , '/otherpage////' ) )
Running this returns:
$ python3 test.py
https://site/folder/page/otherpage
https://site/folder/page/otherpage.jsf
https://site/folder/page.htm
https://site/folder/page/otherpage
https://site/folder/page/otherpage.php
https://site/folder/page/otherpage
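As a side note, the recursive helper above removes any run of slashes and spaces from both ends of a string, which is what str.strip('/ ') does. A shorter sketch of the same idea follows; the name clean_urljoin_short is only for illustration, not part of the original answer:

```python
def clean_urljoin_short(*urls):
    """Join parts with single slashes (illustrative rewrite)."""
    # str.strip('/ ') drops every leading and trailing '/' or ' ' character,
    # so each part contributes no slashes of its own at the seams.
    return "/".join(url.strip("/ ") for url in urls)


print(clean_urljoin_short('https://site/folder//', '//page/', '//otherpage.php'))
# https://site/folder/page/otherpage.php
```

Like the original, this is plain string joining: it does not resolve '..' segments or absolute paths the way urljoin does.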
Upvotes: 2
Reputation: 1351
This behavior is mentioned in the Python docs.
Leaving a trailing slash on the base is a reasonable way to append a path component with urljoin.
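A minimal sketch of that approach, assuming the base may or may not already end with a slash (as_directory is a hypothetical helper name, not something from urllib):

```python
from urllib.parse import urljoin


def as_directory(base):
    """Return base with exactly one trailing slash (hypothetical helper)."""
    return base.rstrip('/') + '/'


# With the trailing slash in place, urljoin appends instead of replacing
print(urljoin(as_directory('https://site/folder'), 'page'))
# https://site/folder/page
print(urljoin(as_directory('https://site/folder/'), 'page'))
# https://site/folder/page
```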
Upvotes: 2