Reputation: 1049
I'm making an app that parses HTML and extracts images from it. Parsing is easy with Beautiful Soup, and downloading the HTML and the images works with urllib2.
I do have a problem using urlparse to turn relative paths into absolute ones. The problem is best explained with an example:
>>> import urlparse
>>> urlparse.urljoin("http://www.example.com/", "../test.png")
'http://www.example.com/../test.png'
As you can see, urlparse doesn't remove the ../. This causes a problem when I try to download the image:
HTTPError: HTTP Error 400: Bad Request
Is there a way to fix this problem in urllib?
Upvotes: 2
Views: 2264
Reputation: 414905
If you'd like /../test to mean the same as /test, the way paths work in a file system, then you could use normpath():
>>> import posixpath, urlparse
>>> url = urlparse.urljoin("http://example.com/", "../test")
>>> p = urlparse.urlparse(url)
>>> path = posixpath.normpath(p.path)
>>> urlparse.urlunparse((p.scheme, p.netloc, path, p.params, p.query, p.fragment))
'http://example.com/test'
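The same steps can be wrapped in a helper. This is only a sketch: the name join_and_normalize is made up for illustration, and the import fallback is an assumption so that the snippet also runs on Python 3, where these functions live in urllib.parse. One thing to watch for is that normpath() drops a trailing slash, which can matter for URLs, so the sketch puts it back.

```python
import posixpath

try:  # Python 3
    from urllib.parse import urljoin, urlparse, urlunparse
except ImportError:  # Python 2, as used in the question
    from urlparse import urljoin, urlparse, urlunparse


def join_and_normalize(base, relative):
    """Join two URLs, then collapse '.' and '..' segments in the path.

    Hypothetical helper, not part of the standard library.
    """
    parts = urlparse(urljoin(base, relative))
    # Leave an empty path alone: normpath('') would turn it into '.'.
    path = posixpath.normpath(parts.path) if parts.path else ""
    # normpath() strips a trailing slash; restore it if the joined URL had one.
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"
    return urlunparse(parts._replace(path=path))
```

For example, join_and_normalize("http://www.example.com/", "../test.png") gives 'http://www.example.com/test.png', and a directory URL such as "http://example.com/a/b/" joined with "../c/" keeps its trailing slash.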
Upvotes: 1
Reputation: 13957
I think the best you can do is to pre-parse the original URL and check its path component. A simple test is:
if len(urlparse.urlparse(baseurl).path) > 1:
Then you can combine it with the indexing suggested by demas. For example:
start_offset = 2 if len(urlparse.urlparse(baseurl).path) <= 1 else 0
img_url = urlparse.urljoin(baseurl, "../test.png"[start_offset:])
This way, you will not attempt to go to the parent of the root URL.
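Put together as a function, the idea might look like the sketch below. The name join_from_root_safely is hypothetical, and the loop that strips every leading "../" (rather than slicing off a fixed two characters) is my own variation on the approach, not something from the answers here. The import fallback is only there so the snippet also runs on Python 3.

```python
try:  # Python 3
    from urllib.parse import urljoin, urlparse
except ImportError:  # Python 2, as used in the question
    from urlparse import urljoin, urlparse


def join_from_root_safely(baseurl, relative):
    """Drop leading '../' segments when baseurl is already at the site root.

    Hypothetical helper for illustration only.
    """
    # A root URL has a path of '' or '/', so its length is at most 1.
    at_root = len(urlparse(baseurl).path) <= 1
    if at_root:
        # There is no parent above the root, so '../' cannot mean anything here.
        while relative.startswith("../"):
            relative = relative[3:]
    return urljoin(baseurl, relative)
```

Note that this only covers a base URL at the root. A relative path with more "../" segments than the base URL has directories could still climb past the root, and that case needs a path normalization step instead.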
Upvotes: 2
Reputation: 45335
urlparse.urljoin("http://www.example.com/", "../test.png"[2:])
Is this what you need?
Upvotes: 0
Reputation: 2439
".." would bring you up one directory ("." is the current directory), so combining it with a URL that has only a domain name doesn't make much sense. Maybe what you need is:
>>> urlparse.urljoin("http://www.example.com","./test.png")
'http://www.example.com/test.png'
Upvotes: 3