Reputation: 375
While trying to access a file whose name contains UTF-8 characters from a browser, I get this error:
The requested URL /images/0/04/×¤×ª×¨×•× ×•×ª_תרגילי×_על_משטחי×_דיפ'_2014.pdf was not found on this server.
Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.
To access the file, I wrote the following Python script:
# encoding: utf8
__author__ = 'Danis'
__date__ = '20/10/14'
import urllib
curr_link = u'http://math-wiki.com/images/0/04/2014_\'דיפ_משטחים_על_פתרונות.nn uft8pdf'
urllib.urlretrieve(curr_link, 'home/danisf/targil4.pdf')
But when I run the code, I get the error: URLError:<curr_link appears here> contains non-ASCII characters
How can I fix the code to make it work? (By the way, I don't have access to the server or to the webmaster.) Or is it possible that the browser failed for some reason other than a bad encoding of the file name?
Upvotes: 2
Views: 2034
Reputation: 1121386
You cannot just pass Unicode URLs into urllib functions; URLs must be valid bytestrings instead. You'll need to encode the URL to UTF-8, then URL-quote its path:
import urllib
import urlparse
curr_link = u'http://math-wiki.com/images/0/04/2014_\'דיפ_משטחים_על_פתרונות.nn uft8pdf'
parsed_link = urlparse.urlsplit(curr_link.encode('utf8'))
parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
encoded_link = parsed_link.geturl()
urllib.urlretrieve(encoded_link, 'home/danisf/targil4.pdf')
However, the specific URL you provided in your question produces a 404 error.
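For readers on Python 3, the same approach could be sketched roughly as follows. This is not part of the code above: it assumes Python 3, where urlparse and urllib.quote are available from urllib.parse, and quote_url_path is a hypothetical helper name chosen for illustration. The download call is left commented out because this URL is not expected to resolve.

```python
from urllib.parse import quote, urlsplit

def quote_url_path(url):
    """Percent-encode only the path of a Unicode URL (hypothetical helper)."""
    parts = urlsplit(url)
    # quote() encodes str input as UTF-8 by default and leaves '/' unescaped,
    # so the path separators survive while non-ASCII characters are escaped.
    return parts._replace(path=quote(parts.path)).geturl()

curr_link = "http://math-wiki.com/images/0/04/2014_'דיפ_משטחים_על_פתרונות.nn uft8pdf"
encoded_link = quote_url_path(curr_link)
print(encoded_link)

# Assumed follow-up step, mirroring the Python 2 code above:
# from urllib.request import urlretrieve
# urlretrieve(encoded_link, 'home/danisf/targil4.pdf')
```

Only the path is quoted here, so the scheme and host are left untouched; a URL with a non-ASCII host name or query string would need separate handling.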
Demo:
>>> import urllib
>>> import urlparse
>>> curr_link = u'http://math-wiki.com/images/0/04/2014_\'דיפ_משטחים_על_פתרונות.nn uft8pdf'
>>> parsed_link = urlparse.urlsplit(curr_link.encode('utf8'))
>>> parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
>>> print parsed_link.geturl()
http://math-wiki.com/images/0/04/2014_%27%D7%93%D7%99%D7%A4_%D7%9E%D7%A9%D7%98%D7%97%D7%99%D7%9D_%D7%A2%D7%9C_%D7%A4%D7%AA%D7%A8%D7%95%D7%A0%D7%95%D7%AA.nn%20uft8pdf
Your browser usually decodes percent-encoded UTF-8 bytes like these to display a readable URL, but when it sends the request to the server, the URL is encoded in exactly this manner.
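The round trip between the readable form and the form sent over the wire can be illustrated with a small sketch. It assumes Python 3 (urllib.parse), and the path below is a shortened, hypothetical example rather than the real file name.

```python
from urllib.parse import quote, unquote

# Hypothetical path, shortened for illustration.
readable_path = "/images/0/04/פתרונות.pdf"

# What is actually sent to the server: UTF-8 bytes, percent-encoded.
wire_path = quote(readable_path)

# What a browser typically shows in the address bar: the decoded form.
displayed_path = unquote(wire_path)

print(wire_path)
print(displayed_path == readable_path)
```

Both strings name the same resource; the difference is only in how the bytes are presented.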
Upvotes: 3