Danis Fischer

Reputation: 375

Sending a UTF-8 address to urlretrieve in Python

While trying to access a file whose name contains UTF-8 characters from a browser, I get the error:

The requested URL /images/0/04/×¤×ª×¨×•× ×•×ª_תרגילי×_על_משטחי×_דיפ'_2014.pdf was not found on this server.

Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.

In order to access the file I wrote the following Python script:

# encoding: utf8
__author__ = 'Danis'
__date__ = '20/10/14'

import urllib

curr_link = u'http://math-wiki.com/images/0/04/2014_\'דיפ_משטחים_על_פתרונות.nn uft8pdf'

urllib.urlretrieve(curr_link, 'home/danisf/targil4.pdf')

But when I run the code I get the error: URLError: <curr_link appears here> contains non-ASCII characters

How can I fix the code to make it work? (By the way, I don't have access to the server or to the webmaster.) Or maybe the browser failed for some reason other than a bad encoding of the file name?

Upvotes: 2

Views: 2034

Answers (1)

Martijn Pieters

Reputation: 1121386

You cannot just pass Unicode URLs into urllib functions; URLs must be valid bytestrings instead. You'll need to encode to UTF-8, then URL-quote (percent-encode) the path of your URL:

import urllib
import urlparse

curr_link = u'http://math-wiki.com/images/0/04/2014_\'דיפ_משטחים_על_פתרונות.nn uft8pdf'
parsed_link = urlparse.urlsplit(curr_link.encode('utf8'))
parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
encoded_link = parsed_link.geturl()

urllib.urlretrieve(encoded_link, 'home/danisf/targil4.pdf')

The specific URL you provided in your question produces a 404 error, however.

Demo:

>>> import urllib
>>> import urlparse
>>> curr_link = u'http://math-wiki.com/images/0/04/2014_\'דיפ_משטחים_על_פתרונות.nn uft8pdf'
>>> parsed_link = urlparse.urlsplit(curr_link.encode('utf8'))
>>> parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
>>> print parsed_link.geturl()
http://math-wiki.com/images/0/04/2014_%27%D7%93%D7%99%D7%A4_%D7%9E%D7%A9%D7%98%D7%97%D7%99%D7%9D_%D7%A2%D7%9C_%D7%A4%D7%AA%D7%A8%D7%95%D7%A0%D7%95%D7%AA.nn%20uft8pdf

Your browser usually decodes percent-encoded UTF-8 bytes like these to present a readable URL, but when it sends the URL to the server to retrieve the file, the URL is encoded in exactly this manner.
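
The code above targets Python 2. As a minimal sketch of the same idea on Python 3, using only the standard library (there, urlparse and urllib.quote live in urllib.parse, and quote() encodes text as UTF-8 by default), something like the following should work. The helper name quote_url_path and the example.com URL are illustrative assumptions, not taken from the question:

```python
from urllib.parse import quote, unquote, urlsplit


def quote_url_path(url):
    """Return url with its path percent-encoded (hypothetical helper)."""
    parts = urlsplit(url)
    # quote() encodes non-ASCII text as UTF-8 and leaves '/' unescaped by default.
    parts = parts._replace(path=quote(parts.path))
    return parts.geturl()


# Illustrative URL: a space and the Hebrew letter dalet (U+05D3) in the path.
url = 'http://example.com/a b/\u05d3.pdf'
encoded = quote_url_path(url)
print(encoded)

# unquote() reverses the encoding, much as a browser does for display.
decoded = unquote(encoded)
print(decoded == url)
```

The encoded result can then be passed to urllib.request.urlretrieve. This sketch assumes the path is not already percent-encoded; quoting an already-encoded URL a second time would turn each % into %25.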

Upvotes: 3
