Reputation: 16870
I currently have serious problems with encoding under Linux (Ubuntu). I never needed to deal with this before, so I have no idea why it doesn't work!
I'm parsing *.desktop files from /usr/share/applications/ and extracting information which is shown in the web browser via an HTTPServer. I'm using jinja2 for templating.
First, I received a UnicodeDecodeError at the call to jinja2.Template.render() which said that

utf-8 cannot decode character XXX at position YY [...]
So I made all values that come from my appfind module (which parses the *.desktop files) return only unicode strings.
The problem at that place was solved, but at some point I am writing a string returned by a function to the BaseHTTPServer.BaseHTTPRequestHandler.wfile slot, and I can't get this error fixed, no matter what encoding I use.
At this point, the string that is written to wfile comes from jinja2.Template.render() which, afaik, returns a unicode object.
The bizarre part is that it works on my Ubuntu 12.04 LTS but not on my friend's Ubuntu 11.04 LTS. However, that might not be the cause: he has a lot more applications installed, and maybe they use encodings in their *.desktop files that trigger the error.
Still, I properly checked for the encoding in the *.desktop files:
data = dict(parser.items('Desktop Entry'))
try:
    encoding = data.get('encoding', 'utf-8')
    result = {
        'name': data['name'].decode(encoding),
        'exec': DKENTRY_EXECREPL.sub('', data['exec']).decode(encoding),
        'type': data['type'].decode(encoding),
        'version': float(data.get('version', 1.0)),
        'encoding': encoding,
        'comment': data.get('comment', '').decode(encoding) or None,
        'categories': _filter_bool(data.get('categories', '')
                                   .decode(encoding).split(';')),
        'mimetypes': _filter_bool(data.get('mimetype', '')
                                  .decode(encoding).split(';')),
    }
    # ...
Can someone please enlighten me about how I can fix this error? I have read in an answer on SO that I should always use unicode(), but that would be a lot of pain to implement, and I don't think it would fix the problem when writing to wfile anyway.
Thanks,
Niklas
Upvotes: 0
Views: 1313
Reputation: 375584
You need to take a disciplined approach to your byte strings and Unicode strings. This explains it all: Pragmatic Unicode, or, How Do I Stop the Pain?
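The short version of that discipline is the "unicode sandwich": decode bytes to unicode as early as possible, do all of your processing on unicode, and encode back to bytes as late as possible. A minimal sketch (Python 2; the file path is only an example):

# Unicode sandwich: bytes at the edges, unicode in the middle.
with open('/usr/share/applications/foo.desktop', 'rb') as f:
    raw = f.read()                    # bytes in
text = raw.decode('utf-8')            # decode once, at the input boundary
page = u'<pre>%s</pre>' % text        # all internal work is on unicode
output = page.encode('utf-8')         # encode once, at the output boundary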
Upvotes: 1
Reputation: 413
Try using .encode(encoding) instead of .decode(encoding) in all its occurrences in the snippet.
Upvotes: 0
Reputation: 14900
By default, when Python hits an encoding issue with unicode, it throws an error. However, this behavior can be modified, for example when the error is expected or not important.
Say you are converting between two code pages that are supersets of ASCII. They both have mostly the same characters, but there is no one-to-one correspondence. Therefore, you would want to ignore errors.
To do so, use the errors argument of the encode function.
mystring = u'This is a test'
print mystring.encode('utf-8', 'ignore')             # drop characters that cannot be encoded
print mystring.encode('utf-8', 'replace')            # replace them with '?'
print mystring.encode('utf-8', 'xmlcharrefreplace')  # replace them with XML character references
print mystring.encode('utf-8', 'backslashreplace')   # replace them with backslash escape sequences
There are lots of issues with unicode if the wrong encodings are used when reading or writing. Make sure that after you get the unicode string, you convert it to the form of unicode that jinja2 expects.
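For example (Python 2; the template text is just an illustration), if you pass unicode in, render() gives unicode back:

import jinja2

template = jinja2.Template(u'Name: {{ name }}')  # template source as unicode
result = template.render(name=u'caf\xe9')        # pass unicode values in ...
assert isinstance(result, unicode)               # ... and get unicode back out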
If this doesn't help, could you please add the second error you see, with perhaps a code snippet to clarify what's going on?
Upvotes: 0
Reputation: 2260
This is probably obvious, but anyway: wfile is an ordinary byte stream, so everything written to it must first be encoded with unicode.encode().
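For example, something along these lines (a sketch with made-up template and handler names, not your actual code):

import BaseHTTPServer
import jinja2

template = jinja2.Template(u'<h1>{{ title }}</h1>')  # placeholder template

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        html = template.render(title=u'Sp\xe4ter')  # render() returns unicode
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.end_headers()
        self.wfile.write(html.encode('utf-8'))      # encode to bytes before writing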
Reading the OP, it is not clear to me what exactly is afoot. However, there are some tricks I have found helpful when debugging encoding problems. I apologize in advance if this is stuff you have long since transcended.
cat -v on a file will display all non-ASCII bytes in an escaped form (e.g. 'M-X'), which is the only fool-proof way I have found to determine what encoding a file really has. UTF-8 non-ASCII characters are multi-byte, which means they show up as sequences of more than one such escaped entry in the cat -v output.
The shell environment (LC_ALL, et al.) is, in my experience, the most common cause of problems. Make sure you have a system that has locales with both UTF-8 and e.g. latin-1 available, and always set your LC_ALL to a locale that explicitly names an encoding, e.g. LC_ALL=sv_SE.iso88591.
In bash and zsh, you can run a command with specific environment changes for that command, like so:
$ LC_ALL=sv_SE.utf8 python ./foo.py
This makes it a lot easier to test than having to export different locales, and you won't pollute the shell.
Don't assume that you have unicode strings internally. Write assert statements that verify that strings are unicode.
assert isinstance(foo, unicode)
Learn to recognize mangled/misrepresented versions of common characters in the encodings you are working with. E.g. '\xe4' is an a-diaeresis ('ä') in latin-1, and 'Ã¤' is the two UTF-8 bytes that make up an a-diaeresis, mistakenly displayed as latin-1. I have found that knowing this sort of gorp cuts debugging time for encoding issues considerably.
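You can reproduce that last mix-up in Python 2 yourself (assuming a UTF-8 terminal):

# -*- coding: utf-8 -*-
# The two UTF-8 bytes of an a-diaeresis ('\xc3\xa4'), decoded as latin-1:
print u'\xe4'.encode('utf-8').decode('latin-1')  # prints the mojibake pair 'Ã¤'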
Upvotes: 3