Reputation: 2953
I'm trying to understand the difference in a bit of Python script behavior when run on the command line vs run as part of an Emacs elisp function.
The script looks like this (I'm using Python 2.7.1 BTW):
import json; t = {"Foo":"ザ"}; print json.dumps(t).decode("unicode_escape")
that is, [in general] take a JSON segment containing unicode characters, dump it to its unicode-escaped string form, then decode that back to its unicode representation. When run on the command line, the dumps part of this returns:
'{"Foo": "\\u30b6"}'
which when printed looks like:
'{"Foo": "\u30b6"}'
the decode part of this looks like:
u'{"Foo": "\u30b6"}'
which when printed looks like:
{"Foo": "ザ"}
i.e., the original string representation of the structure, at least in a terminal/console that supports unicode (in my testbed, an xterm). In a Windows console, the output is not correct with respect to the unicode character, but the script does not error out.
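For reference, here is the same sequence as a minimal standalone sketch (Python 2, run from a UTF-8 xterm rather than via Emacs); the comments show the intermediate values I see:
# -*- coding: utf-8 -*-
import json

t = {"Foo": "ザ"}                           # value is a UTF-8 byte string
dumped = json.dumps(t)                      # '{"Foo": "\\u30b6"}'
decoded = dumped.decode("unicode_escape")   # u'{"Foo": "\u30b6"}'
print decoded                               # {"Foo": "ザ"} in a unicode-capable terminal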
In Emacs, the dumps conversion is the same as on the command line (at least as far as confirming with a print), but the decode part blows out with the dreaded:
File "", line 1, in UnicodeEncodeError: 'ascii' codec can't encode character u'\u30b6' in position 9: ordinal not in range(128)`
I've a feeling I'm missing something basic here with respect to either the script or Emacs (in my testbed, Emacs 23.1.1). Is there some auto-magic part of print that invokes the correct codec/locale at the command line but not in Emacs? I've tried explicitly setting the locale for an Emacs invocation (here's a stub test without the json logic):
"LC_ALL=\"en_US.UTF-8\" python -c 's = u\"Fooザ\"; print s'"
produces the same exception, while
"LC_ALL=\"en_US.UTF-8\" python -c 'import sys; enc=sys.stdout.encoding; print enc' "
indicates that the encoding is 'None'.
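As a minimal sketch of the detection I suspect is failing (Python 2; isatty and the detected encoding differ between a terminal and a pipe):
import sys

print sys.stdout.isatty()   # True in an xterm, False when stdout is a pipe (e.g. back to Emacs)
print sys.stdout.encoding   # e.g. 'UTF-8' in an xterm, None when no encoding is detected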
If I attempt to coerce the conversion using:
"LC_ALL=\"en_US.UTF-8\" python -c 's = u\"Fooザ\"; print s.encode(\"utf8\",\"replace\")'"
the error goes away, but the result is the "garbled" version of the string seen in the non-unicode console:
Fooa?¶
Any ideas?
UPDATE: thanks to unutbu -- because the locale identification falls down, the command needs to be explicitly decorated with a utf8 encode (see the answer for working directly with a unicode string). In my case, I am getting what is needed from the dumps/decode sequence, so I add the required decoration to achieve the desired result:
import json; t = {"Foo":"ザ"}; print json.dumps(t).decode("unicode_escape").encode("utf8","replace")
Note that this is the "raw" Python without the necessary escaping required by Emacs.
As you may have guessed from looking at the original part of this question, I'm using this as part of some JSON formatting logic in Emacs -- see my answer to this question.
Upvotes: 5
Views: 925
Reputation: 880717
The Python wiki page "PrintFails" says:
When Python does not detect the desired character set of the output, it sets sys.stdout.encoding to None, and print will invoke the "ascii" codec.
It appears that when Python is run from an elisp function, it cannot detect the desired character set, so it defaults to "ascii". Trying to print unicode then tacitly causes Python to encode the unicode as ascii, which is the reason for the error.
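As a side note, here is a hedged sketch in plain Python 2 (separate from the elisp command below): when no encoding was detected, you can wrap sys.stdout in a UTF-8 writer so that printing unicode encodes to UTF-8 instead of invoking the ascii codec:
import codecs
import sys

# When detection fails, sys.stdout.encoding is None; wrap stdout so that
# printing unicode encodes to UTF-8 rather than falling back to ascii.
if sys.stdout.encoding is None:
    sys.stdout = codecs.getwriter("utf-8")(sys.stdout)

s = u"Foo\u30b6"
print s   # emits UTF-8 bytes; no UnicodeEncodeError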
Replacing u\"Fooザ\"
with u\"Foo\\u30b6\"
seems to work:
(defun mytest ()
  (interactive)
  (shell-command-on-region
   (point) (point)
   "LC_ALL=\"en_US.UTF-8\" python -c 's = u\"Foo\\u30b6\"; print s.encode(\"utf8\",\"replace\")'"
   nil t))
C-x C-e M-x mytest
yields
Fooザ
Upvotes: 3