Ernest A

Reputation: 7839

Why does Emacs get my literal Unicode strings wrong?

As far as I know, these should be equivalent in a system that uses UTF-8 as the default encoding:

pattern1 = 'Wörterbuch Wortformen'.decode('utf8')
pattern2 = u'Wörterbuch Wortformen'

However, when I send these lines from an Emacs buffer to the Python process (M-x python-shell-send-region), something strange happens:

>>> pattern1
u'W\xf6rterbuch Wortformen'
>>> pattern2
u'W\xc3\xb6rterbuch Wortformen'

In a Python shell run in a terminal, both lines result in u'W\xf6rterbuch Wortformen'.

What is going on here?

My locale is configured to use UTF-8.

Upvotes: 1

Views: 375

Answers (2)

Ernest A

Reputation: 7839

It turns out that it was a bug in python.el.

Upvotes: 1

user797257


Here's what I did (it might prove helpful later):

  1. Created a file in a single-byte encoding, say /tmp/test.dat, and opened it in Emacs using hexl-mode.

  2. Using the hexl-insert-hex-char command, inserted the bytes C3 and B6.

  3. Opened this file as text (using text-mode). Emacs recognized it as a file with a multibyte encoding and displayed ö in place of the two bytes.
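The same experiment can be approximated outside Emacs. The following is a minimal sketch, assuming a throwaway temporary file instead of /tmp/test.dat: it writes the two bytes C3 and B6 and reads them back under two different encodings. The variable names are made up for illustration.

```python
import io
import os
import tempfile

# Write the raw bytes C3 B6 to a temporary file (stand-in for /tmp/test.dat).
fd, path = tempfile.mkstemp(suffix='.dat')
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(b'\xc3\xb6')

    # Read as UTF-8: the two bytes form one character, U+00F6.
    with io.open(path, encoding='utf-8') as f:
        as_utf8 = f.read()

    # Read as Latin-1: each byte becomes its own character, U+00C3 and U+00B6.
    with io.open(path, encoding='latin-1') as f:
        as_latin1 = f.read()
finally:
    os.remove(path)

print(repr(as_utf8), len(as_utf8))
print(repr(as_latin1), len(as_latin1))
```

Read as UTF-8 the file holds a single character; read as Latin-1 it holds two, which is the same pair of code points that shows up in pattern2 in the question.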


Conclusion: the buffer that contains the source code must use the utf-8 coding system for Emacs to send two bytes (C3 B6) for ö. If the buffer instead uses a single-byte encoding that maps the byte F6 to ö (Latin-1, for example), Emacs will send that single byte.
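The byte-level difference can be checked directly in Python. This is only an illustrative sketch of the encoding mismatch, not a description of what python.el does internally. It encodes the string from the question both ways and then decodes the UTF-8 bytes with the wrong codec.

```python
# -*- coding: utf-8 -*-
text = u'W\xf6rterbuch Wortformen'  # same string as in the question, with ö escaped

utf8_bytes = text.encode('utf-8')      # ö becomes the two bytes C3 B6
latin1_bytes = text.encode('latin-1')  # ö becomes the single byte F6

# Decoding the UTF-8 bytes with the matching codec round-trips cleanly.
correct = utf8_bytes.decode('utf-8')

# Decoding the same bytes as Latin-1 turns each byte into its own character.
mojibake = utf8_bytes.decode('latin-1')

print(repr(correct))
print(repr(mojibake))
```

The mis-decoded value contains \xc3\xb6 in place of \xf6, matching what pattern2 showed when sent from Emacs, so the symptom is consistent with UTF-8 bytes being interpreted one byte per character somewhere along the way.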

PS. Make sure you have a -*- coding: utf-8 -*- comment on the first or second line of the file.

Upvotes: 1
