sautedman

Reputation: 132

Unicode (Cyrillic) character indexing, re-writing in python

I am working with Russian words written in the Cyrillic orthography. Everything works fine, except that many (but not all) of the Cyrillic characters show up as two characters (bytes) when stored in a str. For instance:

>>> print ["ё"]
['\xd1\x91']

This wouldn't be a problem if I didn't want to index string positions or identify where a character is and replace it with another (say "e", without the diaeresis). Obviously, the 2 "characters" are treated as one when prefixed with u, as in u"ё":

>>> print [u"ё"]
[u'\u0451']

But the strs are being passed around as variables, and so can't be prefixed with u, and unicode() gives a UnicodeDecodeError (ascii codec can't decode...).

So... how do I get around this? If it helps, I am using Python 2.7.

Upvotes: 3

Views: 3123

Answers (3)

jfs

Reputation: 414715

To convert bytes into Unicode, you need to know the corresponding character encoding and call bytes.decode:

>>> b'\xd1\x91'.decode('utf-8')
u'\u0451'

The encoding depends on the data source and can be anything. For example, if the data comes from a web page, see "A good way to get the charset/encoding of an HTTP response in Python".

Don't use non-ASCII characters in a bytes literal (it is explicitly forbidden in Python 3). Add from __future__ import unicode_literals to treat all "abc" literals as Unicode literals.

Note: a single user-perceived character may span several Unicode code points, e.g.:

>>> print(u'\u0435\u0308')
ё
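If you need the one-character form so that indexing and replacing behave predictably, normalizing to NFC with the standard unicodedata module is one option. A minimal sketch, assuming the text has already been decoded; the variable names are only illustrative:

```python
import unicodedata

# Decomposed form: CYRILLIC SMALL LETTER IE followed by COMBINING DIAERESIS.
decomposed = u'\u0435\u0308'

# NFC composes the pair into the single code point U+0451 where one exists.
composed = unicodedata.normalize('NFC', decomposed)

print(len(decomposed))  # 2
print(len(composed))    # 1
```

Not every combining sequence has a precomposed form, so this helps for ё but is not a general guarantee of one code point per visible character.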

Upvotes: 1

Aaron Hall

Reputation: 395673

These are actually two different types: the first is a str holding UTF-8 encoded bytes, the second is a unicode object holding a single code point:

>>> print ["ё"]
['\xd1\x91']
>>> print [u"ё"]
[u'\u0451']

What you're seeing is the __repr__ of each list element, not the __str__ version of the objects.

"But the strs are being passed around as variables, and so can't be prefixed with u"

You mean the data are byte strings and need to be converted into the unicode type:

>>> for c in ["ё"]: print repr(c)
...
'\xd1\x91'

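You need to decode the two-byte UTF-8 strings into unicode objects, naming the encoding explicitly (without it, unicode() falls back to ASCII and raises the UnicodeDecodeError you saw):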

>>> for c in ["ё"]: print repr(unicode(c, 'utf-8'))
...
u'\u0451'

After this conversion, each Cyrillic letter is a single element of the unicode object.
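For the replacement you describe, something along these lines might work. It is only a sketch: replace_yo is a hypothetical helper name, and it assumes the incoming byte strings really are UTF-8.

```python
def replace_yo(raw_bytes, encoding='utf-8'):
    """Decode a byte string and replace yo (U+0451/U+0401) with ie (U+0435/U+0415).

    Hypothetical helper; the default encoding is an assumption about the data.
    """
    text = raw_bytes.decode(encoding)
    text = text.replace(u'\u0451', u'\u0435')  # lowercase yo -> lowercase ie
    text = text.replace(u'\u0401', u'\u0415')  # uppercase yo -> uppercase ie
    return text


# UTF-8 bytes for a two-letter word starting with lowercase yo.
word = replace_yo(b'\xd1\x91\xd0\xb6')

# Encode again only when bytes are needed, e.g. when writing to a file.
encoded = word.encode('utf-8')
```

Keeping the data as unicode inside the program and encoding only on output avoids having to reason about byte offsets at all.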

Upvotes: 1

Borealid

Reputation: 98559

There are two possible situations here.

Either your str represents valid UTF-8 encoded data, or it does not.

If it represents valid UTF-8 data, you can convert it to a Unicode object by using mystring.decode('utf-8'). After it's a unicode instance, it will be indexed by character instead of by byte, as you have already noticed.
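To make the byte-versus-character difference concrete, here is a small sketch comparing positions before and after decoding. The sample bytes are my own example, not data from the question:

```python
# UTF-8 bytes for two Cyrillic letters: ie (U+0435) followed by yo (U+0451).
raw = b'\xd0\xb5\xd1\x91'
text = raw.decode('utf-8')

byte_pos = raw.find(b'\xd1\x91')   # offset counted in bytes
char_pos = text.find(u'\u0451')    # offset counted in characters

print(len(raw))   # 4
print(len(text))  # 2
print(byte_pos)   # 2
print(char_pos)   # 1
```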

If it has invalid byte sequences in it... You're in trouble. This is because the question of "which character does this byte represent?" no longer has a clear answer. You're going to have to decide exactly what you mean when you say "the third character" in the presence of byte sequences that don't actually represent a particular Unicode character in UTF-8 at all...

Perhaps the easiest way to work around the issue would be to pass the 'ignore' error handler to decode(), as in mystring.decode('utf-8', 'ignore'). This silently discards invalid byte sequences and gives you only the "correct" portions of the string.
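The second argument to decode() selects the error handler. A short sketch of how the standard handlers differ, using a deliberately broken byte string of my own (0xff can never appear in valid UTF-8):

```python
# Two valid two-byte sequences with a stray invalid byte between them.
bad = b'\xd1\x91\xff\xd0\xb6'

# The default handler ('strict') raises on the invalid byte.
try:
    bad.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' drops the invalid byte silently.
ignored = bad.decode('utf-8', 'ignore')

# 'replace' substitutes U+FFFD, so the damage stays visible.
replaced = bad.decode('utf-8', 'replace')

print(len(ignored))   # 2
print(len(replaced))  # 3
```

Whether 'ignore' or 'replace' is the better choice depends on whether you would rather lose data quietly or keep a marker where something went wrong.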

Upvotes: 2
