Roman

Reputation: 131228

How to compare unicode and string in Python?

I have two variables (let's say x and y) that have the following values:

x = u'Ko\u0161ick\xfd'
y = 'Ko\x9aick\xfd'

They presumably encode the same name, but in different ways. The first variable is a unicode object and the second is a byte string.

Is there a way to transform the string into unicode (or the unicode into a string) and check whether they are really the same?

I tried using encode:

x.encode('utf-8')

It returns something new (the third version):

'Ko\xc5\xa1ick\xc3\xbd'

And using the following:

print x.encode('utf-8')

returns yet another version:

KošickÛ

So, I am totally confused. Is there a way to keep everything in the same format?

Upvotes: 0

Views: 3580

Answers (3)

Mark Tolonen

Reputation: 178031

You need to know the encoding of the byte string. It looks like windows-1252:

x = u'Ko\u0161ick\xfd'
y = 'Ko\x9aick\xfd'

print x == y.decode('windows-1252')
print x.encode('windows-1252') == y

Output:

True
True

Best practice is to convert text to Unicode on input to the program, do all the processing in Unicode, and convert back to encoded bytes to persist to storage, transmit on a socket, etc.
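
As a minimal sketch of that decode-early, encode-late pattern (the windows-1252 input and UTF-8 output encodings are assumptions for illustration; `b''` and `u''` literals are used so the snippet runs on both Python 2 and 3):

```python
# Bytes as they might arrive from a file or socket.
# Assumption: the source is known to be windows-1252.
raw = b'Ko\x9aick\xfd'

# 1. Decode once, at the input boundary.
text = raw.decode('windows-1252')

# 2. Do all processing on Unicode text.
matches = (text == u'Ko\u0161ick\xfd')
upper = text.upper()

# 3. Encode once, at the output boundary (UTF-8 chosen here as an example).
out = upper.encode('utf-8')
```

Here `matches` is `True`, and only `raw` and `out` are byte strings; everything in between is Unicode.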

Upvotes: 1

tripleee

Reputation: 189789

You can convert a byte string to Unicode, but if it contains any non-ASCII characters, you have to specify the encoding.

if y.decode('iso-8859-1') == x:
    print(u'{0!r} converted to Unicode == {1}'.format(y, x))

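With your given example, the comparison is false, because ISO-8859-1 maps the byte \x9a to the control character U+009A rather than to š (U+0161); so y is probably in a different encoding.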

In theory, you could convert either way, but generally, it makes sense to use all-Unicode internally, and convert other encodings to Unicode for use in your code (not the other way around).
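
If you don't know the encoding, one rough way to narrow it down is to try a few candidates and see which ones make the bytes decode to the known Unicode value. This is only a sketch: `matching_encodings` is a hypothetical helper and the candidate list is my own guess, not anything from the question.

```python
x = u'Ko\u0161ick\xfd'
y = b'Ko\x9aick\xfd'

def matching_encodings(text, raw, candidates):
    """Return the candidate encodings under which raw decodes to text."""
    matches = []
    for name in candidates:
        try:
            if raw.decode(name) == text:
                matches.append(name)
        except UnicodeDecodeError:
            # The bytes are not valid in this encoding at all.
            pass
    return matches

# Hypothetical candidate list for illustration.
candidates = ['utf-8', 'iso-8859-1', 'iso-8859-2',
              'windows-1250', 'windows-1252']
result = matching_encodings(x, y, candidates)
```

More than one candidate can match (single-byte Windows code pages overlap in many positions), so a match tells you an encoding is consistent with the data, not that it is the one actually used.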

Upvotes: 2

Serge Ballesta

Reputation: 149145

Well, utf-8 is now the de facto standard for interchange and in the Linux world, but there are plenty of other encodings.

Common examples are latin1, latin9 (latin1 with the € symbol added), and cp1252, a Windows variant of them.

In your case:

>>> x.encode('cp1252')
'Ko\x9aick\xfd'

So the y string seems to be cp1252 encoded.
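
To illustrate how the encodings mentioned above differ, here is a small sketch that decodes the same bytes with each of them (codec names are the standard Python aliases; the variable names are just for illustration):

```python
# The byte from y that causes the trouble.
raw = b'\x9a'

as_latin1 = raw.decode('iso-8859-1')    # C1 control character U+009A
as_latin9 = raw.decode('iso-8859-15')   # also U+009A
as_cp1252 = raw.decode('cp1252')        # U+0161, LATIN SMALL LETTER S WITH CARON

# The euro sign sits at different byte values, or is absent.
euro_cp1252 = b'\x80'.decode('cp1252')            # U+20AC
euro_latin9 = b'\xa4'.decode('iso-8859-15')       # U+20AC
currency_latin1 = b'\xa4'.decode('iso-8859-1')    # U+00A4, generic currency sign
```

Only cp1252 turns \x9a into š, which is why it fits your y while latin1 and latin9 do not.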

Upvotes: 0
