Reputation: 131228
I have two variables (let's say x and y) that have the following values:
x = u'Ko\u0161ick\xfd'
y = 'Ko\x9aick\xfd'
They presumably encode the same name, but in different ways. The first variable is unicode and the second one is a byte string.
Is there a way to transform the string into unicode (or unicode into a string) and check whether they are really the same?
I tried to use encode:
x.encode('utf-8')
It returns something new (the third version):
'Ko\xc5\xa1ick\xc3\xbd'
And using the following:
print x.encode('utf-8')
returns yet another version:
KošickÛ
So, I am totally confused. Is there a way to keep everything in the same format?
Upvotes: 0
Views: 3580
Reputation: 178031
You need to know the encoding of the byte string. It looks like windows-1252:
x = u'Ko\u0161ick\xfd'
y = 'Ko\x9aick\xfd'
print x == y.decode('windows-1252')
print x.encode('windows-1252') == y
Output:
True
True
Best practice is to convert text to Unicode on input to the program, do all the processing in Unicode, and convert back to encoded bytes to persist to storage, transmit on a socket, etc.
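A minimal sketch of that decode-on-input / encode-on-output pattern, using the byte values from the question (Python 3 syntax; in Python 2 the same `decode`/`encode` calls apply to `str` and `unicode`):

```python
raw = b'Ko\x9aick\xfd'             # bytes as received (windows-1252 encoded)

text = raw.decode('windows-1252')  # input boundary: bytes -> Unicode
assert text == u'Ko\u0161ick\xfd'  # now comparable to any other Unicode text

out = text.encode('utf-8')         # output boundary: Unicode -> bytes
assert out == b'Ko\xc5\xa1ick\xc3\xbd'
```

All processing in between happens on `text`, so comparisons never mix bytes with Unicode.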
Upvotes: 1
Reputation: 189789
You can convert a byte string to Unicode, but if it contains any non-ASCII characters, you have to specify the encoding.
if y.decode('iso-8859-1') == x:
    print(u'{0!r} converted to Unicode == {1}'.format(y, x))
With your given example, this is not true; but perhaps y is in a different encoding.
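One way to find out is to loop over a few candidate encodings and see which one makes the comparison succeed (a sketch in Python 3 syntax; the candidate list is just a guess at likely encodings):

```python
x = u'Ko\u0161ick\xfd'   # Unicode text
y = b'Ko\x9aick\xfd'     # byte string of unknown encoding

# Decode y with each candidate and compare the result against x.
for enc in ('iso-8859-1', 'windows-1252', 'utf-8'):
    try:
        print(enc, y.decode(enc) == x)
    except UnicodeDecodeError:
        print(enc, 'cannot decode')
```

Only windows-1252 prints True here: iso-8859-1 decodes without error but maps 0x9A to a control character rather than š, and utf-8 rejects the byte sequence outright.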
In theory, you could convert either way, but generally, it makes sense to use all-Unicode internally, and convert other encodings to Unicode for use in your code (not the other way around).
Upvotes: 2
Reputation: 149145
Well, utf-8 is now the de facto standard for interchange, at least in the Linux world, but there are plenty of other encodings. Common examples are latin1, latin9 (the same but with the € symbol), and cp1252, a Windows variant of them.
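The practical difference shows up in the 0x80–0x9F range, which latin1 maps to invisible control characters while cp1252 reuses it for printable characters (a small Python 3 illustration):

```python
b = b'\x9a'
print(repr(b.decode('latin1')))   # U+009A, a control character
print(repr(b.decode('cp1252')))   # 'š', U+0161 LATIN SMALL LETTER S WITH CARON
```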
In your case:
>>> x.encode('cp1252')
'Ko\x9aick\xfd'
So the y string seems to be cp1252 encoded.
Upvotes: 0