Reputation: 2726
My code:
a = '汉'
b = u'汉'
These two are the same Chinese character, but a == b is False. How do I fix this? Note that I can't convert a to utf-8 because I have no access to that code. I need to convert b to the encoding that a is using.
So my question is: how do I turn the encoding of b into that of a?
Upvotes: 0
Views: 1181
Reputation: 2764
If you don't know a's encoding, you'll need to:
1. Detect a's encoding
2. Encode b using the detected encoding
First, to detect a's encoding, let's use chardet.
$ pip install chardet
Now let's use it:
>>> import chardet
>>> a = '汉'
>>> chardet.detect(a)
{'confidence': 0.505, 'encoding': 'utf-8'}
So, to actually accomplish what you requested:
>>> encoding = chardet.detect(a)['encoding']
>>> b = u'汉'
>>> b_encoded = b.encode(encoding)
>>> a == b_encoded
True
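If chardet is not an option, and you already know that a and b are meant to be the same text, a cruder alternative is to try a handful of likely encodings and keep the one that reproduces a's bytes. This is only a sketch and is not how chardet works; find_matching_encoding and its candidate list are hypothetical names chosen for illustration, and the byte literals stand in for a so the snippet does not depend on the source file's encoding.

```python
def find_matching_encoding(raw, text, candidates=('utf-8', 'gbk', 'big5', 'utf-16')):
    """Return the first candidate encoding under which `text` encodes to `raw`.

    Hypothetical helper for illustration; returns None if nothing matches.
    """
    for name in candidates:
        try:
            if text.encode(name) == raw:
                return name
        except UnicodeEncodeError:
            # `text` cannot be represented in this encoding; try the next one.
            continue
    return None


b = u'\u6c49'              # the character from the question
a_utf8 = b'\xe6\xb1\x89'   # its UTF-8 bytes
a_gbk = b'\xba\xba'        # its GBK bytes

print(find_matching_encoding(a_utf8, b))
print(find_matching_encoding(a_gbk, b))
```

The order of the candidates matters if more than one encoding happens to produce the same bytes, so list the most plausible ones first.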
Upvotes: 3
Reputation: 34047
Both a.decode and b.encode are OK:
In [133]: a.decode('utf') == b
Out[133]: True
In [134]: b.encode('utf') == a
Out[134]: True
Note that str.encode and unicode.decode are also available; don't mix them up. See "What is the difference between encode/decode?"
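To keep the two directions straight, here is a minimal sketch that spells out the byte values explicitly, assuming a holds UTF-8 bytes. The names raw and text are illustrative stand-ins for a and b.

```python
raw = b'\xe6\xb1\x89'   # encoded bytes, like `a` (assumed to be UTF-8)
text = u'\u6c49'        # Unicode text, like `b`

decoded = raw.decode('utf-8')    # bytes -> text
encoded = text.encode('utf-8')   # text -> bytes

print(decoded == text)
print(encoded == raw)
```

Decoding goes from bytes to text and encoding goes from text to bytes. In Python 3 the mix-up can no longer happen, because bytes only has decode and str only has encode.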
Upvotes: -1
Reputation: 369344
Decode the encoded string a using str.decode:
>>> a = '汉'
>>> b = u'汉'
>>> a.decode('utf-8') == b
True
NOTE: Replace utf-8 with whatever encoding the source file actually uses.
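The reason the encoding name matters: the same character maps to different bytes under different encodings, so decoding with the wrong name either fails or yields different text. A small sketch, using GBK purely as an example of a second encoding (the question does not say which one the file uses):

```python
text = u'\u6c49'   # the character from the question

utf8_bytes = text.encode('utf-8')   # three bytes
gbk_bytes = text.encode('gbk')      # two bytes

# The byte strings differ, so a comparison across encodings fails.
print(utf8_bytes == gbk_bytes)

# Decoding with the matching encoding recovers the original text.
print(gbk_bytes.decode('gbk') == text)

# Decoding with the wrong encoding raises an error here.
try:
    gbk_bytes.decode('utf-8')
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
print(wrong_codec_failed)
```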
Upvotes: 1