Reputation: 2765
I'm wondering how to get the Unicode escape representation of an Arabic string such as سلام in Python.
The result should be \u0633\u0644\u0627\u0645
I need this so that I can compare text retrieved from a MySQL database with data stored in a Redis cache.
Upvotes: 4
Views: 2554
Reputation: 55489
Assuming you have an actual Unicode string, you can do
# -*- coding: utf-8 -*-
s = u'سلام'
print s.encode('unicode-escape')
output
\u0633\u0644\u0627\u0645
The # -*- coding: utf-8 -*-
directive only tells the interpreter that the source file is UTF-8 encoded; it has no bearing on how the script itself handles Unicode.
If your script is reading that Arabic string from a UTF-8 encoded source, the bytes will look like this:
\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85
You can convert that to Unicode like this:
data = '\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
s = data.decode('utf8')
print s
print s.encode('unicode-escape')
output
سلام
\u0633\u0644\u0627\u0645
Of course, you do need to make sure that your terminal is set up to handle Unicode properly.
Note that
'\u0633\u0644\u0627\u0645'
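is a plain (byte) string containing 24 bytes (Python 2 does not interpret \u escapes in byte-string literals, so each one stays as 6 literal characters), whereas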
u'\u0633\u0644\u0627\u0645'
is a Unicode string containing 4 Unicode characters.
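The difference in length can be checked directly. This is a minimal sketch, written on the assumption that it should run on both Python 2 and Python 3, which is why the byte string uses a b prefix and doubled backslashes rather than the bare literal shown above. The variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
# Escaped form as raw bytes: the backslashes are doubled so that \u is NOT
# treated as an escape, which means each "\uXXXX" occupies 6 bytes.
escaped = b'\\u0633\\u0644\\u0627\\u0645'

# Real Unicode string: here each \uXXXX escape is a single code point.
text = u'\u0633\u0644\u0627\u0645'

print(len(escaped))  # 24
print(len(text))     # 4

# Decoding with 'unicode-escape' turns the escaped bytes back into text.
restored = escaped.decode('unicode-escape')
print(restored == text)  # True
```

So if one side of a comparison holds the escaped form, decoding it with 'unicode-escape' should bring it back to the same Unicode string.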
You may find this article helpful: Pragmatic Unicode, which was written by SO veteran Ned Batchelder.
Upvotes: 3
Reputation: 808
Prefix your string literal with u
in Python 2.x, which makes it a unicode string. Then you can call the encode
method on it.
arabic_string = u'سلام'
arabic_string.encode('utf-8')
Output:
print repr(arabic_string.encode('utf-8'))
'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
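Note that 'utf-8' produces the raw UTF-8 bytes, not the \uXXXX form asked for in the question. A small sketch putting the two encodings side by side, assuming it should run on both Python 2 and Python 3 (hence print with parentheses and no expected output shown, since the repr differs between the two):

```python
# -*- coding: utf-8 -*-
arabic_string = u'\u0633\u0644\u0627\u0645'  # same text as u'سلام'

# UTF-8 bytes: what a UTF-8 encoded file or store would typically hold.
utf8_bytes = arabic_string.encode('utf-8')

# ASCII-only \uXXXX escapes: the form requested in the question.
escaped = arabic_string.encode('unicode-escape')

print(repr(utf8_bytes))
print(repr(escaped))
```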
Upvotes: 0
Reputation: 82058
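Since you're using Python 2.x, you can't call encode
directly on a byte string that contains non-ASCII bytes: Python first tries to decode it as ASCII and raises UnicodeDecodeError. You'll need to use the unicode
function to decode the string into a unicode object.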
> f='سلام'
> f
'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
> unicode(f, 'utf-8') # note: you need to pass the encoding parameter in or you'll
# keep having the same problem.
u'\u0633\u0644\u0627\u0645'
> print unicode(f, 'utf-8')
سلام
I'm not sure what library you're using to fetch the content, but you may be able to have it return the data as unicode in the first place.
> f = u'سلام'
> f
u'\u0633\u0644\u0627\u0645'
> print f.encode('unicode-escape')
\u0633\u0644\u0627\u0645
> print f
سلام
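For the comparison described in the question, one option is to bring both values to Unicode before comparing them. The following is only a sketch: to_unicode is a hypothetical helper, the two sample values stand in for whatever the MySQL and Redis clients actually return, and it assumes both stores hold UTF-8. It is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-

def to_unicode(value, encoding='utf-8'):
    """Return value as a Unicode string, decoding it if it is a byte string."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value

# Stand-ins for the values fetched from the two stores (hypothetical data).
from_mysql = u'\u0633\u0644\u0627\u0645'          # already a Unicode string
from_redis = b'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'  # UTF-8 encoded bytes

same = to_unicode(from_mysql) == to_unicode(from_redis)
print(same)  # True
```

Comparing decoded Unicode strings avoids depending on which byte encoding or escaped form each side happens to use.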
Upvotes: 1