supermario

Reputation: 2765

How to get the Unicode representation of Arabic strings in Django?

How can I get the Unicode escape representation of an Arabic string such as سلام in Python?

The result should be \u0633\u0644\u0627\u0645

I need this so that I can compare text retrieved from a MySQL database with data stored in a Redis cache.

Upvotes: 4

Views: 2554

Answers (4)

PM 2Ring

Reputation: 55489

Assuming you have an actual Unicode string, you can do

# -*- coding: utf-8 -*-
s = u'سلام'
print s.encode('unicode-escape')

output

\u0633\u0644\u0627\u0645

The # -*- coding: utf-8 -*- directive only tells the interpreter that the source file is UTF-8 encoded; it has no bearing on how the script itself handles Unicode at run time.
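For readers on Python 3, here is a rough equivalent of the snippet above. It is only a sketch and assumes Python 3, where every str is already Unicode text and encode() returns bytes, so one extra decode step is needed to get a printable string. The variable names escaped and restored are just illustrative.

```python
# Assumes Python 3: str is Unicode text, so no u prefix is needed.
s = 'سلام'

# encode() returns ASCII-only bytes; decode them to get an ordinary str.
escaped = s.encode('unicode-escape').decode('ascii')
print(escaped)

# Reversing the escape recovers the original text.
restored = escaped.encode('ascii').decode('unicode-escape')
print(restored == s)
```

The round trip at the end is a quick way to check that nothing was lost when escaping.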


If your script is reading that Arabic string from a UTF-8 encoded source, the bytes will look like this:

\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85

You can convert that to Unicode like this:

data = '\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
s = data.decode('utf8')
print s
print s.encode('unicode-escape')

output

سلام
\u0633\u0644\u0627\u0645

Of course, you do need to make sure that your terminal is set up to handle Unicode properly.
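The same decode step might look like this on Python 3. This is a sketch under the assumption that you are on Python 3, where a byte string literal needs the b prefix and decode() yields a str. Only the escaped form is printed, so the result does not depend on the terminal being able to render Arabic.

```python
# Assumes Python 3: bytes literals need the b prefix.
data = b'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'

# Decode the UTF-8 bytes into Unicode text.
s = data.decode('utf-8')

# 8 bytes of UTF-8 become 4 characters of text.
print(len(data), len(s))

# The escaped form is pure ASCII, so it prints safely anywhere.
escaped = s.encode('unicode-escape').decode('ascii')
print(escaped)
```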

Note that

'\u0633\u0644\u0627\u0645'

is a plain (byte) string containing 24 bytes, whereas

u'\u0633\u0644\u0627\u0645'

is a Unicode string containing 4 Unicode characters.
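The difference is easy to check with len(). The sketch below assumes Python 3, where an unprefixed literal would already interpret the \u escapes, so a raw string is used to stand in for the Python 2 byte string; the names literal and text are illustrative.

```python
# Assumes Python 3. The raw string keeps each backslash literal,
# mimicking the Python 2 byte string '\u0633\u0644\u0627\u0645'.
literal = r'\u0633\u0644\u0627\u0645'

# Without the r prefix, each \uXXXX escape becomes one character.
text = '\u0633\u0644\u0627\u0645'

print(len(literal))  # 24
print(len(text))     # 4
```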

You may find this article helpful: Pragmatic Unicode, which was written by SO veteran Ned Batchelder.

Upvotes: 3

JClarke

Reputation: 808

Prefix the string literal with u in Python 2.x to make it a unicode string. You can then call its encode method to get the UTF-8 bytes.

arabic_string = u'سلام'
arabic_string.encode('utf-8')

Output in the interactive interpreter:

>>> arabic_string.encode('utf-8')
'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'

Upvotes: 0

cwallenpoole

Reputation: 82058

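Since you're using Python 2.x, you can't call encode directly on a byte string that contains non-ASCII data: Python 2 first tries to decode it as ASCII, which fails. Use the unicode function instead to decode the byte string into a unicode object.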

> f='سلام'
> f
'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
> unicode(f, 'utf-8') # note: you need to pass the encoding parameter in or you'll 
                      # keep having the same problem.
u'\u0633\u0644\u0627\u0645'
> print unicode(f, 'utf-8')
سلام

I'm not sure what library you're using to fetch the content, but you might be able to fetch the data as unicode initially.

> f = u'سلام'
> f
u'\u0633\u0644\u0627\u0645'
> print f.encode('unicode-escape')
\u0633\u0644\u0627\u0645
> print f
سلام
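For the comparison the question is really after, one option is to normalize both values to text before comparing. The sketch below assumes Python 3, and it assumes the cache hands back UTF-8 bytes while the database driver hands back text; that depends on how your clients are configured. to_text is a hypothetical helper, and from_db and from_cache are stand-in values rather than real query results.

```python
def to_text(value, encoding='utf-8'):
    """Return value as str, decoding it first if it is bytes.

    Hypothetical helper for illustration; the encoding is an assumption.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Stand-in values: text as a DB driver might return it,
# and UTF-8 bytes as a cache client might return them.
from_db = '\u0633\u0644\u0627\u0645'
from_cache = b'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'

# Comparing text with bytes directly never matches in Python 3.
print(from_db == from_cache)                    # False

# After normalizing both sides to text, the values compare equal.
print(to_text(from_db) == to_text(from_cache))  # True
```

Comparing decoded text is usually simpler than comparing escaped forms, since the \u escapes are only a display format for the same characters.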

Upvotes: 1

Navidad20

Reputation: 832

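For Python 2.7, assuming the byte string is UTF-8 encoded:

string = 'سلام'
# Pass the encoding explicitly: unicode(string) alone assumes ASCII and raises UnicodeDecodeError.
new_string = unicode(string, 'utf-8')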

Upvotes: 0
