YKY

Reputation: 2693

Python and the Cyrillic alphabet

I am working with an Excel table using pandas in IPython. The table contains Cyrillic words. When I try to munge the data, I always get strings of escape codes instead of words. It looks like this:

In [16]: report_sorted_geo['country'].unique()
Out[16]: 
array([u'\u0410\u0431\u0445\u0430\u0437\u0438\u044f',
       u'\u0410\u0437\u0435\u0440\u0431\u0430\u0439\u0434\u0436\u0430\u043d',
       u'\u0410\u0440\u043c\u0435\u043d\u0438\u044f',
       u'\u0411\u0435\u043b\u0430\u0440\u0443\u0441\u044c',
       u'\u0412\u044c\u0435\u0442\u043d\u0430\u043c',
       u'\u0413\u0432\u0430\u0442\u0435\u043c\u0430\u043b\u0430',
       u'\u0413\u0435\u0440\u043c\u0430\u043d\u0438\u044f',
       u'\u0413\u043e\u043d\u043a\u043e\u043d\u0433',
       u'\u0413\u0440\u0443\u0437\u0438\u044f',

Is there a fix to this?

When I simply print something, the output is fine:

In [17]: print "привет"
привет

Does anyone know how to fix this?

Upvotes: 1

Views: 8490

Answers (2)

PM 2Ring

Reputation: 55499

Here's one way to convert your list of strings into a more readable form in Python 2. This code explicitly encodes the Unicode data as UTF-8 bytes.

#!/usr/bin/env python

data = [
    u'\u0410\u0431\u0445\u0430\u0437\u0438\u044f',
    u'\u0410\u0437\u0435\u0440\u0431\u0430\u0439\u0434\u0436\u0430\u043d',
    u'\u0410\u0440\u043c\u0435\u043d\u0438\u044f',
    u'\u0411\u0435\u043b\u0430\u0440\u0443\u0441\u044c',
    u'\u0412\u044c\u0435\u0442\u043d\u0430\u043c',
    u'\u0413\u0432\u0430\u0442\u0435\u043c\u0430\u043b\u0430',
    u'\u0413\u0435\u0440\u043c\u0430\u043d\u0438\u044f',
    u'\u0413\u043e\u043d\u043a\u043e\u043d\u0433',
    u'\u0413\u0440\u0443\u0437\u0438\u044f',
]

def list_to_utf8(seq):
    t = ["    u'%s'" % s.encode('utf-8') for s in seq]
    return '[\n' + ',\n'.join(t) + '\n]'

print list_to_utf8(data)

output

[
    u'Абхазия',
    u'Азербайджан',
    u'Армения',
    u'Беларусь',
    u'Вьетнам',
    u'Гватемала',
    u'Германия',
    u'Гонконг',
    u'Грузия'
]

To use this data in your Python code, you must put a valid UTF-8 encoding declaration at the top of the script, and you must also tell your text editor to save the file with UTF-8 encoding.

test

#!/usr/bin/env python
# -*- coding: utf_8 -*- 

data = [
    u'\u0410\u0431\u0445\u0430\u0437\u0438\u044f',
    u'\u0410\u0437\u0435\u0440\u0431\u0430\u0439\u0434\u0436\u0430\u043d',
    u'\u0410\u0440\u043c\u0435\u043d\u0438\u044f',
    u'\u0411\u0435\u043b\u0430\u0440\u0443\u0441\u044c',
    u'\u0412\u044c\u0435\u0442\u043d\u0430\u043c',
    u'\u0413\u0432\u0430\u0442\u0435\u043c\u0430\u043b\u0430',
    u'\u0413\u0435\u0440\u043c\u0430\u043d\u0438\u044f',
    u'\u0413\u043e\u043d\u043a\u043e\u043d\u0433',
    u'\u0413\u0440\u0443\u0437\u0438\u044f',
]

newdata = [
    u'Абхазия',
    u'Азербайджан',
    u'Армения',
    u'Беларусь',
    u'Вьетнам',
    u'Гватемала',
    u'Германия',
    u'Гонконг',
    u'Грузия'
]

for s1, s2 in zip(data, newdata):
    print s1 == s2, s1, s2

output

True Абхазия Абхазия
True Азербайджан Азербайджан
True Армения Армения
True Беларусь Беларусь
True Вьетнам Вьетнам
True Гватемала Гватемала
True Германия Германия
True Гонконг Гонконг
True Грузия Грузия
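
As a side note on why the question shows escapes at all: in Python 2, displaying a list or array shows the repr() of each element, and the repr() of a unicode string escapes every non-ASCII character, whereas print writes the readable text. The sketch below only illustrates that the two spellings hold the same characters. The helper names `escaped_form` and `readable_form` are hypothetical (they are not part of pandas or NumPy), and the code is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustration only: the helper names below are made up for this sketch.

def escaped_form(text):
    """Return the backslash-escaped ASCII spelling of a text string."""
    return text.encode('unicode_escape').decode('ascii')

def readable_form(escaped):
    """Turn a backslash-escaped ASCII spelling back into the text string."""
    return escaped.encode('ascii').decode('unicode_escape')

# One of the country names from the question, written with escapes.
name = u'\u0410\u0440\u043c\u0435\u043d\u0438\u044f'

escaped = escaped_form(name)        # ASCII-only text with literal backslashes
restored = readable_form(escaped)   # the original seven-character string
```

Nothing is lost or corrupted in the escaped display; it is the same data shown in a different notation, which is why the comparison in the test above prints True for every pair.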

Upvotes: 3

xnx

Reputation: 25550

Since your terminal supports it, why not just print each element of your array (called a here) and let Python take care of the glyphs:

In [49]: for e in a:
   ....:     print e
   ....:     
Абхазия
Азербайджан
Армения
Беларусь
Вьетнам
Гватемала
Германия
Гонконг
Грузия
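
A variation on the same idea, as a minimal sketch: join the elements into a single text string first, so that one print call shows everything. The name `one_per_line` is a hypothetical helper, and the short list `a` is only a stand-in for the array from the question; the sketch assumes every element is a text (unicode) string.

```python
# -*- coding: utf-8 -*-
# Illustration only: one_per_line is a made-up helper name.

def one_per_line(values):
    """Join text values with newlines so a single print shows them all."""
    return u'\n'.join(values)

# Stand-in for the array in the question (assumption: all elements are text).
a = [
    u'\u0410\u0431\u0445\u0430\u0437\u0438\u044f',
    u'\u0410\u0440\u043c\u0435\u043d\u0438\u044f',
]

text = one_per_line(a)
```

Printing `text` then goes through print rather than repr(), so a terminal that can display Cyrillic should show one readable name per line.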

Upvotes: 1
