Python and the Cyrillic alphabet

Question

I am working with an Excel table using pandas in IPython. The table contains Cyrillic words. Whenever I try to munge the data, I get strings of numeric codes instead of words. It looks like this:

In [16]: report_sorted_geo['country'].unique()
Out[16]: 
array([u'\u0410\u0431\u0445\u0430\u0437\u0438\u044f',
       u'\u0410\u0437\u0435\u0440\u0431\u0430\u0439\u0434\u0436\u0430\u043d',
       u'\u0410\u0440\u043c\u0435\u043d\u0438\u044f',
       u'\u0411\u0435\u043b\u0430\u0440\u0443\u0441\u044c',
       u'\u0412\u044c\u0435\u0442\u043d\u0430\u043c',
       u'\u0413\u0432\u0430\u0442\u0435\u043c\u0430\u043b\u0430',
       u'\u0413\u0435\u0440\u043c\u0430\u043d\u0438\u044f',
       u'\u0413\u043e\u043d\u043a\u043e\u043d\u0433',
       u'\u0413\u0440\u0443\u0437\u0438\u044f',

Is there a fix for this?

When I simply print something, the output is fine:

In [17]: print "привет"
привет

Does anyone know how to fix this?

PM 2Ring · Accepted Answer

Here's one way to convert your list of strings into a more readable form in Python 2. This code explicitly encodes the Unicode data as UTF-8 bytes.

#!/usr/bin/env python

data = [
    u'\u0410\u0431\u0445\u0430\u0437\u0438\u044f',
    u'\u0410\u0437\u0435\u0440\u0431\u0430\u0439\u0434\u0436\u0430\u043d',
    u'\u0410\u0440\u043c\u0435\u043d\u0438\u044f',
    u'\u0411\u0435\u043b\u0430\u0440\u0443\u0441\u044c',
    u'\u0412\u044c\u0435\u0442\u043d\u0430\u043c',
    u'\u0413\u0432\u0430\u0442\u0435\u043c\u0430\u043b\u0430',
    u'\u0413\u0435\u0440\u043c\u0430\u043d\u0438\u044f',
    u'\u0413\u043e\u043d\u043a\u043e\u043d\u0433',
    u'\u0413\u0440\u0443\u0437\u0438\u044f',
]

def list_to_utf8(seq):
    # Encode each item as UTF-8 bytes and wrap it in a u'...' literal.
    t = ["    u'%s'" % s.encode('utf-8') for s in seq]
    return '[\n' + ',\n'.join(t) + '\n]'

print list_to_utf8(data)

output

[
    u'Абхазия',
    u'Азербайджан',
    u'Армения',
    u'Беларусь',
    u'Вьетнам',
    u'Гватемала',
    u'Германия',
    u'Гонконг',
    u'Грузия'
]

To use this data in your Python code, you must put a valid UTF-8 encoding declaration at the top of the script, and you must also tell your text editor to save the file in the UTF-8 encoding.
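
The declaration matters because the saved file holds UTF-8 bytes, and Python has to decode them back to the same code points. Here is a minimal sketch of that round trip on a single string. The variable names are illustrative, not from the original post, and the snippet is written so that it should run on both Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Illustrative name; same code points as the last entry in the list above.
name = u'\u0413\u0440\u0443\u0437\u0438\u044f'

encoded = name.encode('utf-8')     # byte string: two bytes per Cyrillic letter
decoded = encoded.decode('utf-8')  # back to a Unicode string

print(len(name), len(encoded))     # 6 code points, 12 bytes
print(decoded == name)             # True
```

The byte string is twice as long as the Unicode string because each of these Cyrillic letters takes two bytes in UTF-8.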

test

#!/usr/bin/env python
# -*- coding: utf_8 -*- 

data = [
    u'\u0410\u0431\u0445\u0430\u0437\u0438\u044f',
    u'\u0410\u0437\u0435\u0440\u0431\u0430\u0439\u0434\u0436\u0430\u043d',
    u'\u0410\u0440\u043c\u0435\u043d\u0438\u044f',
    u'\u0411\u0435\u043b\u0430\u0440\u0443\u0441\u044c',
    u'\u0412\u044c\u0435\u0442\u043d\u0430\u043c',
    u'\u0413\u0432\u0430\u0442\u0435\u043c\u0430\u043b\u0430',
    u'\u0413\u0435\u0440\u043c\u0430\u043d\u0438\u044f',
    u'\u0413\u043e\u043d\u043a\u043e\u043d\u0433',
    u'\u0413\u0440\u0443\u0437\u0438\u044f',
]

newdata = [
    u'Абхазия',
    u'Азербайджан',
    u'Армения',
    u'Беларусь',
    u'Вьетнам',
    u'Гватемала',
    u'Германия',
    u'Гонконг',
    u'Грузия'
]

for s1, s2 in zip(data, newdata):
    print s1 == s2, s1, s2

output

True Абхазия Абхазия
True Азербайджан Азербайджан
True Армения Армения
True Беларусь Беларусь
True Вьетнам Вьетнам
True Гватемала Гватемала
True Германия Германия
True Гонконг Гонконг
True Грузия Грузия
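
If the aim is only to read the values at the prompt, rather than to paste them back into a script, it may be enough to print the items one at a time. The \u escapes in the question come from repr(), which Python 2 uses when it echoes a container. print on a single Unicode string writes the characters themselves, assuming the terminal encoding can represent them. A minimal sketch follows; the helper name one_per_line and the shortened list are illustrative, not from the original post:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

countries = [
    u'\u0410\u0431\u0445\u0430\u0437\u0438\u044f',
    u'\u0410\u0440\u043c\u0435\u043d\u0438\u044f',
]

def one_per_line(seq):
    # Join the strings directly; repr() is never called, so no escapes appear.
    return u'\n'.join(seq)

print(one_per_line(countries))
```

The same helper should work on any iterable of Unicode strings, such as the array returned by report_sorted_geo['country'].unique() in the question.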
