Reputation: 2693
I am working with an Excel table using pandas in IPython. The table contains Cyrillic words. When I try to munge the data, I always get strings of escape codes instead of words. It looks like this:
In [16]: report_sorted_geo['country'].unique()
Out[16]:
array([u'\u0410\u0431\u0445\u0430\u0437\u0438\u044f',
u'\u0410\u0437\u0435\u0440\u0431\u0430\u0439\u0434\u0436\u0430\u043d',
u'\u0410\u0440\u043c\u0435\u043d\u0438\u044f',
u'\u0411\u0435\u043b\u0430\u0440\u0443\u0441\u044c',
u'\u0412\u044c\u0435\u0442\u043d\u0430\u043c',
u'\u0413\u0432\u0430\u0442\u0435\u043c\u0430\u043b\u0430',
u'\u0413\u0435\u0440\u043c\u0430\u043d\u0438\u044f',
u'\u0413\u043e\u043d\u043a\u043e\u043d\u0433',
u'\u0413\u0440\u0443\u0437\u0438\u044f',
Is there a fix for this?
When I simply print a string, the output is fine:
In [17]: print "привет"
привет
Does anyone know how to fix this?
Upvotes: 1
Views: 8490
Reputation: 55499
Here's one way to convert your lists of strings to make them more readable in Python 2. This code explicitly encodes the Unicode data as utf-8 bytes.
#!/usr/bin/env python
data = [
u'\u0410\u0431\u0445\u0430\u0437\u0438\u044f',
u'\u0410\u0437\u0435\u0440\u0431\u0430\u0439\u0434\u0436\u0430\u043d',
u'\u0410\u0440\u043c\u0435\u043d\u0438\u044f',
u'\u0411\u0435\u043b\u0430\u0440\u0443\u0441\u044c',
u'\u0412\u044c\u0435\u0442\u043d\u0430\u043c',
u'\u0413\u0432\u0430\u0442\u0435\u043c\u0430\u043b\u0430',
u'\u0413\u0435\u0440\u043c\u0430\u043d\u0438\u044f',
u'\u0413\u043e\u043d\u043a\u043e\u043d\u0433',
u'\u0413\u0440\u0443\u0437\u0438\u044f',
]
def list_to_utf8(seq):
    # Encode each Unicode string as UTF-8 bytes and wrap it in u'...'
    t = [" u'%s'" % s.encode('utf-8') for s in seq]
    return '[\n' + ',\n'.join(t) + '\n]'

print list_to_utf8(data)
output
[
u'Абхазия',
u'Азербайджан',
u'Армения',
u'Беларусь',
u'Вьетнам',
u'Гватемала',
u'Германия',
u'Гонконг',
u'Грузия'
]
To use this output as literal data in your Python code, you must put a valid UTF-8 encoding declaration at the top of the script, and you must also tell your text editor to save the file with the UTF-8 encoding.
test
#!/usr/bin/env python
# -*- coding: utf_8 -*-
data = [
u'\u0410\u0431\u0445\u0430\u0437\u0438\u044f',
u'\u0410\u0437\u0435\u0440\u0431\u0430\u0439\u0434\u0436\u0430\u043d',
u'\u0410\u0440\u043c\u0435\u043d\u0438\u044f',
u'\u0411\u0435\u043b\u0430\u0440\u0443\u0441\u044c',
u'\u0412\u044c\u0435\u0442\u043d\u0430\u043c',
u'\u0413\u0432\u0430\u0442\u0435\u043c\u0430\u043b\u0430',
u'\u0413\u0435\u0440\u043c\u0430\u043d\u0438\u044f',
u'\u0413\u043e\u043d\u043a\u043e\u043d\u0433',
u'\u0413\u0440\u0443\u0437\u0438\u044f',
]
newdata = [
u'Абхазия',
u'Азербайджан',
u'Армения',
u'Беларусь',
u'Вьетнам',
u'Гватемала',
u'Германия',
u'Гонконг',
u'Грузия'
]
for s1, s2 in zip(data, newdata):
    print s1 == s2, s1, s2
output
True Абхазия Абхазия
True Азербайджан Азербайджан
True Армения Армения
True Беларусь Беларусь
True Вьетнам Вьетнам
True Гватемала Гватемала
True Германия Германия
True Гонконг Гонконг
True Грузия Грузия
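For context, the escapes in the question are only a display form of the text, which is how Python 2 shows Unicode strings inside containers such as lists and arrays; the characters themselves are intact. The following is a minimal sketch of that idea, not part of the original answer. It uses the standard `unicode_escape` codec to produce and reverse the escaped form, and the variable names `name`, `escaped`, and `restored` are chosen here for illustration. It should run under both Python 2 and Python 3.

```python
from __future__ import print_function

name = u'\u0410\u0431\u0445\u0430\u0437\u0438\u044f'

# Render the text with \uXXXX escapes, as an ASCII-only display would.
escaped = name.encode('unicode_escape').decode('ascii')
print(escaped)

# Decoding the escapes gives back exactly the same text.
restored = escaped.encode('ascii').decode('unicode_escape')
print(restored == name)

# The string holds 7 characters, not the 42 characters of the escaped form.
print(len(name), len(escaped))
```

If you ever receive the escaped form as literal text (for example from a log file), the second step shows one way to turn it back into readable characters.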
Upvotes: 3
Reputation: 25550
Since your terminal supports it, why not just print each element of your array and let Python take care of the glyphs:
In [49]: for e in a:
   ....:     print e
   ....:
Абхазия
Азербайджан
Армения
Беларусь
Вьетнам
Гватемала
Германия
Гонконг
Грузия
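The loop above can also be written as a single call that joins the values with newlines before printing. This is only a sketch under assumptions: the plain list `countries` stands in for `report_sorted_geo['country'].unique()` from the question, and `one_per_line` is a hypothetical helper name, not something from pandas. It should run under both Python 2 and Python 3.

```python
from __future__ import print_function

# Stand-in for report_sorted_geo['country'].unique() (assumption: the real
# array holds Unicode text strings like these).
countries = [
    u'\u0410\u0431\u0445\u0430\u0437\u0438\u044f',
    u'\u0410\u0440\u043c\u0435\u043d\u0438\u044f',
    u'\u0413\u0440\u0443\u0437\u0438\u044f',
]


def one_per_line(values):
    """Return the text values joined by newlines (hypothetical helper)."""
    return u'\n'.join(values)


# Printing the joined text shows the glyphs rather than the escapes,
# provided the terminal encoding supports Cyrillic.
print(one_per_line(countries))
```

Joining first keeps the formatting in one place, which is handy if the same text also needs to go to a file or a log.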
Upvotes: 1