Reputation: 141
I am working on a publication dataset, and after extracting data from an XML file, I got a list like this:
['21-10-2013', ['título do artigo'],['álvaro', 'joão', 'márcio'],['teste', 'operação','manobras']]
As you can see, the words are in Portuguese. To convert them to Unicode, I tried some code by Kumar McMillan that I found at farmdev.com/talks/unicode/. Here is the code:
>>> def to_unicode_or_bust(obj, encoding='utf-8'):
...     if isinstance(obj, basestring):
...         if not isinstance(obj, unicode):
...             obj = unicode(obj, encoding)
...     return obj
...
I tried the code on a simple string: ab = "trabalhar com a imaginação"
The output:
>>> cd = to_unicode_or_bust(ab)
>>> cd
u'trabalhar com a imagina\xe7\xe3o'
If I issue the print command:
>>> print cd
trabalhar com a imaginação
OK, that seems all right. But how can I apply it to the list? A naive try:
>>> lista2 = to_unicode_or_bust(lista1)
>>> print lista2
['21-10-2013', ['t\xc3\xadtulo do artigo'], ['\xc3\xa1lvaro', 'jo\xc3\xa3o', 'm\xc3\xa1rcio'], ['teste', 'opera\xc3\xa7\xc3\xa3o', 'manobras']]
Maybe it's a newbie question, but what should I do to get the correct Portuguese characters in lista2?
Upvotes: 1
Views: 1548
Reputation: 308412
The function you have is just fine, but it only works on a single string at a time: if it's passed anything other than a non-Unicode string, it just returns it as is. You're passing it a list, so it comes back to you unchanged.
This recursive function should go through every bit and reassemble it with the converted strings.
def convert_all(all, convert=to_unicode_or_bust):
    if isinstance(all, tuple):
        return tuple(convert_all(piece, convert) for piece in all)
    elif isinstance(all, list):
        return [convert_all(piece, convert) for piece in all]
    return convert(all)
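As a usage sketch, here is the same recursive idea applied to the list from the question. To keep it self-contained and runnable on both Python 2 and 3, it swaps in a hypothetical stand-in named `to_text` (it checks `bytes`, which is an alias of `str` on Python 2) for `to_unicode_or_bust`, renames the `all` parameter to `item`, and assumes the input strings are UTF-8 encoded bytes. None of those names come from the original posts.

```python
def to_text(obj, encoding='utf-8'):
    # Hypothetical stand-in for to_unicode_or_bust: decode byte strings,
    # return anything else unchanged. On Python 2, bytes is an alias of str.
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    return obj


def convert_all(item, convert=to_text):
    # Rebuild tuples and lists recursively, converting the leaves.
    if isinstance(item, tuple):
        return tuple(convert_all(piece, convert) for piece in item)
    elif isinstance(item, list):
        return [convert_all(piece, convert) for piece in item]
    return convert(item)


# The question's list, written as UTF-8 encoded byte strings (assumed encoding).
lista1 = [b'21-10-2013',
          [b't\xc3\xadtulo do artigo'],
          [b'\xc3\xa1lvaro', b'jo\xc3\xa3o', b'm\xc3\xa1rcio'],
          [b'teste', b'opera\xc3\xa7\xc3\xa3o', b'manobras']]

lista2 = convert_all(lista1)
```

After this runs, every leaf of `lista2` is a Unicode string, while the nesting of lists is the same as in `lista1`.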
Be aware that when you print a list, Python shows the repr of each element, so the strings within the list will have their non-ASCII characters shown in a \x-- or \u---- format. The individual strings will still print correctly.
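One way to look at the strings individually, rather than through the list's escaped display, is to walk the nested structure and handle each leaf on its own. The helper name `iter_strings` below is hypothetical, and the data is written with escapes only so the source stays ASCII. This is a minimal sketch, not the only way to do it.

```python
def iter_strings(item):
    # Walk nested lists/tuples and yield each leaf in order.
    if isinstance(item, (list, tuple)):
        for piece in item:
            for leaf in iter_strings(piece):
                yield leaf
    else:
        yield item


# Already-converted data, mirroring the question's list.
lista2 = [u'21-10-2013',
          [u't\xedtulo do artigo'],
          [u'\xe1lvaro', u'jo\xe3o', u'm\xe1rcio'],
          [u'teste', u'opera\xe7\xe3o', u'manobras']]

leaves = list(iter_strings(lista2))
# Printing each leaf on its own (for s in leaves: print(s)) displays the
# accented characters, provided the terminal encoding supports them.
```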
Upvotes: 2
Reputation: 2935
If you have a string literal, you can use the following syntax:
mystring = u'سلام'
Without the u'' prefix, Python 2 treats the literal as a plain byte string rather than a Unicode string.
To print a byte string (here named str), decode it first:
print str.decode('utf-8')
For your variables:
mystring = unicode(myvar)
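A small sketch of the decode step, assuming the bytes are UTF-8 encoded. The names `raw`, `text`, and `ascii_failed` are illustrative only. It also shows why the encoding matters: decoding the same bytes as ASCII, which is what Python 2's `unicode(myvar)` does by default when no encoding is given, fails on non-ASCII data.

```python
# UTF-8 bytes for the Arabic word used in the example above.
raw = b'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'

# Decoding with the right encoding gives a Unicode string of 4 characters.
text = raw.decode('utf-8')

# Decoding with the wrong codec raises instead of guessing.
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```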
Upvotes: -1