hjmnzs

Reputation: 141

Decoding a list of encoded strings

I am working on a publication dataset, and after extracting data from an XML file, I got a list like this:

['21-10-2013', ['título do artigo'], ['álvaro', 'joão', 'márcio'], ['teste', 'operação', 'manobras']]

As you can see, the words are in Portuguese. To convert them to Unicode, I tried code from Kumar McMillan that I found at farmdev.com/talks/unicode/. Here is the code:

>>> def to_unicode_or_bust(obj, encoding='utf-8'):
...     if isinstance(obj, basestring):
...         if not isinstance(obj, unicode):
...             obj = unicode(obj, encoding)
...     return obj
... 

I tried the code on a simple string: ab = "trabalhar com a imaginação"

The output:

>>> cd=to_unicode_or_bust(ab)
>>> cd
u'trabalhar com a imagina\xe7\xe3o'

If I issue the print command:

>>> print cd
trabalhar com a imaginação

OK, that seems all right. But how can I apply it to the list above (lista1)? A naive try:

>>> lista2 = to_unicode_or_bust(lista1)
>>> print lista2
['21-10-2013', ['t\xc3\xadtulo do artigo'], ['\xc3\xa1lvaro', 'jo\xc3\xa3o', 'm\xc3\xa1rcio'], ['teste', 'opera\xc3\xa7\xc3\xa3o', 'manobras']]

Maybe it's a newbie question, I know, but what should I do to get the correct Portuguese characters in lista2?

Upvotes: 1

Views: 1548

Answers (2)

Mark Ransom

Reputation: 308412

The function you have is just fine, but it only works on a single string at a time. If it's passed anything other than a non-Unicode string, it just returns it. You're passing it a list, so the list comes back to you unchanged.

This recursive function should walk through every nested list or tuple and rebuild it with the converted strings.

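def convert_all(data, convert=to_unicode_or_bust):
    # Rebuild tuples and lists recursively, converting each element.
    if isinstance(data, tuple):
        return tuple(convert_all(piece, convert) for piece in data)
    elif isinstance(data, list):
        return [convert_all(piece, convert) for piece in data]
    # Anything else (a string, a number, ...) goes straight to the converter.
    return convert(data)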

Be aware that when you print a list, Python shows the repr() of each element, so the strings within the list will have their non-ASCII characters shown in a \x-- or \u---- format. The individual strings will still print correctly.
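As a rough illustration of the recursive approach, here is a minimal self-contained sketch applied to the list from the question. It is written so that it should run on both Python 2 and Python 3, which means it departs from the original code: to_text is a hypothetical stand-in for to_unicode_or_bust that checks for bytes instead of basestring, and the sample list uses explicit byte literals on the assumption that the extracted strings are UTF-8 encoded.

```python
def to_text(obj, encoding='utf-8'):
    """Decode a byte string to text; return anything else unchanged.

    Hypothetical stand-in for to_unicode_or_bust (assumes UTF-8 input).
    """
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    return obj


def convert_all(data, convert=to_text):
    """Recursively apply convert to every item in nested lists and tuples."""
    if isinstance(data, tuple):
        return tuple(convert_all(piece, convert) for piece in data)
    if isinstance(data, list):
        return [convert_all(piece, convert) for piece in data]
    return convert(data)


# The question's list, written as UTF-8 byte strings (assumed encoding).
lista1 = [b'21-10-2013', [b't\xc3\xadtulo do artigo'],
          [b'\xc3\xa1lvaro', b'jo\xc3\xa3o', b'm\xc3\xa1rcio'],
          [b'teste', b'opera\xc3\xa7\xc3\xa3o', b'manobras']]

lista2 = convert_all(lista1)

# Build one text string from the author names instead of printing the list.
authors = u', '.join(lista2[2])
```

Printing authors, or any single item such as lista2[1][0], on a terminal that can display UTF-8 should show the accented characters, whereas printing lista2 itself shows the repr of each element.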

Upvotes: 2

PersianGulf

Reputation: 2935

If you have the string literal itself, you can use the following syntax:

mystring = u'سلام'

If you leave out the u'' prefix, Python does not treat the literal as Unicode.

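To print a byte string, decode it first:

print myvar.decode('utf-8')

For your variables, pass the encoding explicitly, because unicode(myvar) falls back to the ASCII codec and fails on accented characters:

mystring = unicode(myvar, 'utf-8')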
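To see why the codec matters when decoding, here is a small sketch. The sample value raw is an assumption for illustration (the UTF-8 bytes of one word from the question), and the code is written so that it should run on both Python 2 and Python 3.

```python
# UTF-8 bytes for the Portuguese word "operação" (illustrative sample value).
raw = b'opera\xc3\xa7\xc3\xa3o'

# Decoding with the codec the bytes were written in gives the expected text.
text = raw.decode('utf-8')

# Decoding as ASCII fails, because bytes above 0x7F are not valid ASCII.
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# The byte string is longer than the text: each accented letter takes 2 bytes.
byte_count, char_count = len(raw), len(text)
```

In Python 2, unicode(myvar) without an encoding argument uses the ASCII codec, so it runs into the same error on data like this.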

Upvotes: -1
