David542

Reputation: 110163

Cannot properly decode string

I have a string from reading a .txt file that looks something like this:

str='\x00I\x00S\x00T\x00A\x00\r\x00\n\x00[\x00/\x00B\x00O\x00D\x00Y\x00]\x00\r\x00\n\x00'

The contents of the file are in Portuguese, and I can't encode the string to utf-8.

When I do print(str), it comes out properly, but when I try to work with the characters, I get the following error: UnicodeDecodeError: 'utf8' codec can't decode byte... What do I need to do to get the contents of the string so I can work with it? Thank you.

Edit: actually, the print statement is NOT working correctly either, as certain accented characters are replaced with ? in the output.

Upvotes: 0

Views: 467

Answers (1)

Ignacio Vazquez-Abrams

Reputation: 798646

You need to decode it to a unicode object first. The \x00 byte in front of each ASCII character indicates big-endian UTF-16.

>>> '\x00I\x00S\x00T\x00A\x00\r\x00\n\x00[\x00/\x00B\x00O\x00D\x00Y\x00]\x00\r\x00\n'.decode('utf-16be')
u'ISTA\r\n[/BODY]\r\n'
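The same round trip can be sketched as a small script that decodes the bytes and then re-encodes them as UTF-8. This is only an illustration: it is written with a b'' literal so it also runs on Python 3, the variable names are made up, and the sample drops the stray trailing \x00 from the question so the byte count is even.

```python
# Sample bytes from the question (without the trailing NUL byte),
# as they would look after reading the file in binary mode.
raw = b'\x00I\x00S\x00T\x00A\x00\r\x00\n\x00[\x00/\x00B\x00O\x00D\x00Y\x00]\x00\r\x00\n'

# Step 1: bytes -> text, using the encoding the file was actually written in.
text = raw.decode('utf-16-be')

# Step 2: text -> bytes in the encoding you want to work with.
utf8_bytes = text.encode('utf-8')

print(repr(text))
print(repr(utf8_bytes))
```

Calling encode('utf-8') directly on the undecoded byte string is what triggers the UnicodeDecodeError on Python 2, because Python first tries an implicit decode with the default codec.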

If it's from a file then use codecs.open() instead of open(), passing the appropriate encoding.
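A minimal sketch of reading the file with an explicit encoding might look like the following. It uses io.open() rather than codecs.open(); both accept an encoding argument, and io.open() is the same function as the built-in open() on Python 3. The helper name read_text, the temporary file, and the Portuguese sample text are all hypothetical, and 'utf-16-be' is assumed from the bytes shown in the question; the real file may need a different encoding.

```python
import io
import os
import tempfile


def read_text(path, encoding='utf-16-be'):
    """Return the file's contents as text, decoded with the given encoding."""
    # newline='' keeps the original \r\n line endings instead of translating them.
    with io.open(path, 'r', encoding=encoding, newline='') as handle:
        return handle.read()


# Hypothetical sample with accented characters, written as UTF-16 big-endian.
sample = u'INSTRU\u00c7\u00c3O\r\n[/BODY]\r\n'

fd, path = tempfile.mkstemp(suffix='.txt')
try:
    with os.fdopen(fd, 'wb') as out:
        out.write(sample.encode('utf-16-be'))
    content = read_text(path)
finally:
    os.remove(path)

print(repr(content))
```

Because the decoding happens when the file is read, the rest of the program only ever sees text, and the accented characters survive intact.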

Upvotes: 4
