Reputation: 110163
I have a string from reading a .txt file that looks something like this:
str='\x00I\x00S\x00T\x00A\x00\r\x00\n\x00[\x00/\x00B\x00O\x00D\x00Y\x00]\x00\r\x00\n\x00'
The contents of the file are in Portuguese, and it won't allow me to encode the string into UTF-8.
When I do print(str), it comes out properly, but when I try to do stuff with the characters, I get the following error: UnicodeDecodeError: 'utf8' codec can't decode byte...
What do I need to do to get the contents of the string so I can work with it? Thank you.
Edit: actually, the print statement is NOT working correctly, as certain accents are replaced with ? in the print statement.
Upvotes: 0
Views: 467
Reputation: 798646
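You need to decode it to a unicode object first. The alternating \x00 bytes suggest the data is UTF-16 rather than UTF-8, which is why the utf8 codec rejects it.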
>>> '\x00I\x00S\x00T\x00A\x00\r\x00\n\x00[\x00/\x00B\x00O\x00D\x00Y\x00]\x00\r\x00\n'.decode('utf-16be')
u'ISTA\r\n[/BODY]\r\n'
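Once the bytes are decoded, the text can be re-encoded as UTF-8, which is what the question was trying to do. The following is a minimal sketch, assuming the data really is UTF-16 big-endian with no byte order mark; the variable names and the accented sample word are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Raw bytes as read from the file (assumed here to be UTF-16 big-endian, no BOM).
raw = b'\x00I\x00S\x00T\x00A\x00\r\x00\n\x00[\x00/\x00B\x00O\x00D\x00Y\x00]\x00\r\x00\n'

# Step 1: decode the bytes into a unicode string.
text = raw.decode('utf-16-be')

# Step 2: encode the unicode string as UTF-8 if a byte string is needed.
utf8_bytes = text.encode('utf-8')

# Accented characters survive the round trip (hypothetical sample word).
accented_raw = b'\x00a\x00\xe7\x00\xe3\x00o'
accented_text = accented_raw.decode('utf-16-be')
accented_utf8 = accented_text.encode('utf-8')
```

The key point is the order: decode from the file's real encoding first, then encode to the target encoding. Calling encode directly on the raw byte string is what tends to trigger the UnicodeDecodeError in Python 2.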
If it's from a file, then use codecs.open() instead of open(), passing the appropriate encoding.
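A rough sketch of the file-based approach is below. It writes a small temporary file first so the example is self-contained; the file name, the sample text, and the choice of 'utf-16-be' are assumptions, so substitute whatever encoding the real file uses. In Python 3 the built-in open() also accepts an encoding argument.

```python
import codecs
import os
import tempfile

# Create a throwaway UTF-16BE file to read back (hypothetical sample data).
sample = u'ISTA\r\n[/BODY]\r\n'
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'wb') as f:
    f.write(sample.encode('utf-16-be'))

# codecs.open() decodes while reading, so read() returns unicode text
# instead of raw bytes full of \x00 characters.
with codecs.open(path, 'r', encoding='utf-16-be') as f:
    contents = f.read()
```

Here contents is already a unicode string, so no separate decode step is needed before working with the characters.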
Upvotes: 4