Python3 Unicode Decode Error

Question

I get UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 0: invalid continuation byte

When I try to call codecs.decode(X, 'utf-8') where X = b'\xe8\xd0\xca@\xee\xe4\xca\xc6\xd6@\xde\xcc@\xe8\xd0\xca@\xd0\xca\xe6\xe0\xca\xe4\xea\xe6\x14\xc4\xf2@\xd0\xca\xdc\xe4\xf2@\xee\xc2\xc8\xe6\xee\xde\xe4\xe8\xd0@\xd8\xde\xdc\xce\xcc\xca\xd8\xd8\xde\xee\x14\x14\xd2\xe8@\xee\xc2\xe6@\xe8\xd0\xca@\xe6\xc6\xd0\xde\xde\xdc\xca\xe4@\xd0\xca\xe6\xe0\xca\xe4\xea\xe6\x14@@@@@@\xe8\xd0\xc2\xe8@\xe6\xc2\xd2\xd8\xca\xc8@\xe8\xd0\xca@\xee\xd2\xdc\xe8\xe4\xf2@\xe6\xca\xc2\x14\xc2\xdc\xc8@\xe8\xd0\xca@\xe6\xd6\xd2\xe0\xe0\xca\xe4@\xd0\xc2\xc8@\xe8\xc2\xd6\xca\xdc@\xd0\xd2\xe6@\xd8\xd2\xe8\xe8\xd8\xca@\xc8\xc2\xea\xce\xd0\xe8\xca\xe4\x14@@@@@@\xe8\xde@\xc4\xca\xc2\xe4@\xd0\xd2\xda@\xc6\xde\xda\xe0\xc2\xdc\xf2\\x14\x14\xc4\xd8\xea\xca@\xee\xca\xe4\xca@\xd0\xca\xe4@\xca\xf2\xca\xe6@\xc2\xe6@\xe8\xd0\xca@\xcc\xc2\xd2\xe4\xf2Z\xcc\xd8\xc2\xf0\x14@@@@@@\xd0\xca\xe4@\xc6\xd0\xca\xca\xd6\xe6@\xd8\xd2\xd6\xca@\xe8\xd0\xca@\xc8\xc2\xee\xdc@\xde\xcc@\xc8\xc2\xf2\x14\xc2\xdc\xc8@\xd0\xca\xe4@\xc4\xde\xe6\xde\xda@\xee\xd0\xd2\xe8\xca@\xc2\xe6@\xe8\xd0\xca@\xd0\xc2\xee\xe8\xd0\xde\xe4\xdc@\xc4\xea\xc8\xe6\x14@@@@@@\xe8\xd0\xc2\xe8@\xde\xe0\xca@\xd2\xdc@\xe8\xd0\xca@\xda\xde\xdc\xe8\xd0@\xde\xcc@\xda\xc2\xf2\\x14\x14\xe8\xd0\xca@\xe6\xd6\xd2\xe0\xe0\xca\xe4@\xd0\xca@\xe6\xe8\xde\xde\xc8@\xc4\xca\xe6\xd2\xc8\xca@\xe8\xd0\xca@\xd0\xca\xd8\xda\x14@@@@@@\xd0\xd2\xe6@\xe0\xd2\xe0\xca@\xee\xc2\xe6@\xd2\xdc@\xd0\xd2\xe6@\xda\xde\xea\xe8\xd0\x14\xc2\xdc\xc8@\xd0\xca@\xee\xc2\xe8\xc6\xd0\xca\xc8@\xd0\xde\xee@\xe8\xd0\xca@\xec\xca\xca\xe4\xd2\xdc\xce@\xcc\xd8\xc2\xee@\xc8\xd2\xc8@\xc4\xd8\xde\xee\x14@@@@@@\xe8\xd0\xca@\xe6\xda\xde\xd6\xca@\xdc\xde\xee@\xee\xca\xe6\xe8@\xdc\xde\xee@\xe6\xde\xea\xe8\xd0\\x14\x14\xe8\xd0\xca\xdc@\xea\xe0@\xc2\xdc\xc8@\xe6\xe0\xc2\xd6\xca@\xc2\xdc@\xde\xd8\xc8@\xe6\xc2\xd2\xd8\xde\xe4\x14@@@@@@\xd0\xc2\xc8@\xe6\xc2\xd2\xd8\xca\xc8@\xe8\xde@\xe8\xd0\xca@\xe6\xe0\xc2\xdc\xd2\xe6\xd0@\xda\xc2\xd2\xdc\x14\xd2@\xe0\xe4\xc2\xf2@\xe8\xd0\xca\xca@\xe0\xea\xe8@\xd2\xdc\xe8\xde@\xf
2\xde\xdc\xc8\xca\xe4@\xe0\xde\xe4\xe8\x14@@@@@@\xcc\xde\xe4@\xd2@\xcc\xca\xc2\xe4@\xc2@\xd0\xea\xe4\xe4\xd2\xc6\xc2\xdc\xca\\x14\x14\xd8\xc2\xe6\xe8@\xdc\xd2\xce\xd0\xe8@\xe8\xd0\xca@\xda\xde\xde\xdc@\xd0\xc2\xc8@\xc2@\xce\xde\xd8\xc8\xca\xdc@\xe4\xd2\xdc\xce\x14@@@@@@\xc2\xdc\xc8@\xe8\xdeZ\xdc\xd2\xce\xd0\xe8@\xdc\xde@\xda\xde\xde\xdc@\xee\xca@\xe6\xca\xca\x14\xe8\xd0\xca@\xe6\xd6\xd2\xe0\xe0\xca\xe4@\xd0\xca@\xc4'

I also tried to use binascii.unhexlify('%x' % (int('0b' + bNum, 2))).decode('utf-8') where bNum is a long binary string

The text was originally from a utf-8 encoded .txt file

EDIT: Let's say we have two bit strings. The first is the exact bit string produced by converting some text to bits. The second was extracted from an image. The second is identical to the first up to the point where it was cut off, because the image it was hidden in didn't have enough pixels.

example: http://pastebin.com/NnaH9dEb

Why would it throw UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 0: invalid continuation byte if both contain the same data up to the point where the second one is cut off?

EDIT2: When I convert the two bit strings to hex via hex(int(, 2)) I get different results, but converting only the first couple of bytes gives the same result for both.

Mark Tolonen · Accepted Answer

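The decode of decMsg is misaligned by one bit. Both bit strings omit the leading zero bit of the first byte (compare 't' and \xe8 below), so a string only lines up on byte boundaries when its length is one less than a multiple of 8. decMsg was cut off at a length that is an exact multiple of 8, so every byte comes out shifted left by one bit. If I add 7 zero bits to the end of the message or truncate the last bit, it decodes with my method. Your code was TL;DR, so I did not go through it.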

import math

initMsg = '11101000110100001100101...'  # truncated due to post limits.
decMsg = '11101000110100001100101...'

# Only printing the first 25 chars of the message for brevity:

# Original bit string: decodes correctly.
a = int(initMsg,2)
print(a.to_bytes(math.ceil(a.bit_length()/8),'big')[:25])

# Bit string extracted from the image: misaligned.
a = int(decMsg,2)
print(a.to_bytes(math.ceil(a.bit_length()/8),'big')[:25])

# Extracted bit string with 7 zero bits appended: aligned again.
a = int(decMsg+'0000000',2)
print(a.to_bytes(math.ceil(a.bit_length()/8),'big')[:25])

# Extracted bit string with the last bit dropped: aligned again.
a = int(decMsg[:-1],2)
print(a.to_bytes(math.ceil(a.bit_length()/8),'big')[:25])

Output:

b'the wreck of the hesperus'
b'\xe8\xd0\xca@\xee\xe4\xca\xc6\xd6@\xde\xcc@\xe8\xd0\xca@\xd0\xca\xe6\xe0\xca\xe4\xea\xe6'
b'the wreck of the hesperus'
b'the wreck of the hesperus'
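
Because the bit strings above are truncated, the snippet cannot be run exactly as posted. Here is a minimal self-contained sketch of the same four cases. It assumes a short stand-in message and a bit string built with bin(), which drops the leading zero the way the question's strings appear to. The names bits_to_bytes, init_bits and cut_bits, and the 80-bit cut-off, are made up for illustration:

```python
import math

def bits_to_bytes(bits: str) -> bytes:
    """Parse a bit string as one big integer and emit it as big-endian bytes."""
    n = int(bits, 2)
    return n.to_bytes(math.ceil(n.bit_length() / 8), 'big')

text = b'the wreck of the hesperus'          # stand-in message (25 ASCII bytes)

# bin() drops the leading zero of 't' (0b01110100), so this is 199 bits, not 200.
init_bits = bin(int.from_bytes(text, 'big'))[2:]

# Hypothetical truncation to 80 bits: a multiple of 8, so the bytes no longer line up.
cut_bits = init_bits[:80]

print(bits_to_bytes(init_bits))              # full message decodes
print(bits_to_bytes(cut_bits))               # misaligned: every byte shifted left by one bit
print(bits_to_bytes(cut_bits + '0000000'))   # 87 bits: aligned again, plus one padding byte
print(bits_to_bytes(cut_bits[:-1]))          # 79 bits: aligned again
```

The conversion right-aligns the bits, so only lengths of the form 8n - 1 restore the missing leading zero in the right place.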

Compare \xe8 to t in binary (0xe8 is t shifted left by one bit):

>>> format(ord('t'),'08b')
'01110100'
>>> format(0xe8,'08b')
'11101000'
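
The same one-bit shift holds for every byte of the misaligned output, which a quick check can confirm. The sketch below shifts each byte back to the right; this only works here because the text is ASCII, so the top bit of every original byte is 0. It then shows one possible way to avoid the problem, assuming you control how the bit string is built: write a fixed 8 bits per byte so no leading zero is ever dropped. The helpers to_bits and from_bits are hypothetical names, not part of the question's code:

```python
# First 25 bytes of the misaligned output shown above.
shifted = (b'\xe8\xd0\xca@\xee\xe4\xca\xc6\xd6@\xde\xcc@'
           b'\xe8\xd0\xca@\xd0\xca\xe6\xe0\xca\xe4\xea\xe6')

# Undo the one-bit left shift (valid for ASCII only, see the note above).
recovered = bytes(b >> 1 for b in shifted)
print(recovered)

def to_bits(data: bytes) -> str:
    """Hypothetical encoder: always 8 bits per byte, leading zeros kept."""
    return ''.join(format(b, '08b') for b in data)

def from_bits(bits: str) -> bytes:
    """Hypothetical decoder: drop any trailing partial byte, then read 8 bits at a time."""
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))

bits = to_bits(recovered)
# A cut-off at an arbitrary point (83 bits here) still decodes the complete bytes.
print(from_bits(bits[:83]))
```

With fixed-width encoding, a truncated bit string loses at most the final partial byte. If the cut lands inside a multi-byte UTF-8 character, the decode step may still need an error handler such as errors='ignore'.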
