Reputation: 65
I have a device that returns a UTF-8 encoded string. I can only read from it byte-by-byte and the read is terminated by a byte of value 0x00.
I'm writing a Python 2.7 function that lets others access my device and returns the string.
In a previous design when the device just returned ASCII, I used this in a loop:
x = read_next_byte()
if x == 0:
    break
my_string += chr(x)
Where x is the latest byte value read from the device.
Now the device can return a UTF-8 encoded string, but I'm not sure how to convert the bytes that I get back into a UTF-8 encoded string/unicode.
chr(x) understandably causes an error when x > 127, so I thought that using unichr(x) might work, but that assumes the value passed is a full Unicode character value, whereas I only have one part of it, in the range 0-255.
So how can I convert the bytes that I get back from the device into a string that can be used in Python and still handle the full UTF-8 string?
Likewise, if I was given a UTF-8 string in Python, how would I break that down into individual bytes to send to my device and still maintain UTF-8?
Upvotes: 5
Views: 3451
Reputation: 155393
The correct solution would be to read until you hit the terminating byte, then decode from UTF-8 at that point (so you have all the bytes of every multi-byte character):
mybytes = bytearray()
while True:
    x = read_next_byte()
    if x == 0:
        break
    mybytes.append(x)
my_string = mybytes.decode('utf-8')
The above is the most direct translation of your original code. Interestingly, this is one of those cases where two-argument iter can be used to dramatically simplify the code. It turns your C-style stateful byte reader function into a Python iterator that keeps calling the function until it returns the sentinel value 0, which lets you one-line the work:
# If this were Python 3 code, you'd use the bytes constructor instead of bytearray
my_string = bytearray(iter(read_next_byte, 0)).decode('utf-8')
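The question also asks about the reverse direction (sending a UTF-8 string to the device byte by byte). Here is a minimal sketch of both directions that can be run without the hardware. Only read_next_byte comes from the question; make_fake_reader, read_utf8_string, send_utf8_string and the write_next_byte callback are hypothetical names invented for this illustration, and the real device API may differ.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-ins for the real device I/O; only read_next_byte
# appears in the question, everything else is invented for illustration.

def make_fake_reader(payload):
    """Return a read_next_byte() look-alike that yields the payload's
    byte values one at a time, followed by the 0x00 terminator."""
    data = iter(bytearray(payload) + bytearray([0]))
    return lambda: next(data)

def read_utf8_string(read_next_byte):
    """Collect byte values up to the 0x00 terminator, then decode them all at once."""
    return bytearray(iter(read_next_byte, 0)).decode('utf-8')

def send_utf8_string(text, write_next_byte):
    """Encode text as UTF-8 and pass each byte value (0-255) to the writer,
    finishing with the 0x00 terminator."""
    for b in bytearray(text.encode('utf-8')):
        write_next_byte(b)
    write_next_byte(0)

# Reading: u'caf\u00e9' is 4 characters but 5 bytes in UTF-8.
received = read_utf8_string(make_fake_reader(u'caf\u00e9'.encode('utf-8')))

# Writing: capture what would be sent to the device in a list.
sent = []
send_utf8_string(u'caf\u00e9', sent.append)
```

Note that a string containing the character U+0000 encodes to a 0x00 byte, which the device would read as the terminator, so such input would need to be rejected or escaped before sending.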
Upvotes: 4