Reputation: 113

Strange decoding after using split function of python( eg: \x00)

This is a very strange situation: the split function seems to change the string format. Please look at the code below.

Code:

import serial

COM_Port = serial.Serial(COM_PortName)
with COM_Port as port:
    while True:
        RxedData = port.readline()
        line = RxedData.decode('utf-8')
        print("Line 1: ", line)
        row = line.split(',')[1:-1]
        print("Line 2: ", row)

Output:

Line 1: "* , 0 0 0 0 0 5 7 5 , 2 3 : 0 3 : 4 7 , 1 1 / 0 2 / 2 0 , 1 2 . 3 4 5 , K P A , 0 0 0 0 6 . 8 3 , S L P M , T B ,                 , $ "

Line 2: ['\x000\x000\x000\x000\x000\x006\x002\x001\x00', '\x002\x000\x00:\x004\x006\x00:\x005\x001\x00', '\x001\x002\x00/\x000\x002\x00/\x002\x000\x00', '\x001\x002\x00.\x003\x004\x005\x00', '\x00K\x00P\x00A\x00', '\x000\x000\x000\x000\x000\x00.\x000\x000\x00', '\x00C\x00C\x00P\x00M\x00', '\x00T\x00G\x00', '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00']

How does Line 2 turn into \x000\x000...? What is this encoding format? How do I get it into the right format?

Edit 1:

print([hex(i) for i in RxedData])

Output:

['0x2a', '0x0', '0x2c', '0x0', '0x30', '0x0', '0x30', '0x0', '0x30', '0x0', '0x30', '0x0', '0x30', '0x0', '0x30', '0x0', '0x30', '0x0', '0x31', '0x0', '0x2c', '0x0', '0x31', '0x0', '0x31', '0x0', '0x3a', '0x0', '0x35', '0x0', '0x31', '0x0', '0x3a', '0x0', '0x35', '0x0', '0x30', '0x0', '0x2c', '0x0', '0x31', '0x0', '0x33', '0x0', '0x2f', '0x0', '0x30', '0x0', '0x32', '0x0', '0x2f', '0x0', '0x32', '0x0', '0x30', '0x0', '0x2c', '0x0', '0x31', '0x0', '0x32', '0x0', '0x2e', '0x0', '0x33', '0x0', '0x34', '0x0', '0x35', '0x0', '0x2c', '0x0', '0x4b', '0x0', '0x50', '0x0', '0x41', '0x0', '0x2c', '0x0', '0x31', '0x0', '0x32', '0x0', '0x33', '0x0', '0x34', '0x0', '0x35', '0x0', '0x2e', '0x0', '0x36', '0x0', '0x36', '0x0', '0x2c', '0x0', '0x53', '0x0', '0x4c', '0x0', '0x50', '0x0', '0x48', '0x0', '0x2c', '0x0', '0x0', '0x0', '0x0', '0x0', '0x2c', '0x0', '0x2d', '0x0', '0x2d', '0x0', '0x2d', '0x0', '0x2d', '0x0', '0x2d', '0x0', '0x2d', '0x0', '0x2d', '0x0', '0x2d', '0x0', '0x2c', '0x0', '0x24', '0x0', '0xa']

Upvotes: 1

Answers (2)

Serge Ballesta

Reputation: 148965

OK, from the hex dump of the received bytes, it appears that each ASCII character is followed by a NULL byte (\x00). That is simply the UTF-16-LE representation of the characters. The UTF-8 decode keeps the code points of the original bytes because they are all below 128, which leaves all the interleaved nulls in the string. And you cannot simply decode the byte string as UTF-16 (which is what it really is), because you got it through a readline that stopped right after the newline byte and did not read the null byte that follows it.

If you read another line, it would probably start with that null byte, making the line appear to be UTF-16-BE encoded...
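The effect can be reproduced without a serial port. This is only a minimal sketch: the sample record is made up, and the slicing merely imitates what a byte-oriented readline would return.

```python
# Hypothetical sample record; any plain-ASCII text behaves the same way.
text = "*,00000001,KPA,$\n"
raw = text.encode("utf-16-le")      # every ASCII char becomes <char> 0x00

# A byte-oriented readline stops right after the 0x0a byte, so the 0x00
# that completes the UTF-16 newline is left behind for the next read.
cut = raw.index(b"\n") + 1
first_read = raw[:cut]
leftover = raw[cut:]

# NUL is a valid UTF-8 character, so decoding succeeds and keeps every NUL.
as_utf8 = first_read.decode("utf-8")
fields = as_utf8.split(",")

print(len(first_read) % 2)   # odd length: not a whole number of UTF-16 units
print(leftover)              # the stray byte the next readline would start with
print(repr(fields[1]))       # digits with NULs in between
```

Because the first read has an odd number of bytes, decoding it as 'utf-16-le' fails, which is why the fix has to deal with that trailing null byte one way or another.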

What can be done then?

A trivial workaround is just to get rid of the null characters. If you can be sure that you will only get plain ASCII characters (no accented ones like é, no emoticons, no Greek or Cyrillic ones, etc.), this would be enough:

     RxedData = port.readline()
     line = RxedData.replace(b'\x00', b'').decode('ascii')
     print("Line 1: ", line)
     row = line.split(',')[1:-1]
     print("Line 2: ", row)

With the bytes shown in your Edit 1, you should obtain:

Line 1:  *,00000001,11:51:50,13/02/20,12.345,KPA,12345.66,SLPH,,--------,$

Line 2:  ['00000001', '11:51:50', '13/02/20', '12.345', 'KPA', '12345.66', 'SLPH', '', '--------']

The good point is that it is simple and robust, provided you only have plain ASCII.
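The workaround can be checked offline by wrapping it in a small function. This is a sketch only: parse_record is a hypothetical helper name and the sample bytes are invented to imitate a readline result.

```python
def parse_record(raw: bytes) -> list:
    """Strip interleaved NUL bytes, decode as ASCII, return the inner fields."""
    line = raw.replace(b"\x00", b"").decode("ascii").rstrip("\r\n")
    return line.split(",")[1:-1]

# Simulated readline() result: UTF-16-LE bytes cut right after the 0x0a byte.
sample = "*,00000001,11:51:50,KPA,$\n".encode("utf-16-le")[:-1]

print(parse_record(sample))
```

Decoding with 'ascii' rather than 'utf-8' makes the assumption explicit: any byte above 127 raises a UnicodeDecodeError instead of silently producing wrong text.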

The encoding-conformant way would be to use a TextIOWrapper around the serial port and specify the UTF-16-LE encoding. I could not test it (no serial port on my box and no need for one), so I am only guessing at what should be done.

import io
import serial

COM_Port = serial.Serial(COM_PortName)
with io.TextIOWrapper(io.BufferedRWPair(COM_Port, COM_Port), encoding='utf-16-le') as port:
    while True:
        line = port.readline()
        print("Line 1: ", line)
        row = line.split(',')[1:-1]
        print("Line 2: ", row)

Here, the TextIOWrapper takes care of the null byte following the newline byte and directly gives you true Unicode strings.
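The decoding side of this idea can at least be tried without hardware by putting an in-memory stream in place of the serial port. The sketch below assumes two made-up records and uses io.BytesIO as a stand-in; it says nothing about how a real serial.Serial object behaves under the wrapper (timeouts, buffering).

```python
import io

# Two hypothetical records, encoded the way the device appears to send them.
stream = io.BytesIO("*,1,KPA,$\n*,2,KPA,$\n".encode("utf-16-le"))

# The wrapper decodes first and looks for newlines in the decoded text,
# so the 0x00 after each 0x0a is consumed as part of the newline itself.
with io.TextIOWrapper(stream, encoding="utf-16-le") as port:
    rows = [line.rstrip("\n").split(",")[1:-1] for line in port]

print(rows)  # [['1', 'KPA'], ['2', 'KPA']]
```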

Upvotes: 5

b_c

Reputation: 1212

Decided to turn my comment into an answer (mostly so I could include code).

The two lines print differently because you're printing different things. On Line 1, you're printing a string directly, so the print function (or maybe the console itself, not sure) displays the characters as they are, and the NULL characters show up as blanks. On Line 2, you're printing a list, and a list is printed using the repr of each element, which escapes non-printable characters as \x00.

Your line string, after the decode, very likely has \x00 (NULL) bytes embedded in it instead of ASCII spaces (\x20).

>>> x = '*\x00,\x000\x000\x000\x000\x000\x005\x007\x005\x00'
>>> print(x)
* , 0 0 0 0 0 5 7 5
>>> print(x.split(','))
['*\x00', '\x000\x000\x000\x000\x000\x005\x007\x005\x00']

To amend my quoted comment, this appears to be based on whatever console is printing the characters. I get the above output from cmd and PowerShell, but Jupyter Notebook instead prints this: *,00000575. Note the "spaces" are now gone.

If I change a few of the \x00 bytes to \x20 instead, Jupyter will then print what you're seeing above (in the positions where they were replaced at least). This is just to show that NULL characters and Space characters can visually look identical, depending on the console displaying them.

>>> x = '*\x20,\x200\x200\x000\x000\x000\x005\x007\x005\x00'
>>> print(x)
* , 0 0000575
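Since the console cannot be trusted to show the difference, repr() or a character count gives an unambiguous view. A small sketch using the same string as above:

```python
x = '*\x20,\x200\x200\x000\x000\x000\x005\x007\x005\x00'

# repr() escapes non-printable characters, so NULs appear as \x00
# while real spaces stay as spaces, whatever the console does.
visible = repr(x)
print(visible)

# Counting gives the same answer without relying on how anything is displayed.
nul_count = x.count('\x00')
space_count = x.count(' ')
print(nul_count, space_count)  # 7 3
```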

Edit for your comment:

How to make it interprets correctly?

It depends what "correctly" means to you. In essence, everything has been interpreted correctly - your serial port is just sending across NULL bytes instead of space characters.

If you would rather have ASCII spaces instead of NULL bytes though, you can do a simple replace on the string (printed from Jupyter, which displays NULLs as nothing). You can also just use ' ' instead of '\x20' if you prefer.

>>> print(x.replace('\x00', '\x20').split(','))
['* ', ' 0 0 0 0 0 5 7 5 ']

Upvotes: 3
