iso-8859-1 unicode conversion anomaly

Question

I transmit the following data through ethernet

 unsigned int  test_value[ROW][COLUMN] = {
       {0x00, 0x00, 0x00, 0x01} , /*  initializers for row indexed by 0 */
       {0x40, 0x00, 0x00, 0x01} , /*  initializers for row indexed by 1 */
       {0x80, 0x01, 0x81, 0x20} , /*  initializers for row indexed by 2 */
       {0x82, 0x52, 0x83, 0xff}   /*  initializers for row indexed by 3 */
    };

On the receiving side I decode the data with iso-8859-1. Code:

import socket
import os
import sys
import binascii
import codecs
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("147.83.49.195", 7))
listening = True

f=open("eth.bin","w+")
f1=open("eth.txt","w+")
data1=[]
while listening:
    data = sock.recv(65536).decode('iso-8859-1')
    #data=binascii.unhexlify(data)
    #for d in data:
        #d=data.decode('cp1254')
    if data:
        print(data)
        #print(addr)

        #data1.append(data)

        f=open("eth.bin","a+")
        f.write(str(data))
        f1=open("eth.txt","a+")
        f1.write(str(data))
    else:
        listening=False
#print(data1)
sock.close()

When I view the received data, every byte greater than 0x7f shows up as two bytes. For example, if I transmit 0xff, it is received as \xc3 \xbf.

Is there a way to decode 0xff as \xff and also 0x00 as \x00 at the same time? Should I be using a different decoding technique? I view the received data in the terminal by running this code:

fo=open("eth.bin","rb")
#f1=open("data.txt","w+")
data=fo.read()

print(data)


text= ' '.join('{:02x}'.format(b) for b in data)
print(text)

The content of the .bin file:

\00\00\00@\00\00 Rÿ

which gives the following result:

Received data in the terminal:
b'\x00\x00\x00\x01@\x00\x00\x01\xc2\x80\x01\xc2\x81 \xc2\x82R\xc2\x83\xc3\xbf'
00 00 00 01 40 00 00 01 c2 80 01 c2 81 20 c2 82 52 c2 83 c3 bf

Looking for any suggestion.

Tom Dalton · Accepted Answer

@TobySpeight is correct: you are calling decode('iso-8859-1') on the binary data received from the socket, which turns it into a Python string. The byte 0xFF decodes to the character ÿ (U+00FF). You are then writing these strings to a text-mode file. In text mode Python encodes strings with the platform's default encoding, which on your system is evidently UTF-8. The character ÿ is represented in UTF-8 by the 2-byte sequence [0xc3, 0xbf], which is what you see at the end of your file when you view it.
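The round trip described above can be reproduced without a socket. This is a minimal sketch using the last row of the transmitted data; the variable names are only illustrative.

```python
# Last row of the transmitted data, as raw bytes.
raw = bytes([0x82, 0x52, 0x83, 0xFF])

# What the receiving code does: turn the bytes into a str.
text = raw.decode("iso-8859-1")

# What a UTF-8 text-mode file writes for that str.
utf8_bytes = text.encode("utf-8")

# What encoding back to iso-8859-1 writes instead.
latin1_bytes = text.encode("iso-8859-1")

print(" ".join("{:02x}".format(b) for b in utf8_bytes))    # c2 82 52 c2 83 c3 bf
print(" ".join("{:02x}".format(b) for b in latin1_bytes))  # 82 52 83 ff
```

Every byte above 0x7f becomes a two-byte UTF-8 sequence, while encoding with iso-8859-1 restores the original bytes exactly.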

It sounds like you either don't want to decode the data received on the socket at all, or you want to re-encode it as 'iso-8859-1' when writing your file.

If you keep the decode step and want the original bytes in the file:

f = open("eth.bin","ab+")
f.write(data.encode("iso-8859-1"))

This converts the decoded string back to bytes before writing it to the binary file. Alternatively, you can still open the file in text mode and tell Python to use "iso-8859-1" instead of the implicit default encoding:

f = open("eth.bin", "a+", encoding="iso-8859-1")
f.write(data)
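If the decoded string is not needed at all, the decode step can be skipped and the received bytes written straight to a binary-mode file. The following is a sketch of that option rather than part of the original answer; the helper name receive_to_file and its parameters are hypothetical.

```python
import socket


def receive_to_file(sock, path, bufsize=65536):
    """Append everything received on sock to path as raw bytes.

    Hypothetical helper: no decode/encode step is involved, so a
    received 0xff is stored as the single byte 0xff.
    Returns the number of bytes written.
    """
    total = 0
    with open(path, "ab") as f:          # binary mode: bytes in, bytes out
        while True:
            chunk = sock.recv(bufsize)   # recv() returns a bytes object
            if not chunk:                # empty bytes: peer closed the connection
                break
            f.write(chunk)
            total += len(chunk)
    return total
```

With the address from the question, this could be called as receive_to_file(socket.create_connection(("147.83.49.195", 7)), "eth.bin"). Reading the file back with open("eth.bin", "rb") should then show the bytes exactly as they were sent.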
