Reputation: 1122
I have tried a number of solutions and read many websites, but I cannot seem to solve this. I have a file that contains message objects. Each message has a 4-byte value that is the message type, a 4-byte value that is the length, and then the message data, which is ASCII encoded as Unicode. When I print to the screen it looks like ASCII, but when I redirect the output to a file I get Unicode, so something is not right with the way I am trying to decode all this. Here is the Python script:
import sys
import codecs
import encodings.idna
import unicodedata
def getHeader(fileObj):
    mstype_array = bytearray(4)
    mslen_array = bytearray(4)
    mstype = 0
    mslen = 0
    fileObj.seek(-1, 1)
    mstype_array = fileObj.read(4)
    mslen_array = fileObj.read(4)
    mstype = int.from_bytes(mstype_array, byteorder=sys.byteorder)
    mslen = int.from_bytes(mslen_array, byteorder=sys.byteorder)
    return mstype, mslen

def getMessage(fileObj, count):
    msg = fileObj.read(count)#.decode("utf-8", "strict")
    return msg

def getFields(msg):
    msg = codecs.decode(msg, 'utf-8')
    fields = msg.split(';')
    return fields

mstype = 0
mslen = 0
with open('../putty.log', 'rb') as f:
    while True:
        byte = f.read(1)
        if not byte:
            break
        if byte == b'\x1D':
            mstype, mslen = getHeader(f)
            print(f"Msg Type: {mstype} Msg Len: {mslen}")
            msg = getMessage(f, mslen)
            print(f"Message: {codecs.decode(msg, 'utf-8')}")
            #print(type(msg))
            fields = getFields(msg)
            print("Fields:")
            for field in fields:
                print(field)
        else:
            print(f"Char read: {byte} {hex(ord(byte))}")
You can use this link to get the file to decode.
Upvotes: 1
Views: 2144
Reputation: 42477
In short, define a custom function and use it everywhere you were calling print.
import sys
def ascii_print(txt):
sys.stdout.buffer.write(txt.encode('ascii', errors='backslashreplace'))
ASCII is a subset of utf-8. The ASCII characters are indistinguishable from the same utf-8 encoded characters. Internally, all Python strings are raw Unicode. However, raw Unicode cannot be read in or written out; it must be encoded to some encoding first. On most systems the default encoding is utf-8, which is the most common standard for encoding Unicode.
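To illustrate the subset relationship, here is a quick sketch (the sample strings are just illustrative): pure-ASCII text produces identical bytes under both encodings, while non-ASCII text does not.

```python
# Pure ASCII: both encodings yield the exact same bytes
assert "hello".encode('ascii') == "hello".encode('utf-8')

# Non-ASCII: utf-8 can represent it, ascii cannot
"héllo".encode('utf-8')   # works, é becomes a two-byte sequence
try:
    "héllo".encode('ascii')
except UnicodeEncodeError:
    pass  # strict ascii encoding raises here
```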
If you want to write out using a different encoding, then you must specify that encoding. I'm assuming you need the ascii encoding for some reason.
Note that the documentation for print states:

Since printed arguments are converted to text strings, print() cannot be used with binary mode file objects. For these, use file.write(...) instead.
Now if you are redirecting stdout, you can call write() on sys.stdout directly. However, as the docs explain there:

To write or read binary data from/to the standard streams, use the underlying binary buffer object. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').
Therefore, rather than the line print(f"Message: {codecs.decode(msg, 'utf-8')}"), you might do:
ascii_msg = f"Message: {codecs.decode(msg, 'utf-8')}".encode('ascii')
sys.stdout.buffer.write(ascii_msg)
Note that I specifically called str.encode on the string and explicitly set the ascii encoding. Also note that I encoded the entire string (including the Message: prefix), not just the variable passed in (which still needs to be decoded). You then need to write that ASCII-encoded byte string directly to sys.stdout.buffer, as demonstrated on the second line.
The one issue with this is that it's possible the input will contain some non-ASCII characters. As is, a UnicodeEncodeError would occur and the program would crash. To avoid this, str.encode supports a few different options for handling errors:
Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error().
As the target output is plain text, 'backslashreplace' is probably the best way to maintain lossless output. However, 'ignore' would work too if you don't care about preserving the non-ASCII characters.
ascii_msg = f"Message: {codecs.decode(msg, 'utf-8')}".encode('ascii', errors='backslashreplace')
sys.stdout.buffer.write(ascii_msg)
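To see how the error handlers differ, here is a small sketch (the string "héllo" is just an illustrative example containing one non-ASCII character):

```python
s = "héllo"  # é is U+00E9, outside the ASCII range

# 'backslashreplace' keeps the information as an escape sequence
print(s.encode('ascii', errors='backslashreplace'))  # b'h\\xe9llo'
# 'ignore' silently drops the character
print(s.encode('ascii', errors='ignore'))            # b'hllo'
# 'replace' substitutes a question mark
print(s.encode('ascii', errors='replace'))           # b'h?llo'
```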
And yes, you will need to do that for every string you send to print. It might make sense to define a custom print function, which keeps the code more readable:
def ascii_print(txt):
sys.stdout.buffer.write(txt.encode('ascii', errors='backslashreplace'))
And then in your code you could just call that rather than print:
ascii_print(f"Message: {codecs.decode(msg, 'utf-8')}")
Upvotes: 1
Reputation: 6281
It appears that sys.stdout is behaving differently when writing to the console vs writing to a file. The manual (https://docs.python.org/3/library/sys.html#sys.stdout) says that this is expected, but only gives details for Windows.
In any case, you are writing Unicode to stdout (via print()), which is why you get Unicode in the file. You can avoid this by not decoding the message in getFields (so you could replace fields = getFields(msg) with fields = msg.split(b';')) and writing to stdout using sys.stdout.buffer.write(field + b'\n').
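Under that approach, the field-handling part of the loop might look something like the following sketch (the sample bytes stand in for a real message; the point is that nothing is ever decoded to str):

```python
import sys

msg = b"alpha;beta;gamma"  # example raw message bytes, never decoded

# Split on the raw byte delimiter instead of decoding first
fields = msg.split(b';')

# Write each field straight to the binary buffer, bypassing print()
for field in fields:
    sys.stdout.buffer.write(field + b'\n')
```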
There are apparently some issues mixing print() and sys.stdout.buffer.write(), so "Python 3: write binary to stdout respecting buffering" may be worth reading.
tl;dr - try writing the bytes without decoding to unicode at all.
Upvotes: 1