Python 3 string decode

Question

I read data from an Apache log file. Some of the text in it is escaped, as in this line:

192.168.1.17 - - [04/Aug/2016:18:45:00 +0800] "GET /d/?q=\xa9\xfa\xa4\xd1\xb7|\xa7\xf3\xa6n HTTP/1.1" 302 3734 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"

I want to decode '\xa9\xfa\xa4\xd1\xb7|\xa7\xf3\xa6n'.

In Python 2, I use this code:

print(line.decode('string-escape').decode('big5'))

The result:

明天會更好

But I can't write the right code in Python 3.

I tried this code:

with open('access.log', 'r') as f:
    line = f.read()
    print(bytes(line, 'latin-1').decode('big5'))

The result:

\xa9\xfa\xa4\xd1\xb7|\xa7\xf3\xa6n

Or this code:

with open('access.log', 'rb') as f:
    line = f.read()
    print(line.decode('big5'))

The result:

\xa9\xfa\xa4\xd1\xb7|\xa7\xf3\xa6n

It seems that when the file is read with Python 3, the '\x' becomes '\\x'. Can someone help me resolve this problem? Thank you.

jsbueno · Accepted Answer

Having the "\xDD" sequences in a file is different from having them in Python code. In Python code, the "\xDD" sequence is translated at compile time, and program memory holds just the single byte represented by the hex digits "DD". If you read the "\xDD" sequence from a file, program memory holds four bytes, one for each ASCII character of the sequence, so for "\xa9" you have the characters "\", "x", "a", "9" in memory. ('Compile time' in Python is a transparent step that happens when one runs the program.)
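A small sketch of that difference, using a raw string literal to stand in for text read from a file (the variable names are only illustrative):

```python
# In source code, the escape is resolved when the code is compiled:
# the literal holds one character with code point 0xA9.
from_source = "\xa9"

# A raw string keeps the backslash as-is, which mimics what arrives
# from a file: four separate ASCII characters.
from_file = r"\xa9"

print(len(from_source))   # 1
print(len(from_file))     # 4
print(list(from_file))    # ['\\', 'x', 'a', '9']
```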

So, if you've read a sequence in Python 3 that, when printed to your terminal, shows up as "\xa9\xfa" when you should be seeing "明", you have to do this:

1. Transparently convert the string to a bytes object using the latin1 codec (or read your file as a bytes object, opening it in binary mode to start with).
2. Decode that object back to text using the "unicode_escape" codec. This parses each "\xDD" sequence into a single character in memory.
3. Transparently convert your unicode object into bytes (yes, again). This time, instead of the four characters "\", "x", "a", "9", the bytes object has a single 0xa9 (169) byte in that position.
4. Decode this bytes object to a string again, this time using the big5 codec. There you are: you have a string object (text) with your desired Chinese characters.

This last str object is printable in any terminal or GUI that supports the characters (the printing interface should do the last encoding conversion transparently from the Python string). If you want to write those characters to a file using the Big5 encoding, pass that encoding explicitly when opening the file for writing. (Or use UTF-8, depending on your system.)
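As a rough sketch of writing the decoded text back out with an explicit encoding: the output file name and the temporary directory here are assumptions for illustration only.

```python
import os
import tempfile

text = "明天會更好"

# Hypothetical output path; a temporary directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "decoded.txt")

# Pass the encoding explicitly instead of relying on the system default.
with open(path, "w", encoding="big5") as out:
    out.write(text)

# Reading the file back in binary mode shows the raw Big5 bytes on disk.
with open(path, "rb") as raw:
    data = raw.read()

print(data)
```

Passing encoding="utf-8" instead would store the same text as UTF-8 bytes.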

So, in code, the four steps above are:

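# Reading as latin1 maps every byte to one character, so step 1 cannot fail.
with open('access.log', 'r', encoding='latin1') as f:
    line = f.read()
    step1 = line.encode("latin1")            # str -> bytes, still the literal "\xa9" text
    step2 = step1.decode("unicode_escape")   # parse "\xDD" escapes into single characters
    step3 = step2.encode("latin1")           # str -> bytes, one raw byte per character
    final_text = step3.decode("big5")        # interpret the raw bytes as Big5 text
    print(final_text)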
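For reuse, the same chain can be wrapped in a small helper. This is only a sketch: the function name decode_escaped and its encoding parameter are made up for illustration, and it assumes the input contains only characters in the latin1 range, as an Apache log line with escaped bytes would.

```python
def decode_escaped(text, encoding="big5"):
    """Turn literal "\\xDD" escapes in text into characters of the given encoding."""
    raw_escapes = text.encode("latin1")               # str -> bytes, escapes still literal
    unescaped = raw_escapes.decode("unicode_escape")  # "\xa9" (4 chars) -> one character
    raw_bytes = unescaped.encode("latin1")            # one character -> one raw byte
    return raw_bytes.decode(encoding)                 # raw bytes -> real text


# The escaped query string from the log line in the question.
sample = r"\xa9\xfa\xa4\xd1\xb7|\xa7\xf3\xa6n"
decoded = decode_escaped(sample)
print(decoded == "明天會更好")
```

Note that the "|" and "n" in the sample are not stray characters: each is the second byte of a two-byte Big5 character, which is why they must pass through the chain untouched.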

TL;DR: In Python 3, the counterpart of the "string_escape" codec is "unicode_escape", but you have to apply it when decoding a bytes object.
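If the file is opened in binary mode, the first conversion to bytes is not needed, because the data is already a bytes object. A minimal sketch, assuming a throwaway sample file that stands in for access.log:

```python
import os
import tempfile

# Hypothetical sample file holding part of the log line from the question.
path = os.path.join(tempfile.mkdtemp(), "access.log")
with open(path, "wb") as f:
    f.write(rb'"GET /d/?q=\xa9\xfa\xa4\xd1\xb7|\xa7\xf3\xa6n HTTP/1.1" 302 3734')

# Binary mode yields bytes, so "unicode_escape" can be applied directly.
with open(path, "rb") as f:
    raw = f.read()

text = raw.decode("unicode_escape").encode("latin1").decode("big5")
print(text == '"GET /d/?q=明天會更好 HTTP/1.1" 302 3734')
```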
