Python: Converting Binary Literal text file to Normal Text

Question

I have a text file in this format:

b'Chapter 1 \xe2\x80\x93 BlaBla'
b'Boy\xe2\x80\x99s Dead.'

And I want to read those lines and convert them to:

Chapter 1 - BlaBla
Boy's Dead.

and replace them in the same file. I already tried encoding and decoding with print(line.encode("UTF-8", "replace")), and that didn't work.

7stud · Accepted Answer

strings = [
    b'Chapter 1 \xe2\x80\x93 BlaBla',
    b'Boy\xe2\x80\x99s Dead.',
]

for string in strings:
    print(string.decode('utf-8', 'ignore'))

--output:--
Chapter 1 – BlaBla
Boy’s Dead.

and replace them on the same file.

You generally can't rewrite lines of a file in place, because the replacement text is rarely the same length as the original. The usual approach is to write the output to a new file, delete the old file, and rename the new file to the old name. Python's fileinput module can perform that process for you:

import fileinput as fi
import sys

with open('data.txt', 'wb') as f:
    f.write(b'Chapter 1 \xe2\x80\x93 BlaBla\n')
    f.write(b'Boy\xe2\x80\x99s Dead.\n')

with open('data.txt', 'rb') as f:
    for line in f:
        print(line)

with fi.input(
        files = 'data.txt', 
        inplace = True,
        backup = '.bak',
        mode = 'rb') as f:

    for line in f:
        string = line.decode('utf-8', 'ignore')
        print(string, end="")

~/python_programs$ python3.4 prog.py
b'Chapter 1 \xe2\x80\x93 BlaBla\n'
b'Boy\xe2\x80\x99s Dead.\n'

~/python_programs$ cat data.txt
Chapter 1 – BlaBla
Boy’s Dead.
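
For reference, the write-then-rename process that fileinput automates can be sketched by hand. This is only an illustration under assumptions: the helper names rewrite_in_place and to_ascii_punctuation are made up here, and the sketch uses tempfile and os.replace from the standard library rather than anything from the answer itself. The transform shown maps the en dash and curly apostrophe to the ASCII characters the question asked for.

```python
import os
import tempfile


def rewrite_in_place(path, transform):
    """Rewrite `path` line by line via a temporary file (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the original so the final rename
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as dst, \
                open(path, encoding="utf-8", newline="") as src:
            for line in src:
                dst.write(transform(line))
        # Swap the new file in under the old name.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        os.remove(tmp_path)
        raise


def to_ascii_punctuation(line):
    # Map the en dash and right single quote from the question to ASCII.
    return line.replace("\u2013", "-").replace("\u2019", "'")
```

Calling rewrite_in_place('data.txt', to_ascii_punctuation) would then leave a single data.txt with the converted lines; unlike fileinput with backup='.bak', this sketch keeps no backup copy.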

Edit:

import fileinput as fi
import re

pattern = r"""
    \\              #Match a literal backslash...
    x               #Followed by an x...
    [a-f0-9]{2}     #Followed by any hex character, 2 times
"""

repl = ''

with open('data.txt', 'w') as f:
    print(r"b'Chapter 1 \xe2\x80\x93 BlaBla'", file=f)
    print(r"b'Boy\xe2\x80\x99s Dead.'", file=f)

with open('data.txt') as f:
    for line in f:
        print(line.rstrip()) #Output goes to terminal window

with fi.input(
        files = 'data.txt', 
        inplace = True,
        backup = '.bak') as f:

    for line in f:
        line = line.rstrip()[2:-1]
        new_line = re.sub(pattern, repl, line, flags=re.X)
        print(new_line) #Writes to file, not your terminal window

~/python_programs$ python3.4 prog.py 
b'Chapter 1 \xe2\x80\x93 BlaBla'
b'Boy\xe2\x80\x99s Dead.'

~/python_programs$ cat data.txt
Chapter 1  BlaBla
Boys Dead.
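
Note that the regex approach deletes the escaped bytes rather than converting them. If the goal is to keep the dash and the apostrophe, one alternative (not part of the answer above) is to parse each line as a Python bytes literal with ast.literal_eval and then decode the result. A minimal sketch, assuming every line is a well-formed b'...' literal; the function name literal_line_to_text is made up for illustration:

```python
import ast


def literal_line_to_text(line):
    """Turn the text "b'...'" into the string those bytes encode (hypothetical helper)."""
    # literal_eval only evaluates literals, so it does not run arbitrary code.
    value = ast.literal_eval(line.strip())
    if not isinstance(value, bytes):
        raise ValueError("expected a bytes literal, got %r" % (line,))
    return value.decode("utf-8")


# The same two lines the sample file contains, as plain text.
lines = [
    r"b'Chapter 1 \xe2\x80\x93 BlaBla'",
    r"b'Boy\xe2\x80\x99s Dead.'",
]
converted = [literal_line_to_text(line) for line in lines]
```

Here converted holds the two strings with the en dash and curly apostrophe restored. The helper could presumably stand in for the re.sub call inside the fileinput loop, provided the output file's encoding can represent those characters.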

Your file does not contain binary data, so you can read it (or write it) in text mode. It's just a matter of escaping things correctly.

Here is the first part:

print(r"b'Chapter 1 \xe2\x80\x93 BlaBla'", file=f)

Python converts certain backslash escape sequences inside a string to something else. One of the backslash escape sequences that python converts is of the format:

\xNN  #=> e.g. \xe2

The backslash escape sequence is four characters long, but python converts the backslash escape sequence into a single character.

However, I need each of the four characters to be written to the sample file I created. To keep python from converting the backslash escape sequence into one character, you can escape the beginning '\' with another '\':

\\xNN

But being lazy, I didn't want to go through your strings and escape each backslash escape sequence by hand, so I used:

r"...."

An r string (raw string) treats every backslash literally, as if each one had been escaped for you. As a result, python writes all four characters of the \xNN sequence to the file.
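
A quick way to check the difference between the three spellings is to compare lengths. This is just an illustrative sketch; the variable names are arbitrary:

```python
escaped_char = "\xe2"     # escape sequence converted: one character
doubled = "\\xe2"         # backslash escaped by hand: four characters
raw = r"\xe2"             # raw string: the same four characters

print(len(escaped_char))  # 1
print(len(doubled))       # 4
print(raw == doubled)     # True
```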

The next problem is replacing a backslash in a string using a regex, which I think was your problem to begin with. When a file contains a \, python reads that into a string as \\ to represent a literal backslash. As a result, if the file contains the four characters:

\xe2

python reads that into a string as:

"\\xe2"

which when printed looks like:

\xe2

The bottom line is: if you can see a '\' in a string that you print out, then the backslash is being escaped in the string. To see what's really inside a string, you should always use repr().

string = "\\xe2"
print(string)
print(repr(string))

--output:--
\xe2
'\\xe2'

Note that if the output has quotes around it, then you are seeing everything in the string. If the output doesn't have quotes around it, then you can't be sure exactly what's in the string.

To construct a regex pattern that matches a literal backslash in a string, the short answer is: you need to use double the number of backslashes that you would think. One level of escaping is consumed by the python string literal, and a second level is consumed by the regex engine. With the string:

"\\xe2"

you would think that the pattern would be:

pattern = "\\x"

but based on the doubling rule, you actually need:

pattern = "\\\\x"

And remember r strings? If you use an r string for the pattern, then you can write what seems reasonable, and the r string will escape all the backslashes, doubling them:

pattern = r"\\x"  #=> equivalent to "\\\\x"
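
The doubling rule can be checked directly. The following is a small sketch, with variable names chosen only for illustration, showing that the ordinary-string and raw-string spellings are the same pattern and that it matches the literal \xNN text:

```python
import re

text = r"Boy\xe2\x80\x99s Dead."    # contains literal backslashes

plain_pattern = "\\\\x[a-f0-9]{2}"  # ordinary string: four backslashes
raw_pattern = r"\\x[a-f0-9]{2}"     # raw string: two backslashes

# Both spellings produce the identical pattern string.
assert plain_pattern == raw_pattern

cleaned = re.sub(raw_pattern, "", text)
print(cleaned)  # Boys Dead.
```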
