How to fix or remove malformed utf-8 characters in Python3

Question

I have several text files that contain characters which Python 3 is having trouble handling. The most troublesome seems to be "closing" quotation marks.

I have tried reading the files with:

with open(filename, 'r', errors='backslashreplace') as file:
    text = file.read()
with open(filename, 'w', errors='backslashreplace') as file:
    file.write(text)

and when opening the file in Notepad++ to view the characters, I get xE2 x80 highlighted to indicate a non-text character, followed by \x9d in normal text.

I see that this corresponds to the byte sequence \xE2\x80\x9D. In the Python REPL I can create that bytes object manually, decode it as utf-8, and print it, and it appears as the character I expect. I am not sure why the character is not read correctly from the file.

When I read the file with errors='ignore' instead of backslashreplace, the xE2 x80 characters still appear, and I have not figured out how to remove them with string operations.

Ultimately, my goal is to replace all of these strange quotes with normal quotes. There are several ways I can imagine accomplishing this, but they all require me to somehow address (or remove) the xE2 x80 characters, or to correctly read the 3-byte \xE2\x80\x9D sequence.

Axois · Accepted Answer

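Specifying the encoding explicitly should fix the issue. Without an encoding argument, open() falls back to the platform's locale default, which on Windows is often not UTF-8. You can set it like this: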

with open(filename, 'r', encoding='utf8', errors='backslashreplace') as file:
    text = file.read()
with open(filename, 'w', encoding='utf8', errors='backslashreplace') as file:
    file.write(text)
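
To illustrate why the encoding matters, here is a minimal sketch. It assumes the default encoding on the asker's machine was cp1252 (the question does not say so, but the Notepad++ output is consistent with it). In cp1252 the byte 0x9D has no mapping, so backslashreplace turns it into the literal text \x9d, while 0xE2 and 0x80 decode to two unrelated characters. The names QUOTE_MAP and straighten_quotes are hypothetical helpers added here for the "replace with normal quotes" goal, not part of any library.

```python
# Bytes for U+201D (right double quotation mark) as stored in a UTF-8 file.
raw = b"\xe2\x80\x9d"

# Decoding as UTF-8 yields the single intended character.
as_utf8 = raw.decode("utf-8")

# Assumption: the platform default was cp1252. Byte 0x9D is undefined there,
# so backslashreplace emits the literal text \x9d after two wrong characters.
as_cp1252 = raw.decode("cp1252", errors="backslashreplace")

# Hypothetical helper: map curly quotes to their plain ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def straighten_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight quotes."""
    return text.translate(QUOTE_MAP)


# ascii() keeps the output printable on consoles that cannot show these characters.
print(ascii(as_utf8))
print(ascii(as_cp1252))
print(straighten_quotes("\u201cquoted\u201d"))
```

Once the file is read with encoding='utf8', a helper like straighten_quotes(text) can be applied to the text before it is written back.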
