Reputation: 65
I have several text files containing characters that Python 3 has trouble handling. The most troublesome seem to be "closing" quotation marks.
I have tried reading the files with:
    with open(filename, 'r', errors='backslashreplace') as file:
        text = file.read()
    with open(filename, 'w', errors='backslashreplace') as file:
        file.write(text)
and when I open the file in Notepad++ to view the characters, I get xE2 x80 highlighted to indicate a non-text character, followed by \x9d in normal text.
I see that this corresponds to the byte sequence \xE2\x80\x9D. In the Python REPL I can manually create a bytes object with these bytes, decode it as UTF-8, and when printed it appears as the character I expect. I am not sure why the character is not understood correctly when reading the file.
When reading the file with errors set to ignore rather than backslashreplace, I still get the xE2 x80 characters appearing, and I have not figured out how to perform string operations to remove them.
Ultimately, my goal is to replace all of these strange quotes with normal quotes. There are several ways I can imagine accomplishing this, but they all require me to somehow address (or remove) the xE2 x80 characters, or to correctly read the 3-byte \xE2\x80\x9D character.
Upvotes: 1
Views: 716
Reputation: 1
To create a copy of the file omitting erroneous characters:
    def sanitize_file(original_filename, sanitized_filename):
        # errors='ignore' silently drops any bytes that are not valid UTF-8
        with open(original_filename, 'r', encoding='utf8', errors='ignore') as original_file:
            with open(sanitized_filename, 'w', encoding='utf8') as sanitized_file:
                sanitized_file.write(original_file.read())

    sanitize_file(filename, 'sanitized_' + filename)
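To make the effect of errors='ignore' concrete, here is a small sketch on an in-memory bytes object. The sample bytes are made up for illustration and are not taken from the asker's files. It suggests that a valid UTF-8 closing quote survives decoding untouched, while a byte that is not valid UTF-8 simply disappears.

```python
# Made-up sample: a stray Latin-1 byte (0xE9) followed by a space and a
# valid UTF-8 closing quote (bytes E2 80 9D, i.e. U+201D).
raw = b"caf\xe9 \xe2\x80\x9d"

# 'ignore' drops the undecodable byte without any trace.
ignored = raw.decode("utf8", errors="ignore")

# 'replace' leaves a visible U+FFFD marker where the bad byte was.
replaced = raw.decode("utf8", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

If silent data loss is a concern, errors='replace' (or the default strict mode) at least shows where the bad bytes were.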
Upvotes: 0
Reputation: 2061
Specifying the encoding explicitly should fix the issue. Without it, open() falls back to the platform's default encoding, which may not be UTF-8. You can do so like this:
    with open(filename, 'r', encoding='utf8', errors='backslashreplace') as file:
        text = file.read()
    with open(filename, 'w', encoding='utf8', errors='backslashreplace') as file:
        file.write(text)
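A rough sketch of why this helps, and of how to get from there to the stated goal of replacing the quotes. It assumes the default encoding on the asker's machine was cp1252 (common on Windows, but not stated in the question); under that assumption the three bytes decode to two unrelated characters plus an escaped \x9d, which matches what Notepad++ showed. The names QUOTE_MAP and straighten_quotes are hypothetical helpers, not part of any library.

```python
raw = b"\xe2\x80\x9d"

# Decoded as UTF-8, the three bytes form one character: U+201D.
as_utf8 = raw.decode("utf-8")

# Decoded as cp1252 (assumed default), 0xE2 and 0x80 map to two separate
# characters and 0x9D is undefined, so backslashreplace escapes it.
as_cp1252 = raw.decode("cp1252", errors="backslashreplace")

# Hypothetical mapping from curly quotes to plain ASCII quotes.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
})


def straighten_quotes(text):
    """Return text with curly quotes replaced by ASCII quotes."""
    return text.translate(QUOTE_MAP)


print(repr(as_utf8))
print(repr(as_cp1252))
print(straighten_quotes("\u201chello\u201d"))
```

Once the file has been read with encoding='utf8', passing text through straighten_quotes before writing it back should replace the quotes without any byte-level manipulation.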
Upvotes: 2