Reputation: 19
No matter what I do, I can't fix it. The script I need to fix is this:
# Read the original file and write a cleaned, sorted copy to a new file
input_file = 'input.txt'
output_file = 'output.txt'

with open(input_file, 'rb') as f:
    content = f.read()

# Replace anything that is not valid UTF-8 with '?'
cleaned_content = content.decode('utf-8', errors='replace').replace('�', '?')

# Split the cleaned content into lines
lines = cleaned_content.splitlines()

# Sort the lines
sorted_lines = sorted(lines)

# Write the sorted lines to the new file
with open(output_file, 'w', encoding='utf-8') as f:
    for line in sorted_lines:
        f.write(line + '\n')
What I want is for the file to never give me a UnicodeDecodeError when I do with open(file_path, 'r', encoding='utf-8') as file:
Long story short, I have a binary-search script that works on the sorted file. If I do with open(file_path, 'r', encoding='utf-8', errors='replace') as file:
it doesn't work properly, because it changes the character that would normally give the UnicodeDecodeError. Imagine the file is like this; it reads it as:
a
b
�
d
If it's searching for "c" and comes to the line starting with �, it checks whether "c" comes before or after � and then goes in the wrong direction (up instead of down, say), because the file is sorted by its UTF-8 bytes.
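For example, the comparison the search relies on flips depending on whether the line contains '?' or the replacement character (a small illustration, not part of my actual script):
# '?' is U+003F and sorts before 'c', while the replacement character U+FFFD sorts after it,
# so a comparison against the line sends the search in opposite directions.
print('?' < 'c')        # True  -> the search would move down
print('\ufffd' < 'c')   # False -> the search would move up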
I want to make sure the file can never give me a UnicodeDecodeError, because every character that could cause that error has been replaced with "?" before sorting.
No matter what I try, the output still contains those weird characters.
How can I do that?
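For reference, this is the kind of check I want the output to pass afterwards (a minimal sketch; output.txt is the file written by the script above):
# Re-read strictly: this raises UnicodeDecodeError if the file still contains invalid bytes,
# and the assert catches any replacement characters that slipped through.
with open('output.txt', 'r', encoding='utf-8') as f:
    cleaned = f.read()
assert '\ufffd' not in cleaned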
Upvotes: 0
Views: 74
Reputation: 3096
with open(input_file, 'r', encoding='utf-8', errors='replace') as f:
    lines = f.read().replace('�', '?')
But I got different results than in your comment above. I created the test file like this:
file = open('output_.txt', 'wb')
try:
    # Write the test bytes to the file
    file.write(b'\x61\x62\x63\x64\x65\x66\x67\x0A\xC0\xC1')
finally:
    # Close the file
    file.close()
Using file -i output_.txt I get:
output_.txt: text/plain; charset=iso-8859-1
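Reading that test file back with the read-and-replace line from the top of this answer gives the following (a sketch; output_.txt is the file written above):
# \xC0 and \xC1 are not valid UTF-8, so each decodes to U+FFFD and is then turned into '?'
with open('output_.txt', 'r', encoding='utf-8', errors='replace') as f:
    print(repr(f.read().replace('�', '?')))   # 'abcdefg\n??'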
Upvotes: 0