Random Guy

Reputation: 19

Python3 UnicodeDecodeError on utf8

No matter what I do, I can't fix it. This is the script I need to fix:

# Read the original file and write to a new file
input_file = 'input.txt'
output_file = 'output.txt'

with open(input_file, 'rb') as f:
    content = f.read()

# Replace bytes that are not valid UTF-8 with '?' ('\ufffd' is the U+FFFD replacement character)
cleaned_content = content.decode('utf-8', errors='replace').replace('\ufffd', '?')

# Split the cleaned content into lines
lines = cleaned_content.splitlines()

# Sort the lines
sorted_lines = sorted(lines)

# Write the sorted lines to a new file
with open(output_file, 'w', encoding='utf-8') as f:
    for line in sorted_lines:
        f.write(line + '\n')

What I want is for the file to never raise UnicodeDecodeError when I open it with with open(file_path, 'r', encoding='utf-8') as file:

Long story short, I have a byte-search script that works on the sorted file. If I open the file with with open(file_path, 'r', encoding='utf-8', errors='replace') as file: the search doesn't work properly, because that changes the characters that would normally raise UnicodeDecodeError. Imagine the script reads the file like this:

a
b
�
d

If it's searching for "c" and reaches the line starting with �, it checks whether "c" comes before or after � and goes in the wrong direction (up instead of down, say), because the file is sorted according to UTF-8.
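To make the ordering problem concrete, here is a minimal sketch. The byte values are made up for illustration and are not from my real file, and it assumes the search compares lines the way Python compares bytes and strings. It shows lines that are in order as raw bytes but out of order once errors='replace' has turned the invalid byte into U+FFFD:

```python
# Hypothetical lines: the middle one is a lone 0x80 byte, which is not valid UTF-8.
raw_lines = [b"a", b"\x80", "é".encode("utf-8")]

# As raw bytes the lines are in ascending order (0x61 < 0x80 < 0xC3).
bytes_in_order = raw_lines == sorted(raw_lines)

# Decoding with errors='replace' turns the invalid byte into U+FFFD.
decoded = [line.decode("utf-8", errors="replace") for line in raw_lines]

# U+FFFD sorts after "é" (U+00E9), so the decoded lines are no longer in order.
text_in_order = decoded == sorted(decoded)

print(bytes_in_order, text_in_order)
```

A search that compares the decoded text against lines sorted in byte order can therefore step in the wrong direction.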

I want to make sure the file can't raise UnicodeDecodeError, because every character that could cause that error has been replaced with "?" before sorting.

No matter what I try, the file always ends up with those weird characters.

How can I do that?

Upvotes: 0

Views: 74

Answers (1)

pippo1980

Reputation: 3096

with open(input_file, 'r', encoding='utf-8', errors='replace') as f:
    lines = f.read().replace('�','?')

But I got different results from those in your comment above:

# Write binary test data to a file; the with block closes it automatically
with open('output_.txt', 'wb') as file:
    file.write(b'\x61\x62\x63\x64\x65\x66\x67\x0A\xC0\xC1')

Running file -i output_.txt, I get:

output_.txt: text/plain; charset=iso-8859-1
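As a rough check, the same test bytes can be run through the replace-then-sort steps in memory. This is only a sketch, not a definitive fix. It works on a bytes literal instead of a real file, and the names data, cleaned and round_trip are just for illustration:

```python
# The same bytes as the test file above; 0xC0 and 0xC1 never occur in valid UTF-8.
data = b'\x61\x62\x63\x64\x65\x66\x67\x0A\xC0\xC1'

# Decode leniently, then swap each U+FFFD replacement character for '?'.
text = data.decode('utf-8', errors='replace').replace('\ufffd', '?')

# Sort the lines and re-encode; the result contains only valid UTF-8.
cleaned = ''.join(line + '\n' for line in sorted(text.splitlines())).encode('utf-8')

# A strict decode (the default errors='strict') now succeeds without UnicodeDecodeError.
round_trip = cleaned.decode('utf-8')
print(repr(round_trip))
```

If the output file still shows � after this kind of cleaning, it may be worth checking that the search script is really opening the cleaned file and not the original one.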

Upvotes: 0
