Seungmin Hong

Reputation: 21

Why do I get a Unicode decoding error in the middle of reading a file with Python?

I am trying to write a program that finds every line ending in "hwlog read" in a txt file. It runs fine until the middle of the file, where it fails with

(return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2856: character maps to <undefined>)

The part of the code that reads the file is

with open(name, "r") as f:
    print("DEBUG")
    for line in f:
        if len(line.split()) == 5:
            if line.split()[-2] == "hwlog" and line.split()[-1] == "read":
                input(line)

It works fine for the first matches:

lhsh BXP_1024 hwlog read
lhsh BXP_1024_1 hwlog read
lhsh BXP_1024_2 hwlog read
lhsh BXP_1025 hwlog read
lhsh BXP_1025_4 hwlog read
lhsh BXP_1025_5 hwlog read
lhsh BXP_2048 hwlog read
lhsh BXP_2049 hwlog read
lhsh BXP_2050 hwlog read
lhsh BXP_2051 hwlog read
lhsh BXP_2052 hwlog read

But after line 240070, it raises the error shown above. I tried re-converting the file to UTF-8, and even tried reinstalling Python and running it on other devices, but it continues to happen. Why does this happen, and how can I fix this issue?

Upvotes: 1

Views: 830

Answers (1)

jsbueno

Reputation: 110156

Your file has non-ASCII characters in it. If you pass no explicit encoding argument when opening a text file, Python 3 uses a default encoding configured by your OS. In this particular case I can't tell you exactly which encoding that is, but it is neither utf-8 nor latin1: utf-8 would give a different error message ("invalid start byte"), and latin1 would not fail on 0x81. The 'charmap' in the message does suggest a Windows code page such as cp1252, in which byte 0x81 is undefined.
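To see which default your interpreter would pick, a quick check along these lines may help. This is only a sketch, and the value printed depends on your OS and locale settings:

```python
import locale

# Encoding that open() uses in text mode when no encoding= argument is given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

On many Windows installations this prints cp1252, but that is an assumption about your machine, not something the error message proves.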

Using latin1 will let you read your file without a UnicodeDecodeError. However, your data will still be broken, since 0x81 is not a meaningful character in Latin1 (it maps to a control code), so try to find out what the text encoding of your file is first.
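One way to start investigating is to look at the raw bytes of the lines that contain non-ASCII data. The helper below is a hypothetical sketch (the name find_non_ascii_lines and its limit parameter are mine, not part of any library); it opens the file in binary mode, so no decoding happens and no error can be raised:

```python
def find_non_ascii_lines(path, limit=5):
    """Return up to `limit` (line_number, raw_bytes) pairs for lines with bytes >= 0x80."""
    hits = []
    with open(path, "rb") as f:  # binary mode: bytes are read as-is, nothing is decoded
        for number, raw in enumerate(f, start=1):
            if any(byte >= 0x80 for byte in raw):
                hits.append((number, raw))
                if len(hits) >= limit:
                    break
    return hits
```

Calling find_non_ascii_lines(name) and printing the result shows the bytes surrounding each suspicious character, which is often enough to recognize the encoding.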

If it feels ambiguous when I talk about finding out the encoding of your file, I strongly suggest you read this article right now, before continuing with any other task.

Now, to venture a guess: while "\x81" is not meaningful by itself in utf-8, it might be the second byte of an 'Á' character, which is encoded as b"\xc3\x81".
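The guess can be checked in an interactive session. The sketch below assumes cp1252 as the failing default encoding, which is only a plausible candidate; yours may differ:

```python
# 'Á' (U+00C1) takes two bytes in UTF-8; the second one is 0x81.
encoded = "\u00c1".encode("utf-8")
print(encoded)  # b'\xc3\x81'

# Decoding those bytes with cp1252 (an assumed default) fails on 0x81.
try:
    encoded.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError as exc:
    cp1252_failed = True
    print(exc)
```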

So, you might give it a try - just change your file-open line to:

with open(name, "r", encoding="utf-8") as f:

If it does not raise an error, then the file is in utf-8: bytes with values > 127 must form valid multi-byte sequences, which is very unlikely to happen by chance.
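To run that test over the whole file before doing any real work, something like the following could be used. The function name decodes_as_utf8 is hypothetical, and this is a minimal sketch rather than a full encoding detector:

```python
def decodes_as_utf8(path):
    """Return True if the whole file decodes as UTF-8, False otherwise."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            for _ in f:  # iterating forces every line to be decoded
                pass
    except UnicodeDecodeError:
        return False
    return True
```

The file has to be read to the end, because a decoding error only shows up when the offending bytes are actually reached.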

Otherwise, just set the encoding to "latin-1". It performs a transparent conversion from bytes to Unicode code points, but be aware that you are inserting mojibake into your data.
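As a small illustration of what that mojibake looks like, assuming the underlying bytes were really utf-8:

```python
raw = "\u00c1".encode("utf-8")     # b'\xc3\x81', the UTF-8 bytes for 'Á'
as_latin1 = raw.decode("latin-1")  # never fails: each byte becomes one code point
print(len(as_latin1))              # 2 characters instead of the single 'Á'

# latin-1 maps bytes to code points 1:1, so the original bytes can be recovered.
restored = as_latin1.encode("latin-1").decode("utf-8")
```

Because the mapping is reversible, text read as latin-1 can still be repaired later once the real encoding is known.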

Upvotes: 1
