Reputation: 21
I am trying to write a program that finds every line ending in "hwlog read" in a text file. It runs fine until the middle of the file, where it raises:
(return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2856: character maps to undefined)
The part of the code that reads the file is:
with open(name, "r") as f:
    print("DEBUG")
    for line in f:
        if len(line.split()) == 5:
            if line.split()[-2] == "hwlog" and line.split()[-1] == "read":
                input(line)
For the first few matches, it works fine:
lhsh BXP_1024 hwlog read
lhsh BXP_1024_1 hwlog read
lhsh BXP_1024_2 hwlog read
lhsh BXP_1025 hwlog read
lhsh BXP_1025_4 hwlog read
lhsh BXP_1025_5 hwlog read
lhsh BXP_2048 hwlog read
lhsh BXP_2049 hwlog read
lhsh BXP_2050 hwlog read
lhsh BXP_2051 hwlog read
lhsh BXP_2052 hwlog read
But after line 240070, it raises the error shown above. I tried re-converting the file to UTF-8, reinstalling Python, and running it on other devices, but it continues to happen. Why does this happen, and how can I fix it?
Upvotes: 1
Views: 830
Reputation: 110156
Your file has non-ASCII characters in it. If you pass no explicit encoding argument when opening a text file, Python 3 uses a default encoding configured in your OS. In this particular case I can't tell you for certain what that encoding is. It is not utf-8 or latin1: utf-8 would give a different error message ("invalid start byte"), and latin1 would not fail on 0x81.
The 'charmap' codec named in the traceback suggests a single-byte code page such as cp1252, a common default on Windows, in which 0x81 is unassigned.
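To see which default is in play on a given machine, a minimal sketch like the following may help. It assumes a standard CPython install that is not running in UTF-8 mode; the variable name is only illustrative.

```python
import locale

# Encoding that open() falls back to when no encoding= argument is given
# (assuming Python's UTF-8 mode is not enabled).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

If this prints something like cp1252 rather than utf-8, that would be consistent with the 'charmap' error in the question.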
Using latin1 will likely let you read your file without a UnicodeDecodeError. However, your data will still be broken, since 0x81 is not a meaningful character in Latin-1 (it maps to an unprintable control code), so try to find out the text encoding of your file first.
If it feels ambiguous when I talk about finding out the encoding of your file, I strongly suggest you read this article, right now, before continuing with any other task.
Now, to try a guess: while "\x81" is not meaningful by itself in utf-8, it might be the second byte of an 'Á' character, which is encoded as b"\xc3\x81".
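The byte-level reasoning can be checked directly in an interpreter. This is only an illustration of how the same bytes behave under different codecs; cp1252 is used here as an assumed stand-in for the unknown OS default.

```python
raw = b"\xc3\x81"

# Decoded as utf-8, the two bytes form a single character.
as_utf8 = raw.decode("utf-8")

# Decoded as latin-1, each byte becomes its own code point (mojibake).
as_latin1 = raw.decode("latin-1")

# Under cp1252 (an assumed Windows default), byte 0x81 has no mapping.
try:
    b"\x81".decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

print(repr(as_utf8), repr(as_latin1), cp1252_failed)
```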
So, you might give it a try - just change your file-open line to:
with open(name, "r", encoding="utf-8") as f:
If reading the whole file this way does not raise an error, then the file is almost certainly in utf-8, since bytes with a value > 127 must form meaningful sequences that are unlikely to occur by chance.
Otherwise, just set the encoding to "latin-1". It performs a transparent conversion from bytes to Unicode code points, but be aware that you are inserting mojibake into your data.
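The two steps above (try utf-8 first, fall back to latin-1) could be sketched roughly as follows. The helper names pick_encoding and hwlog_read_lines are hypothetical, not part of any library, and the filter drops the question's five-field check to keep the example short.

```python
def pick_encoding(path):
    """Return "utf-8" if the whole file decodes as utf-8, else "latin-1"."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            for _ in f:  # force a full decode; errors only surface on read
                pass
        return "utf-8"
    except UnicodeDecodeError:
        # latin-1 never fails, but non-ASCII text may come out as mojibake.
        return "latin-1"


def hwlog_read_lines(path, encoding):
    """Yield the lines whose last two fields are "hwlog read"."""
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            if line.split()[-2:] == ["hwlog", "read"]:
                yield line.rstrip("\n")
```

A call might look like hwlog_read_lines(name, pick_encoding(name)), where name is the path from the question. The first pass reads the file twice, which is the price of knowing the encoding before any output is produced.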
Upvotes: 1