Reputation: 71
I am trying to read a text file using the following statement:
with open(inputFile) as fp:
    for line in fp:
        if len(line) > 0:
            lineRecords.append(line.strip())
The problem is that I get the following error:
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6880: character maps to <undefined>
My question is: how can I identify exactly where in the file the error occurs? The position Python reports appears to be relative to the record being read at the time, not the absolute position in the file. So is it the 6,880th character in record 20 or the 6,880th character in record 2000? Without record information, the position value returned by Python is worthless.
Bottom line: is there a way to get Python to tell me what record it was processing at the time it encountered the error?
(And yes I know that 0x9d is a tab character and that I can do a search for that but that is not what I am after.)
Thanks.
Update: the post at "UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function" has nothing to do with the question I am asking, which is how I can get Python to tell me what record of the input file it was reading when it encountered the Unicode error.
Upvotes: 1
Views: 365
Reputation: 56
I have faced this issue before, and the easiest fix is to open the file with an explicit UTF-8 encoding:
with open(inputFile, encoding="utf8") as fp:
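If the file turns out not to be valid UTF-8, the same error can still appear. A minimal sketch of one way to keep reading and still learn which records are affected is to pass errors="replace" to open(), so each undecodable byte becomes U+FFFD, and then note the lines that contain it. The function name lines_with_replacements is hypothetical, and the sketch assumes the file has no legitimate U+FFFD characters of its own:

```python
def lines_with_replacements(path, encoding="utf-8"):
    """Return the 1-based numbers of lines holding undecodable bytes.

    Hypothetical helper: relies on errors="replace" turning each bad
    byte into U+FFFD instead of raising UnicodeDecodeError.
    """
    hits = []
    with open(path, encoding=encoding, errors="replace") as fp:
        for line_no, line in enumerate(fp, start=1):
            # U+FFFD is the replacement character inserted by the decoder
            if "\ufffd" in line:
                hits.append(line_no)
    return hits
```

Calling lines_with_replacements(inputFile) then gives the record numbers to inspect by hand.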
Upvotes: 0
Reputation: 308111
I think the only way is to track the line number separately and output it yourself.
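with open(inputFile) as fp:
    num = 0
    try:
        # start=1 so num is the 1-based number of the last line read successfully
        for num, line in enumerate(fp, start=1):
            if len(line) > 0:
                lineRecords.append(line.strip())
    except UnicodeDecodeError as e:
        # the error was raised while fetching the line after the last good one
        print('Line', num + 1, e)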
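One caveat: in text mode Python decodes the file in buffered chunks rather than record by record, so the exception may be raised before the loop has reached the offending line, and the number printed can be too low. A sketch that sidesteps this is to read in binary mode and decode each line yourself, which gives both the line number and the absolute byte offset. The function name find_undecodable_lines is hypothetical, and the cp1252 default is an assumption based on the 'charmap' error message; substitute whatever encoding the file should have:

```python
def find_undecodable_lines(path, encoding="cp1252"):
    """Return (line_number, absolute_byte_offset, bad_byte) tuples.

    Hypothetical helper: only the first bad byte on each line is reported.
    """
    problems = []
    offset = 0  # absolute byte offset of the start of the current line
    with open(path, "rb") as fp:  # binary mode: nothing is decoded implicitly
        for line_no, raw in enumerate(fp, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as e:
                # e.start is the index of the bad byte within this line only
                problems.append((line_no, offset + e.start, raw[e.start]))
            offset += len(raw)
    return problems
```

For example, find_undecodable_lines(inputFile) returns a list you can print to see every record that fails to decode, instead of stopping at the first one.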
Upvotes: 2
Reputation: 106445
You can use the read method of the file object to obtain the first 6880 characters and encode them; the length of the resulting bytes object will be the index of the starting byte of the offending character:
with open(inputFile) as fp:
    print(len(fp.read(6880).encode()))
Upvotes: 0