Reputation: 1859
I have a text file (created by redirecting the output of the find command in Linux). The first 4 lines of the file are shown below:
/home/uujjwal/datasets/pedestrian/INRIAPerson/Train/pos/person_299.png
/home/uujjwal/datasets/pedestrian/INRIAPerson/Train/pos/crop001540.png
/home/uujjwal/datasets/pedestrian/INRIAPerson/Train/pos/crop001044.png
/home/uujjwal/datasets/pedestrian/INRIAPerson/Train/pos/person_195.png
I read it in Python 2.7 as a complete string by using the following :
fid = open('filelist.txt', 'r').read() # Successful
When I try to do the same in Python 3.5, I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 141: invalid continuation byte
I read up on the differences between Python 3.5 and 2.7 and tried to specify the ascii encoding explicitly. I determined that the encoding is ascii by using the chardet package as follows (using its command line tool):
[uujjwal@rotanev pos]$ chardetect /home/uujjwal/datasets/pedestrian/INRIAPerson/Train/pos/filelist.txt
/home/uujjwal/datasets/pedestrian/INRIAPerson/Train/pos/filelist.txt: ascii with confidence 1.0
Hence I did the following :
fstr = open(annotation_file, 'r', encoding='ascii').read() #Failure
This gave the following error :
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf4 in position 141: ordinal not in range(128)
I want to understand that :
NOTE: I did not manually modify the text file in any way.
Addendum
I checked the entire contents of the file. It has the letters a-z and A-Z and the digits 0-9 (all with well-known ASCII values), the forward slash (/) (extended ASCII value of 47) and the underscore (_) (extended ASCII value of 95), along with a dot (.) (extended ASCII value of 46). It also has the newline character (extended ASCII value of 10). There are no other characters in the file.
The byte 0xf4 corresponds to the extended ASCII value of 244 (paragraph sign). This character cannot be in the file, since the file was created by redirecting the output of the find command.
Upvotes: 0
Views: 692
Reputation: 1123590
The file is not ASCII encoded. Chardet uses heuristics and does not test the whole file, and got it wrong here. It isn't UTF-8 either, evidenced by the other error.
Chardet can't always tell what it is looking at:
>>> chardet.detect((('This mostly ASCII, with a hidden surprise' * 20) + 'hellø').encode('utf8'))
{'encoding': 'ISO-8859-2', 'confidence': 0.7225312698370376}
The ø is encoded to just two bytes:
>>> 'ø'.encode('utf8')
b'\xc3\xb8'
which is not enough information for chardet to make the right call.
Use a different codec to open the file. Exactly which codec is hard to say; in Latin-1 and Windows codepage 1252, 0xF4 is the ô character, which doesn't obviously fit the minimal data you've shown (position 141 would be within those first 4 lines).
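To find out which path actually contains the stray byte, it can help to scan the raw bytes rather than guess at a codec. The following is a minimal sketch, not a definitive tool: the helper name `find_non_ascii` and the sample data are made up for illustration, and the sample merely stands in for the real file contents.

```python
def find_non_ascii(data, context=20):
    """Return (offset, byte value, surrounding bytes) for every byte >= 0x80."""
    hits = []
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            start = max(0, offset - context)
            hits.append((offset, byte, data[start:offset + context]))
    return hits


# Hypothetical sample standing in for the real file contents.
sample = b"/tmp/pos/person_299.png\n/tmp/pos/cr\xf4p001540.png\n"

hits = find_non_ascii(sample)
for offset, byte, snippet in hits:
    print("offset {}: 0x{:02x} in {!r}".format(offset, byte, snippet))
```

For the real file, the same function could be fed `open('filelist.txt', 'rb').read()`; the printed context should show which filename carries the non-ASCII byte.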
Note that in Python 2, you really only read the binary contents of the file, without the data being decoded to Unicode text, which is why you don't get the error there.
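The difference can be reproduced in Python 3 by opening the file in binary mode, which skips decoding just as Python 2 did. This is only a sketch under assumptions: the file contents below are invented, and a temporary file stands in for `filelist.txt`.

```python
import os
import tempfile

# Hypothetical file contents: ASCII paths plus one stray 0xF4 byte.
raw = b"/tmp/pos/person_299.png\n/tmp/pos/cr\xf4p001540.png\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Binary mode returns the bytes undecoded, so nothing can fail here.
    with open(path, "rb") as f:
        data = f.read()

    # Text mode decodes while reading and trips over the stray byte.
    reasons = {}
    for codec in ("ascii", "utf-8"):
        try:
            with open(path, "r", encoding=codec) as f:
                f.read()
        except UnicodeDecodeError as exc:
            reasons[codec] = exc.reason
finally:
    os.remove(path)

print(type(data).__name__, data == raw)
print(reasons)
```

Both text-mode reads fail on the same byte; only the reason differs, because 0xF4 is outside the ASCII range and, in UTF-8, starts a multi-byte sequence that the following byte does not continue.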
Note that there is no such thing as 'extended ASCII'. The term exists but is bogus: there is no standard with that name, and it is used for almost any 8-bit codec that is a superset of ASCII. A byte value of 0xF4 means different things in different codecs; in the IBM 775 codepage it is the paragraph (pilcrow) symbol, as it is in code pages 850, 856, 857 and 858.
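A quick way to see this is to decode the single byte with a few of the codecs mentioned here. The choice of codecs in this sketch is just a sample of those that Python ships with:

```python
byte = b"\xf4"

# Decode the same byte under several 8-bit codecs that extend ASCII.
decoded = {
    codec: byte.decode(codec)
    for codec in ("latin-1", "cp1252", "cp775", "cp850")
}

for codec, char in decoded.items():
    print("{:8} -> {!r} (U+{:04X})".format(codec, char, ord(char)))
```

The same byte comes out as ô under Latin-1 and codepage 1252 but as the pilcrow under the IBM code pages, so a byte value alone says nothing about which character was meant.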
Upvotes: 3