Reputation: 90
Alright, so I'm writing code where I read a CSV file using pandas.read_csv. The problem is the encoding: I was using utf-8-sig and that works, but it gives an error with other CSV files. I found out that some files need a different encoding, such as cp1252. The problem is that I can't restrict users to a specific CSV type that matches my encoding.
So is there any solution for this? For example, is there a universal encoding that works for all CSVs? Or can I pass a list of all the possible encodings?
Upvotes: 1
Views: 16053
Reputation: 13468
You could try this: https://stackoverflow.com/a/48556203/11246056
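If you follow that link, the idea (as I understand it) is to detect the encoding before reading. Here is a minimal sketch using the chardet package, assuming chardet is installed and "input.csv" stands in for your file:
import chardet
import pandas as pd

# Guess the encoding from a sample of raw bytes; chardet returns a
# heuristic guess with a confidence score, not a guarantee
with open("input.csv", "rb") as f:
    guess = chardet.detect(f.read(100_000))  # sample size is arbitrary

df = pd.read_csv("input.csv", encoding=guess["encoding"])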
Or iterate over several encodings in a try/except statement:
encodings = ["utf-8-sig", "cp1252", "iso-8859-1", "latin1"]
try:
for encoding in encodings:
pandas.read_csv(..., encoding=encoding, ...)
...
except ValueError: # or the error you receive
continue
Upvotes: 1
Reputation: 532
Here's a similar solution that loops over different encodings. Once a valid encoding is found, break from the loop and continue!
encodings = ["utf-8","utf-8-sig", "iso-8859-1", "latin1", "cp1252"]
for encoding in encodings:
try:
dataframe = pd.read_csv(input_data_path,encoding=encoding)
break
except Exception as e: # or the error you receive
pass
Upvotes: 1
Reputation: 148890
A CSV file is a text file. If it contains only ASCII characters, there is no problem nowadays: most encodings handle plain ASCII the same way. The problem arises with non-ASCII characters. Example:
| character | Latin1 code | cp850 code | UTF-8 codes |
|---|---|---|---|
| é | '\xe9' | '\x82' | '\xc3\xa9' |
| è | '\xe8' | '\x8a' | '\xc3\xa8' |
| ö | '\xf6' | '\x94' | '\xc3\xb6' |
Things are even worse, because single-byte character sets can represent at most 256 characters, while UTF-8 can represent them all. For example, besides the normal quote character ', Unicode contains left ‘ and right ’ versions of it, and neither is representable in Latin1 or CP850.
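A quick way to see this (my own illustration, plain Python): the curly right quote has a UTF-8 encoding but no Latin1 code point, so encoding it as Latin1 fails:
quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK
print(quote.encode("utf-8"))  # b'\xe2\x80\x99'

try:
    quote.encode("latin1")  # no Latin1 code point for this character
except UnicodeEncodeError as e:
    print(e)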
Long story short, there is no such thing as a universal encoding. But certain encodings, for example Latin1, have a specificity: they can decode any byte. So if you declare a Latin1 encoding, no UnicodeDecodeError will ever be raised. Simply, if the file was actually UTF-8 encoded, a é will look like Ã©. And the right single quote ’, whose UTF-8 encoding is '\xe2\x80\x99', will appear as 'â\x80\x99' on a Latin1 system and as 'â€™' on a cp1252 one.
As you spoke of CP1252: it is a Windows variant of Latin1, but it does not share that property of being able to decode any byte, because it leaves a few byte values undefined.
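You can check that difference with nothing but the standard codecs (again my illustration, not part of the original answer):
# Latin1 maps every byte 0x00-0xFF to a character, so decoding never fails:
print(b"\x81".decode("latin1"))  # '\x81', a C1 control character

# cp1252 leaves bytes such as 0x81 undefined, so decoding can fail:
try:
    b"\x81".decode("cp1252")
except UnicodeDecodeError as e:
    print(e)  # 'charmap' codec can't decode byte 0x81 ...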
The common way is to ask the people sending you CSV files to all use the same encoding, and to try to decode with that encoding. Then you have two workarounds for badly encoded files. The first is the one proposed by CygnusX: try a sequence of encodings terminated with Latin1, for example encodings = ["utf-8-sig", "utf-8", "cp1252", "latin1"] (BTW, Latin1 is an alias for ISO-8859-1, so there is no need to test both).
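Put together, that fallback chain could look like the sketch below (read_csv_any is a name I made up; the encoding list is the one suggested above):
import pandas as pd

def read_csv_any(path, encodings=("utf-8-sig", "utf-8", "cp1252", "latin1")):
    # latin1 comes last: it can decode any byte, so this loop always returns
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue
Because Latin1 never raises a UnicodeDecodeError, the last iteration is guaranteed to succeed, though possibly with mojibake, as explained above.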
The second is to open the file with errors='replace': any offending byte will be replaced by the Unicode replacement character (U+FFFD). At least all the ASCII characters will be correct:
import pandas as pd

with open(filename, encoding='utf-8-sig', errors='replace') as file:
    fd = pd.read_csv(file, other_parameters...)
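To see what errors='replace' produces at the byte level (my illustration): an isolated 0xe9 byte, which is 'é' in Latin1, is invalid UTF-8, so it is swapped for the replacement character:
print(b"caf\xe9".decode("utf-8", errors="replace"))  # 'caf\ufffd', shown as 'caf�'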
Upvotes: 3