abdullatif

Reputation: 90

type of encoding to read csv files in pandas

Alright, so I'm writing code that reads a CSV file using pandas.read_csv. The problem is with the encoding: I was using utf-8-sig and it works, but it raises an error with other CSV files. I found out that some files need other encodings, such as cp1252. The problem is that I can't restrict the user to a specific CSV type that matches my encoding. Is there any solution for this? For example, is there a universal encoding that works for all CSVs? Or can I pass an array of all the possible encodings?

Upvotes: 1

Views: 16053

Answers (3)

Laurent

Reputation: 13468

You could try this: https://stackoverflow.com/a/48556203/11246056
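In case that link goes stale: it points at detecting the encoding before reading (an assumption on my part about what it covers). A minimal sketch using the third-party chardet package:

import chardet
import pandas as pd

# Read raw bytes (no decoding yet) and let chardet guess the encoding.
with open("data.csv", "rb") as f:             # "data.csv" is illustrative
    guess = chardet.detect(f.read(100_000))   # sample the first ~100 kB

df = pd.read_csv("data.csv", encoding=guess["encoding"])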

Or iterate over several formats in a try/except statement:

encodings = ["utf-8-sig", "cp1252", "iso-8859-1", "latin1"]
try:
    for encoding in encodings:
        pandas.read_csv(..., encoding=encoding, ...)
        ...
except ValueError:  # or the error you receive
    continue

Upvotes: 1

Doracahl

Reputation: 532

Here's a similar solution that loops over different encodings. Once one succeeds, break out of the loop and continue with the rest of your program:

encodings = ["utf-8","utf-8-sig", "iso-8859-1", "latin1", "cp1252"]
for encoding in encodings:
    try:
        dataframe = pd.read_csv(input_data_path,encoding=encoding)
        break
    except Exception as e:  # or the error you receive
        pass
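Design note: Python's for/else fits this pattern well; the else branch runs only if the loop finishes without hitting break, i.e. when no encoding worked. A minimal sketch of that variant:

for encoding in encodings:
    try:
        dataframe = pd.read_csv(input_data_path, encoding=encoding)
        break
    except (UnicodeDecodeError, UnicodeError):
        continue
else:  # no break: every encoding failed
    raise ValueError(f"could not decode {input_data_path} with any of {encodings}")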

Upvotes: 1

Serge Ballesta

Reputation: 148890

A CSV file is a text file. If it contains only ASCII characters, there is no problem nowadays: most encodings handle plain ASCII identically. The problems arise with non-ASCII characters. Example:

character | Latin1 code | CP850 code | UTF-8 codes
é         | '\xe9'      | '\x82'     | '\xc3\xa9'
è         | '\xe8'      | '\x8a'     | '\xc3\xa8'
ö         | '\xf6'      | '\x94'     | '\xc3\xb6'
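You can reproduce that table directly in Python:

for ch in "éèö":
    print(ch, ch.encode("latin1"), ch.encode("cp850"), ch.encode("utf-8"))
# é b'\xe9' b'\x82' b'\xc3\xa9'
# è b'\xe8' b'\x8a' b'\xc3\xa8'
# ö b'\xf6' b'\x94' b'\xc3\xb6'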

Things are even worse, because single-byte character sets can represent at most 256 characters, while UTF-8 can represent them all. For example, beside the normal quote character ', Unicode contains left and right curly versions of it, neither of which can be represented in Latin1 or CP850.
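For instance, the right single quote (U+2019) encodes fine in UTF-8 but has no byte in Latin1 or CP850:

"\u2019".encode("utf-8")    # b'\xe2\x80\x99'
"\u2019".encode("latin1")   # raises UnicodeEncodeError
"\u2019".encode("cp850")    # raises UnicodeEncodeError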

Long story short, there is no such thing as a universal encoding. But certain encodings, Latin1 for example, have a special property: they can decode any byte. So if you declare a Latin1 encoding, no UnicodeDecodeError will ever be raised. It is just that if the file was actually UTF-8 encoded, a é will come out as Ã©, and the right single quote will decode to 'â\x80\x99', which displays as â on a Latin1 system and as â€™ on a cp1252 one.
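To see this in action:

data = "é’".encode("utf-8")   # b'\xc3\xa9\xe2\x80\x99'
data.decode("latin1")         # 'Ã©â\x80\x99' -- no error, just mojibake
data.decode("cp1252")         # 'Ã©â€™'
data.decode("ascii")          # raises UnicodeDecodeError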

Since you mention CP1252: it is a Windows variant of Latin1, but it does not share that property of being able to decode any byte (0x81, for example, is undefined in CP1252).
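A quick check:

b"\x81".decode("latin1")   # '\x81' -- Latin1 maps every byte to a character
b"\x81".decode("cp1252")   # raises UnicodeDecodeError: 0x81 is undefined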

The common way is to ask the people sending you CSV files to use an agreed-on encoding, and to try decoding with that encoding. Then you have two workarounds for badly encoded files. The first is the one proposed by CygnusX: try a sequence of encodings terminated with Latin1, for example encodings = ["utf-8-sig", "utf-8", "cp1252", "latin1"] (BTW, Latin1 is an alias for ISO-8859-1, so there is no need to test both), as in the sketch below.
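A minimal sketch of that fallback chain (read_csv_any is a hypothetical helper name):

import pandas as pd

def read_csv_any(path, encodings=("utf-8-sig", "utf-8", "cp1252", "latin1")):
    # Latin1 comes last: it decodes any byte, so the loop always succeeds.
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except (UnicodeDecodeError, UnicodeError):
            continue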

The second one is to open the file with errors='replace': any offending byte will be replaced by the Unicode replacement character (U+FFFD). At least all the ASCII characters will be correct:

import pandas as pd

with open(filename, encoding='utf-8-sig', errors='replace') as file:
    fd = pd.read_csv(file, other_parameters...)
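For instance, a Latin1-encoded é read back as UTF-8 with errors='replace': the offending byte is lost, but the parse never fails.

b"caf\xe9".decode("utf-8", errors="replace")   # 'caf\ufffd' -- é lost, rest intact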

Upvotes: 3
