Guga

Reputation: 349

How to use the appropriate encoding when reading csv in Pandas?

I am trying to load datasets from FDIC. Every quarter FDIC releases a zip file that contains around 62 csv files with names like the following:

All_Reports_20080331_Assets and Liabilities.csv,
All_Reports_20080331_Bank Assets Sold and Securitized.csv, 
etc

I have downloaded all the files into a directory like the following:

C:\projects\FDIC\All_Reports_20080331

Since zip files from many different quarters are available, I am preparing a loop that will run over many paths (each one representing a quarter with 62 csv files inside). Before getting to the loop, however, loading a single file fails with a utf_8 error.

import pandas as pd
path = r"C:\projects\FDIC\All_Reports_20080331"
file = r"\All_Reports_20080331_Assets and Liabilities.csv"
df_assets_and_liab = pd.read_csv(path+file)

gives me the following error:

'utf-8' codec can't decode byte 0xfc in position 5: invalid start byte

I tried passing encoding="utf_8" to pandas.read_csv, but the error message is the same.

Any ideas on how to load these files with pandas? Thanks a lot!

PS: the folder with the 62 csv files can be found here: FDIC Website

Upvotes: 0

Views: 1094

Answers (1)

giser_yugang

Reputation: 6166

First, check which encoding the file uses:

import chardet
with open(path+file,"rb") as f:
    data = f.read()
    print(chardet.detect(data))

{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
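To see why that guess fits the error message, here is a minimal sketch using only the standard library. It takes the single byte 0xfc reported in the question and shows that UTF-8 rejects it while ISO-8859-1 decodes it. The variable names are only for illustration, and a successful ISO-8859-1 decode is not proof of the true encoding, since that codec accepts every byte value.

```python
# The byte reported in the error message from the question.
raw = b"\xfc"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0xfc cannot begin a valid UTF-8 sequence, so decoding fails here.
    print("utf-8 failed:", exc.reason)

# ISO-8859-1 maps each byte value to exactly one character,
# so decoding never raises; 0xfc becomes U+00FC.
text = raw.decode("ISO-8859-1")
print(repr(text))
```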

Then pass the detected encoding to read_csv:

df_assets_and_liab = pd.read_csv(path+file, encoding='ISO-8859-1')
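For the loop over a whole quarter's directory, something like the following sketch might help. It assumes each file is either UTF-8 or ISO-8859-1, which I have not verified for every FDIC file. The helper names pick_encoding and encodings_for_directory are hypothetical, not part of pandas or chardet. ISO-8859-1 has to come last in the candidate list because it never raises a decode error.

```python
from pathlib import Path


def pick_encoding(csv_path, candidates=("utf-8", "ISO-8859-1")):
    """Return the first candidate encoding that decodes the whole file.

    Hypothetical helper: the candidate list is an assumption, not something
    the FDIC documents.
    """
    raw = Path(csv_path).read_bytes()
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
        return encoding
    raise ValueError(f"none of {candidates} can decode {csv_path}")


def encodings_for_directory(directory):
    """Map each CSV file name in a directory to an encoding that decodes it."""
    csv_files = sorted(Path(directory).glob("*.csv"))
    return {p.name: pick_encoding(p) for p in csv_files}
```

Each value in the returned dictionary can then be passed as the encoding argument of pd.read_csv for the matching file.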

Upvotes: 1
