A Rob4

Reputation: 1488

Can't open csv file in pandas due to unicode decoding error

I saved a pandas dataframe as a csv using

df_to_save.to_csv(save_file_path)

but when I read it back in using

df_temp = pd.read_csv(file_path)

I get an error message saying

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 158: invalid start byte

I've tried forcing the encoding to utf-8 when reading the csv file with

df_temp = pd.read_csv(file_path, index_col=False, encoding="utf-8", sep=',')

Really stuck, can anyone help?

Many thanks

Upvotes: 2

Views: 3145

Answers (3)

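Alternatively, to avoid encoding problems, use Excel (read_excel also returns a DataFrame):

# The context manager saves and closes the file when the block ends.
with pd.ExcelWriter('train_numeric.xlsx') as writer:
    newTRAIN.to_excel(writer, sheet_name='Sheet1')

Then read it back: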

newTEST_excel = pd.read_excel('train_numeric.xlsx')
newTEST_excel.head(2)

Upvotes: 0

MMF

Reputation: 5921

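Change the encoding of your categorical data:

def my_func(df):
    # Only object (string) columns have the .str accessor; numeric columns would raise.
    for col in df.select_dtypes(include='object').columns:
        # Assumes the values are ISO-8859-1 byte strings (as in Python 2).
        df[col] = df[col].str.decode('iso-8859-1').str.encode('utf-8')

This function changes the encoding of your categorical data in place.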

Upvotes: 4

user2285236

Reputation:

The byte 0xbf is not valid UTF-8 on its own, so the file is probably not encoded in UTF-8.

You can reproduce the error with:

b'\xbf'.decode("utf-8", "strict")
Traceback (most recent call last):

  File "<ipython-input-7-4db5a43b4577>", line 1, in <module>
    b'\xbf'.decode("utf-8", "strict")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 0: invalid start byte

You can try a different encoding. ISO-8859-1, for example, decodes this byte without error:

b'\xbf'.decode("ISO-8859-1", "strict")
Out: '¿'
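
For context on why the codec reports an "invalid start byte": in UTF-8, bytes of the form 10xxxxxx are continuation bytes and may only follow a lead byte, while ISO-8859-1 maps every single byte to one character. A minimal sketch using only the standard library (the variable names are illustrative, not from the question):

```python
byte = b"\xbf"

# 0xbf is 0b10111111: the top two bits are "10", which marks a UTF-8 continuation byte.
is_continuation = (byte[0] & 0b11000000) == 0b10000000

# In UTF-8 the character U+00BF takes two bytes: lead byte 0xc2, then 0xbf.
utf8_bytes = "\u00bf".encode("utf-8")

# In ISO-8859-1 the same character is the single byte 0xbf.
latin1_bytes = "\u00bf".encode("ISO-8859-1")

print(is_continuation, utf8_bytes, latin1_bytes)
# True b'\xc2\xbf' b'\xbf'
```

A lone 0xbf with no lead byte in front of it is therefore consistent with a file written in a single-byte encoding such as ISO-8859-1.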

So your read_csv would change to:

df_temp = pd.read_csv(file_path, index_col=False, encoding="ISO-8859-1") 
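
If you are not sure which encoding the file uses, one option is to test a few candidates on the raw bytes before calling read_csv. The sketch below is only an illustration: pick_encoding and the candidate list are hypothetical names, not part of pandas, and the demo file is made up.

```python
import os
import tempfile


def pick_encoding(path, candidates=("utf-8", "ISO-8859-1")):
    """Return the first candidate encoding that decodes the whole file."""
    with open(path, "rb") as fh:
        raw = fh.read()
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the file, try the next one
        return name
    raise ValueError("none of the candidate encodings could decode the file")


# Demo: a tiny CSV containing the byte 0xbf, which is not valid UTF-8 on its own.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"question,answer\n\xbfQu\xe9?,s\xed\n")
    demo_path = tmp.name

detected = pick_encoding(demo_path)
os.remove(demo_path)
print(detected)
# ISO-8859-1
```

The result can then be passed on as pd.read_csv(file_path, encoding=detected). Note that ISO-8859-1 accepts every possible byte, so it never raises; if the file was really written in some other single-byte encoding, the read will succeed but some characters may come out wrong.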

Upvotes: 3
