Reputation: 1488
I saved a pandas dataframe as a csv using
df_to_save.to_csv(save_file_path)
but when I read it back in using
df_temp = pd.read_csv(file_path)
I get an error message saying
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 158: invalid start byte
I've tried forcing the encoding to UTF-8 when reading the CSV file with
df_temp = pd.read_csv(file_path, index_col=False, encoding="utf-8",sep=',')
Really stuck, can anyone help?
Many thanks
Upvotes: 2
Views: 3145
Reputation: 101
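Alternatively, to avoid encoding problems, use Excel (read_excel also returns a DataFrame):
# The with block saves and closes the file; without it the workbook is never written.
with pd.ExcelWriter('train_numeric.xlsx') as writer:
    newTRAIN.to_excel(writer, sheet_name='Sheet1')
Then read it back:
newTEST_excel = pd.read_excel('train_numeric.xlsx')
newTEST_excel.head(2)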
Upvotes: 0
Reputation: 5921
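Change the encoding of your categorical data:
def my_func(df):
    # Only object (string-like) columns support the .str accessor.
    for col in df.select_dtypes(include="object").columns:
        # Assumes the values are ISO-8859-1 encoded byte strings.
        df[col] = df[col].str.decode('iso-8859-1').str.encode('utf-8')
This function changes the encoding of your categorical data in place.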
Upvotes: 4
Reputation:
The byte 0xbf is not a valid start byte in UTF-8, so the file is not UTF-8 encoded.
You can reproduce the error with:
b'\xbf'.decode("utf-8", "strict")
Traceback (most recent call last):
File "<ipython-input-7-4db5a43b4577>", line 1, in <module>
b'\xbf'.decode("utf-8", "strict")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 0: invalid start byte
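To see which part of the real file triggers the error, you can decode the raw bytes yourself and inspect what surrounds the failing position. This is only a sketch: the helper name find_bad_utf8_byte, the context parameter, and the sample bytes are made up for illustration.

```python
def find_bad_utf8_byte(data: bytes, context: int = 10):
    """Return (position, surrounding bytes) of the first invalid UTF-8 byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start and exc.end delimit the bytes that could not be decoded.
        start = max(exc.start - context, 0)
        return exc.start, data[start:exc.end + context]
    return None


# Hypothetical sample standing in for the contents of the CSV file.
sample = b"name,city\nAna,\xbfD\xf3nde?\n"
result = find_bad_utf8_byte(sample)
print(result)
```

For the actual file, read it in binary mode first, for example with open(file_path, "rb"), and pass the bytes to the helper.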
You can try a different encoding, that would solve the problem for this character:
b'\xbf'.decode("ISO-8859-1", "strict")
Out: '¿'
So your read_csv call would change to:
df_temp = pd.read_csv(file_path, index_col=False, encoding="ISO-8859-1")
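If you are not sure which encoding the file uses, one option is to try a few candidates in order and use the first that decodes cleanly. The sketch below uses only the standard library. The function name detect_encoding and the candidate list are assumptions, not something pandas provides. Note that ISO-8859-1 accepts every byte value, so it goes last and only tells you the file is decodable, not that the characters are the right ones.

```python
def detect_encoding(path, candidates=("utf-8", "cp1252", "iso-8859-1")):
    """Return the first candidate encoding that decodes the whole file."""
    with open(path, "rb") as fh:
        raw = fh.read()
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the file; try the next one
        return enc
    raise ValueError("none of the candidate encodings could decode " + str(path))
```

You could then pass the result to pandas, e.g. pd.read_csv(file_path, encoding=detect_encoding(file_path)), and check a few rows by eye to confirm the accented characters look right.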
Upvotes: 3