Reputation: 5473
I used df.to_csv() to convert a dataframe to a CSV file. Under Python 3, the pandas docs state that it defaults to utf-8 encoding.
However, when I run pd.read_csv() on the same file, I get the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 8: invalid start byte
But using pd.read_csv() with encoding="ISO-8859-1" works.
What is the issue here and how do I resolve it so I can write and load files with consistent encoding?
Upvotes: 2
Views: 9262
Reputation: 862
Please try to read the data using encoding='unicode_escape'.
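For anyone weighing this suggestion, here is a minimal sketch of what the unicode_escape codec does at the byte level. It uses only the standard library, and the sample bytes are made up for illustration rather than taken from the asker's file. It suggests that the codec avoids the error because it maps high bytes the way Latin-1 does, but that it also interprets backslash sequences, which may alter data containing literal backslashes.

```python
# Sample bytes are illustrative, not taken from the asker's file.
raw = b"Acme\xae tools"  # 0xAE is the registered sign in ISO-8859-1

# unicode_escape maps non-escape bytes the same way Latin-1 does,
# so decoding does not fail on 0xAE.
as_escape = raw.decode("unicode_escape")
as_latin1 = raw.decode("latin-1")
print(as_escape == as_latin1)

# Side effect: backslash sequences in the data are interpreted as escapes.
path_bytes = b"C:\\new\\table.csv"  # the bytes for C:\new\table.csv
mangled = path_bytes.decode("unicode_escape")
print(repr(mangled))  # the \n and \t have become a newline and a tab
```

If the file is really ISO-8859-1 or Windows-1252, naming that encoding directly is probably the safer choice.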
Upvotes: 3
Reputation: 584
Here is a concrete example of pandas using some unknown(?) encoding when the encoding parameter is not passed explicitly to DataFrame.to_csv.
Byte 0x92 is ’ (the right single quotation mark in Windows-1252, which looks like an apostrophe).
import pandas
ERRORFILE = r'written_without_encoding_parameter.csv'
NO_ERRORFILE = r'written_WITH_encoding_parameter.csv'
df_dummy = pandas.DataFrame([u"Yo what's up", u"I like your sister’s friend"])
df_dummy.to_csv(ERRORFILE)
df_dummy.to_csv(NO_ERRORFILE, encoding="utf-8")
df_no_error_with_latin = pandas.read_csv(ERRORFILE, encoding="Latin-1")
df_no_error = pandas.read_csv(NO_ERRORFILE)
df_error = pandas.read_csv(ERRORFILE)
# Raises: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
So it looks like you have to explicitly use encoding="utf-8" with to_csv, even though the pandas docs say it uses this by default. Or use encoding="Latin-1" with read_csv.
Even more frustrating...
df_error_even_with_utf8 = pandas.read_csv(ERRORFILE, encoding="utf-8")
# Raises: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
I am using Windows 7, Python 3.5, pandas 0.19.2.
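A minimal sketch of the workaround described above, stating the encoding on both the write and the read so that neither side depends on a default. The file name and column name are hypothetical, and the file goes to a temporary directory. One plausible explanation for the behaviour, which is an assumption and not confirmed in this thread, is that this pandas version on Windows opened the output file with the locale's preferred encoding (often cp1252) when no encoding was given; the first print shows what that encoding is on the current machine.

```python
import locale
import os
import tempfile

import pandas as pd

# Encoding a bare open() would pick on this machine (often cp1252 on Windows).
print(locale.getpreferredencoding(False))

df = pd.DataFrame({"text": ["Yo what's up", "I like your sister\u2019s friend"]})

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "explicit_utf8.csv")

# State the encoding on both sides so the write and the read agree.
df.to_csv(path, index=False, encoding="utf-8")
round_trip = pd.read_csv(path, encoding="utf-8")

# Inspect the raw bytes: U+2019 should appear as the UTF-8 sequence E2 80 99,
# not as the single cp1252 byte 0x92.
with open(path, "rb") as fh:
    raw = fh.read()
print(b"\xe2\x80\x99" in raw, b"\x92" in raw)
```

Checking the raw bytes like this is a quick way to confirm which encoding actually ended up on disk, independent of what any reader assumes.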
Upvotes: 0
Reputation: 42905
The original .csv you are trying to read is encoded in e.g. ISO-8859-1. That's why you get a UnicodeDecodeError: Python / pandas is trying to decode the source using the utf-8 codec, assuming by default that the source is UTF-8 encoded.
Once you indicate the non-default source encoding, pandas will use the proper codec to match the source and decode into the format used internally.
See python docs and more here. Also very good.
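The mechanism can be reproduced without pandas. The following is a small standard-library sketch with an invented sample string (not the asker's data): the same text is encoded as ISO-8859-1, decoding those bytes as UTF-8 fails with the error class from the question, and decoding with the matching codec recovers the text.

```python
text = "Acme\u00ae Inc"  # illustrative sample containing the registered sign

latin1_bytes = text.encode("iso-8859-1")  # the sign becomes the single byte 0xAE
utf8_bytes = text.encode("utf-8")         # the sign becomes the two bytes C2 AE

try:
    latin1_bytes.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc  # 0xAE on its own is not a valid UTF-8 start byte
print(error)

# Decoding with the codec that matches the bytes recovers the original text.
decoded = latin1_bytes.decode("iso-8859-1")
print(decoded == text)
```

Passing encoding="ISO-8859-1" to read_csv has the same effect as the last step: it tells pandas which codec matches the bytes on disk.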
Upvotes: 2