Reputation: 523
I'm trying to read a .csv file that contains UTF-8 data in some of its columns into a pandas DataFrame. The code is as follows:
df = pd.read_csv('Cancer_training.csv', encoding='utf-8')
With different files I then get errors such as:
(1) 'utf-8' codec can't decode byte 0xcf in position 14: invalid continuation byte
(2) 'utf-8' codec can't decode byte 0xc9 in position 3: invalid continuation byte
Could you please share your ideas and experience with this problem? Thank you.
[python: 3.4.1.final.0, pandas: 0.14.1]
Sample of the raw data (I cannot post a full record because of legal restrictions on the medical data):
Upvotes: 6
Views: 9041
Reputation: 63
It explains why it happens and gives a few workarounds.
If that doesn't work, please post a sample data record so that I can better understand the problem.
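If a full record can't be shared, one option is to post only the few bytes around the point where decoding fails. The sketch below is just an illustration: `bytes_around_first_error` is a hypothetical helper name, not a pandas function, and it assumes the file is small enough to read into memory as bytes.

```python
def bytes_around_first_error(raw, context=8):
    """Hypothetical helper: locate the first byte that is not valid UTF-8.

    Returns (offset, hex string of the surrounding bytes), or None when
    the whole input decodes cleanly as UTF-8.
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the offending byte within raw
        start = max(err.start - context, 0)
        return err.start, raw[start:err.start + context].hex()
    return None
```

For a real file, something like `bytes_around_first_error(open('Cancer_training.csv', 'rb').read())` would give an offset and a short hex string that can be posted without revealing the record itself.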
Please mark as accepted if it works.
Upvotes: 0
Reputation: 71
I also did as Irh09 proposed, but the second file it read was decoded incorrectly, and a column whose name contains accented characters (á, é, í, ó, ú) could not be found.
So I recommend handling the error like this:
try:
    df = pd.read_csv('file', encoding="utf-8")
except UnicodeDecodeError:
    print("Couldn't load as utf-8")
    df = pd.read_csv('file', encoding="ISO-8859-1")
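The same idea can be extended to more than two encodings. This is only a sketch under some assumptions: `first_working_encoding` is a hypothetical helper (not part of pandas), the candidate list is a guess that should be adapted to where the files come from, and the whole file is read into memory once to test it.

```python
def first_working_encoding(path, candidates=("utf-8", "cp1252", "ISO-8859-1")):
    """Hypothetical helper: return the first candidate that decodes the file.

    ISO-8859-1 goes last because it accepts every possible byte, so it
    never fails and would hide any candidate listed after it.
    """
    with open(path, "rb") as handle:
        raw = handle.read()
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the file, try the next one
        return encoding
    raise ValueError("none of the candidate encodings could decode " + repr(path))
```

The result can then be passed on, for example `df = pd.read_csv('file', encoding=first_working_encoding('file'))`. Note that a clean decode only means no error was raised; it does not prove the guessed encoding is the one the file was written in, so it is still worth checking the accented columns afterwards.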
Upvotes: 1
Reputation: 587
I had this problem for no apparent reason. I managed to get it working with this:
df = pd.read_csv('file', encoding = "ISO-8859-1")
I'm not sure why it works, though.
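A likely explanation: ISO-8859-1 assigns a character to every one of the 256 byte values, so decoding with it can never raise, whereas UTF-8 rejects a byte like 0xc9 unless it is followed by a continuation byte. The small sketch below uses made-up sample bytes (not the original data) to show both the error and the catch, which is that ISO-8859-1 will also silently mis-decode text that really is UTF-8.

```python
# Made-up sample: 0xc9 is "É" in ISO-8859-1, here followed by a plain comma.
raw = b"nom\xc9,1"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    reason = err.reason      # the message part after the colon
    position = err.start     # offset of the offending byte

# ISO-8859-1 maps each byte straight to the code point of the same value.
latin1_text = raw.decode("ISO-8859-1")
all_bytes_decode = len(bytes(range(256)).decode("ISO-8859-1")) == 256

# The downside: genuine UTF-8 input is accepted too, but comes out garbled.
mojibake = "É".encode("utf-8").decode("ISO-8859-1")
```

So if the accented characters look right after reading with ISO-8859-1, the file was probably saved in that encoding (or the closely related cp1252) rather than UTF-8.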
Upvotes: 4