Clinical

Reputation: 523

How to read a .csv file that contains UTF-8 values into a pandas DataFrame

I'm trying to read a .csv file that contains UTF-8 data in some of its columns, using pandas. The code is as follows:

df = pd.read_csv('Cancer_training.csv', encoding='utf-8')

I then got errors like the following with different files:

(1) 'utf-8' codec can't decode byte 0xcf in position 14: invalid continuation byte

(2) 'utf-8' codec can't decode byte 0xc9 in position 3: invalid continuation byte

Could you please share your ideas and experience with this problem? Thank you.

[python: 3.4.1.final.0, pandas: 0.14.1]

Sample of the raw data (I cannot post a full record because of legal restrictions on the medical data):

[Image: sample of the raw data]

Upvotes: 6

Views: 9041

Answers (3)

Try this: https://saturncloud.io/blog/how-to-fix-the-pandas-unicodedecodeerror-utf8-codec-cant-decode-bytes-in-position-01-invalid-continuation-byte-error/

It explains why it happens and gives a few workarounds.
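
To illustrate the cause: messages like the ones in the question appear when a byte such as 0xcf, which in UTF-8 must start a two-byte sequence, is followed by a byte that is not a valid continuation byte. That is typical of text saved in a single-byte encoding such as ISO-8859-1. Below is a minimal sketch using only the standard library. The helper name show_bad_bytes and the sample bytes are made up for illustration and are not taken from the question's files.

```python
def show_bad_bytes(raw, context=10):
    """Return (position, surrounding bytes) for the first invalid UTF-8
    sequence in raw, or None if raw is valid UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset reported as "position" in the message.
        lo = max(err.start - context, 0)
        return err.start, raw[lo:err.start + context]
    return None


# Made-up sample: "Ï" stored as the single ISO-8859-1 byte 0xcf.
sample = "id,name\n1,Ïris\n".encode("iso-8859-1")
result = show_bad_bytes(sample)
```

For a real file, the bytes could be read with open(path, 'rb').read() and passed to the helper. Looking at the bytes around the reported position often shows which characters were meant.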

If that doesn't work, please share a sample data record so that I can better understand the problem.

Please mark as accepted if it works.

Upvotes: 0

Andoni Aranguren

Reputation: 71

I also did what lrh09 proposed, but the second file I read was decoded incorrectly, and I couldn't find a column with accented characters (á, é, í, ó, ú).

So I recommend catching the error and trying UTF-8 first, like this:

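import pandas as pd

try:
    # Try UTF-8 first so correctly encoded files are decoded properly.
    df = pd.read_csv('file', encoding="utf-8")
except UnicodeDecodeError:
    print("Couldn't load as utf-8")
    # Fall back to ISO-8859-1, which accepts any byte sequence.
    df = pd.read_csv('file', encoding="ISO-8859-1")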
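
The same idea can be extended to more than two encodings. The following is only a sketch: the helper name first_working_encoding and the candidate list are my assumptions, not something pandas provides. It works on raw bytes with the standard library, and ISO-8859-1 is kept last because it never raises a decode error.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252", "iso-8859-1")):
    """Return the first encoding in candidates that decodes raw without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the data, try the next
        return name
    raise ValueError("none of the candidate encodings could decode the data")


utf8_guess = first_working_encoding("á".encode("utf-8"))
cp1252_guess = first_working_encoding("á".encode("cp1252"))
latin1_guess = first_working_encoding(b"\x81")  # byte undefined in cp1252
```

The returned name can then be passed as the encoding argument of pd.read_csv. Note that decoding without an error does not prove the encoding is the right one, so it is still worth checking a few accented values after loading.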

Upvotes: 1

lrh09

Reputation: 587

I had this problem for no apparent reason and managed to get it to work using this:

df = pd.read_csv('file', encoding = "ISO-8859-1")

I'm not sure why, though.
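
A likely explanation is that ISO-8859-1 assigns a character to every one of the 256 possible byte values, so decoding with it cannot raise an error. The price is that a file in some other encoding still loads, just with wrong characters. A small standard-library sketch of both effects (the variable names are only illustrative):

```python
# Every possible byte value decodes under ISO-8859-1, so it never fails.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")

# The catch: UTF-8 data read as ISO-8859-1 loads without error
# but comes out garbled.
garbled = "é".encode("utf-8").decode("iso-8859-1")
```

Here garbled is the two-character string "Ã©" instead of "é", which is the kind of silent mis-decoding to watch for when using this encoding as a catch-all.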

Upvotes: 4
