JJJohn

Reputation: 1089

Pandas dataframe imports and renders incorrectly and causes UnicodeDecodeError

I am trying to import a CSV file that contains Chinese characters.

This command downloads the CSV file:

!wget -O wm.csv https://raw.githubusercontent.com/hierarchyJK/compare-LIBSVM-with-Linear-and-Gassian-Kernel/master/%E8%A5%BF%E7%93%9C3.0.csv

The repository is not mine, so I am not sure whether the file is encoded correctly.

What I am sure of is that it renders correctly.

This code

pd.read_csv('wm.csv',encoding = 'utf-8')

raises this error:

'utf-8' codec can't decode byte 0xb1 in position 0: invalid start byte

I searched for this error but didn't find an appropriate root cause or solution.

This code runs without error

pd.read_csv('wm.csv',encoding = 'cp1252')

but the result is garbled.


The system itself renders Chinese characters correctly.


The same happens with Python's built-in open:

with open('wm.csv', 'r', encoding='cp1252') as f:
    for line in f.readlines():
        print(line)
        break

This code also prints garbled text, without any warning or error:

±àºÅ,É«Ôó,¸ùµÙ,ÇÃÉù,ÎÆÀí,Æ겿,´¥¸Ð,ÃܶÈ,º¬ÌÇÂÊ,ºÃ¹Ï,Ðò¹Øϵ

Upvotes: 1

Views: 76

Answers (3)

R.A.Munna

Reputation: 1709

You should use encoding="GBK". Hope this helps.

df = pd.read_csv('wm.csv', encoding="GBK")
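To illustrate why UTF-8 fails while GBK works, here is a minimal sketch using only the standard library. The sample bytes are an assumption: they are built by encoding two plausible header fields as GBK, not read from the actual file. The first byte comes out as 0xb1, matching the byte named in the error message.

```python
import csv
import io

# Assumed sample data: header fields encoded as GBK (not the real file).
raw = "编号,色泽\n1,青绿\n".encode("gbk")

# UTF-8 rejects the very first byte (0xb1), as in the question's error.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# Decoding as GBK recovers the text, which then parses as ordinary CSV.
rows = list(csv.reader(io.StringIO(raw.decode("gbk"))))

print(utf8_error)
print(rows)
```

Passing the same encoding name to pd.read_csv, as in the line above, applies this decoding step while reading the file.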

For more details, check HERE.

Upvotes: 1

JJSSEE

Reputation: 29

Here is a link to all of the standard encodings. Latin_1 has worked well for me when I have had issues, but in your case you can try utf_16_be. Good luck!

Standard Encodings
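One caveat worth showing: a single-byte encoding such as latin_1 maps every byte to some character, so it never raises, but it does not necessarily produce the right text. The sketch below assumes the file's bytes are GBK (the sample is constructed, not read from the file) and shows that the garbled string can be converted back, and that utf_16_be decodes these particular bytes to unrelated characters.

```python
# Assumed sample: two Chinese characters encoded as GBK.
raw = "编号".encode("gbk")

# latin_1 decodes any byte sequence without error, producing mojibake.
garbled = raw.decode("latin_1")

# No information is lost, so re-encoding and decoding as GBK recovers the text.
recovered = garbled.encode("latin_1").decode("gbk")

# utf_16_be also decodes these four bytes, but to different characters.
wrong_utf16 = raw.decode("utf_16_be")

print(garbled, recovered, wrong_utf16)
```

A decode that succeeds is therefore not proof that the encoding is the right one; the result still has to be checked by eye.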

Upvotes: 0

NickHilton

Reputation: 682

The encoding is 'GB18030'. I found this by opening the file in a text editor and checking the suggested encoding. GitHub also shows you the encoding when you go to the GitHub link and click on edit file.
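If no editor is at hand, a rough programmatic check is to try a short list of candidate encodings in order. The helper name first_working_encoding and the candidate list are hypothetical, chosen for illustration. This is a sketch, not a real detector: it only reports the first candidate that decodes without error, so permissive single-byte encodings should come last. GB18030 is a superset of GBK, so either name should work for GBK-encoded bytes.

```python
def first_working_encoding(data, candidates):
    """Return the first candidate that decodes data without error, else None.

    Hypothetical helper for illustration; success does not prove correctness.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Assumed sample bytes standing in for the start of the file.
sample = "编号,色泽".encode("gb18030")
guess = first_working_encoding(sample, ["utf-8", "gb18030", "cp1252"])
print(guess)
```

For the real file, the bytes could be read with open('wm.csv', 'rb') and passed in place of the sample.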

Upvotes: 1
