Reputation: 1089
I am trying to import a CSV file that contains Chinese characters.
This command downloads the CSV file:
!wget -O wm.csv https://raw.githubusercontent.com/hierarchyJK/compare-LIBSVM-with-Linear-and-Gassian-Kernel/master/%E8%A5%BF%E7%93%9C3.0.csv
The repository is not mine, so I am not sure whether the file is encoded correctly.
What I can be sure of is that it renders correctly.
This code
pd.read_csv('wm.csv',encoding = 'utf-8')
raises this error:
'utf-8' codec can't decode byte 0xb1 in position 0: invalid start byte
I've searched for this error but didn't find an appropriate root cause or solution.
This code executes without error
pd.read_csv('wm.csv',encoding = 'cp1252')
but renders garbled text.
The system itself renders Chinese characters correctly.
With Python's built-in open():
with open('wm.csv', 'r', encoding='cp1252') as f:
    for line in f.readlines():
        print(line)
        break
This code prints garbled text without any warning or error:
±àºÅ,É«Ôó,¸ùµÙ,ÇÃÉù,ÎÆÀí,Æ겿,´¥¸Ð,ÃܶÈ,º¬ÌÇÂÊ,ºÃ¹Ï,Ðò¹Øϵ
Upvotes: 1
Views: 76
Reputation: 1709
You should use encoding="GBK". Hope this helps.
df = pd.read_csv('wm.csv', encoding="GBK")
For more details, check HERE.
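To illustrate why GBK fits here, the following is a minimal sketch that works only from the garbled header shown in the question, not from the file itself. It assumes the first two garbled fields were copied intact; the variable names are just for illustration.

```python
# First two header fields as they appeared when read with cp1252 (from the question).
garbled = "±àºÅ,É«Ôó"

# Re-encoding with cp1252 recovers the bytes that are actually stored in the file.
raw = garbled.encode("cp1252")

# The first byte is 0xb1, which cannot start a UTF-8 sequence,
# so a strict UTF-8 decode raises the error from the question.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)

# Decoding the same bytes as GBK gives readable Chinese column names.
print(raw.decode("gbk"))
```

The same idea applies to the whole file: the bytes are fine, and only the codec used to interpret them changes.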
Upvotes: 1
Reputation: 29
Here is a link with all of the standard encodings. Latin_1 has worked well for me when I have had issues, but in your case you can try utf_16_be. Good luck!
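One caveat about Latin_1, shown as a small sketch: it maps every possible byte to a character, so it never raises a decode error, but that does not mean the result is the intended text. The four bytes below are assumed to be the start of the file, reconstructed from the garbled output in the question.

```python
# latin_1 defines a character for every byte value 0-255, so decoding never fails.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin_1")
print(len(decoded))

# Assumed first four bytes of the file (reconstructed from the garbled header).
head = b"\xb1\xe0\xba\xc5"

# No error is raised, but the output is mojibake rather than Chinese text.
print(head.decode("latin_1"))
```

So a successful read with Latin_1 only shows that the file contains bytes, not that the encoding guess is right.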
Upvotes: 0
Reputation: 682
The encoding is 'GB18030'. I found this by opening the file in a text editor and checking the suggested encoding. GitHub also shows you the encoding when you go to the GitHub link and click on edit file.
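If no editor is at hand, a rough programmatic check is to try a few candidate encodings in strict mode and keep the first one that decodes cleanly. This is only a sketch: the function name first_strict_decoding and the candidate list are hypothetical, and a clean decode is a hint, not proof, since several encodings can accept the same bytes.

```python
def first_strict_decoding(data, candidates=("utf-8", "gb18030")):
    """Return the first candidate that decodes data without error, or None."""
    for name in candidates:
        try:
            data.decode(name)  # strict error handling is the default
        except UnicodeDecodeError:
            continue
        return name
    return None

# Stand-in for the file contents; with the real file one could instead read
# the bytes via open('wm.csv', 'rb').read().
sample = "编号,色泽".encode("gb18030")
print(first_strict_decoding(sample))
```

Putting utf-8 first is deliberate, because it rejects most non-UTF-8 input, while permissive codecs such as latin_1 should be left out of the list or placed last.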
Upvotes: 1