Reputation: 65
Suppose I read an HTML page and get a list of names, such as: 'Amiel, Henri-Frédéric'.
To get the list of names, I decode the HTML using the following code:
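import urllib

# t is presumably an HTMLParser subclass instance created earlier (not shown)
f = urllib.urlopen("http://xxx.htm")
html = f.read()
html = html.decode('utf8')  # raw bytes -> unicode
t.feed(html)
t.close()
lista = t.data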
At this point, the variable lista contains a list of names like:
[u'Abatantuono, Diego', ... , u'Amiel, Henri-Frédéric']
Now I would like to:
For simplicity, let's consider just the name above to complete steps 1 to 3. I would use the following code:
import pandas as pd

name = u'Amiel, Henri-Fr\xe9d\xe9ric'
name = name.encode('utf8')
array = [name]
df = pd.DataFrame({'Names': array})
df.to_csv('names')
uni = pd.read_csv('names')
uni  # trying to read the csv file in a DataFrame
At this point I get the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 67: invalid continuation byte
If I substitute the last row of the above code with:
print uni
I can read the DataFrame, but I don't think this is the right way to handle the issue.
I have read many questions posted by other users on this topic, but I have not managed to solve this one.
Upvotes: 5
Views: 20954
Reputation: 80346
Both the to_csv method and the read_csv function take an encoding argument. Use it, and work with unicode internally. If you don't, the encoding and decoding scattered through your program will eventually trip you up.
import pandas as pd
name = u'Amiel, Henri-Fr\xe9d\xe9ric'
array = [name]
df = pd.DataFrame({'Names':array})
df.to_csv('names', encoding='utf-8')
uni = pd.read_csv('names', index_col = [0], encoding='utf-8')
print uni # for me it works with or without print
out:
Names
0 Amiel, Henri-Frédéric
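To illustrate the "unicode internally, encode only at the boundaries" advice, here is a minimal sketch using only the standard library instead of pandas (a swapped-in technique, written for Python 3). The helper names write_names, read_names and decode_error_message are hypothetical, as is the temporary file path. The last part shows one plausible way the error from the question can arise: bytes encoded as Latin-1 being decoded as UTF-8. That cause is an assumption, not something confirmed in the thread.

```python
import csv
import io
import os
import tempfile

# Text stays unicode inside the program; no manual .encode() calls here.
name = u'Amiel, Henri-Fr\xe9d\xe9ric'


def write_names(path, names):
    """Write names to a CSV file, encoding to UTF-8 only at the file boundary."""
    with io.open(path, 'w', encoding='utf-8', newline='') as fh:
        writer = csv.writer(fh)
        writer.writerow([u'Names'])
        for n in names:
            writer.writerow([n])


def read_names(path):
    """Read the names back, decoding from UTF-8 only at the file boundary."""
    with io.open(path, 'r', encoding='utf-8', newline='') as fh:
        rows = list(csv.reader(fh))
    return [row[0] for row in rows[1:]]  # skip the header row


def decode_error_message(raw):
    """Return the UnicodeDecodeError message for raw bytes, or None if they decode."""
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# Hypothetical location for the example file.
path = os.path.join(tempfile.mkdtemp(), 'names.csv')
write_names(path, [name])
round_trip = read_names(path)

# Assumed cause of the original error: Latin-1 bytes treated as UTF-8.
latin1_bytes = name.encode('latin-1')
error = decode_error_message(latin1_bytes)
print(error)
```

Because the encoding is named explicitly on both the write and the read, the round trip returns the original unicode string, while the mismatched Latin-1 bytes fail with an "invalid continuation byte" message like the one in the question.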
Upvotes: 9