Reputation: 20895
I am scraping a web page with lxml. At one point, I get the content of a table cell.
# get last name
lastNameContainer = tableRow.xpath('./td[@class="lastName"]')
lastName = lastNameContainer[0].text
Unfortunately, one table cell has a character outside of ASCII's range, which produces this error.
UnicodeEncodeError: 'ascii' codec can't encode characters in position 5-7: ordinal not in range(128)
I tried adding this to the top of my Python file to no avail.
#!/usr/bin/python
# -*- coding: utf-8 -*-
How can I get around this problem? I still want to store this character. This character, by the way, is either ♀ or ♂ depending on the table row.
Update: I realized that the error is triggered when I write the data to a file:
with open('myData.txt', 'w') as myFile:
    myFile.write(lastName + '\n')
Oddly, this still produces the above error.
with open('myData.txt', 'w') as myFile:
    myFile.write(lastName.decode('utf-8') + '\n')
Upvotes: 1
Views: 834
Reputation: 9704
lxml needs its strings as unicode. When I get that exception, I resolve it by calling decode('utf-8') on the byte string, e.g.:
E.doc('♀'.decode('utf-8'))
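To make the direction of each call concrete, here is a minimal sketch (not taken from the question's code; the variable names are made up for illustration). It assumes the cell content arrives as UTF-8 bytes: decode turns bytes into text, encode turns text back into bytes, and encoding a non-ASCII character with the 'ascii' codec is what raises UnicodeEncodeError.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample data: the UTF-8 byte sequence for the female sign.
raw = b'\xe2\x99\x80'

# decode: bytes -> text (unicode)
text = raw.decode('utf-8')

# encode: text -> bytes, here back to the original UTF-8 sequence
back = text.encode('utf-8')

# Encoding the same text as ASCII fails, because the character
# has no ASCII representation.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

On Python 2, calling decode('utf-8') on a value that is already unicode first encodes it implicitly with the ASCII codec, which is why that call can raise the same UnicodeEncodeError.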
Updated:
with open('myData.txt', 'w') as myFile:
    myFile.write(lastName + '\n')
Oddly, this still produces the above error.
with open('myData.txt', 'w') as myFile:
    myFile.write(lastName.decode('utf-8') + '\n')
Also notice that if lastName is already unicode and you want to write a UTF-8 encoded file, you need to convert it back to bytes with lastName.encode('utf-8'):
with open('myData.txt', 'w') as myFile:
    myFile.write(lastName.encode('utf-8') + '\n')
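As an alternative to encoding by hand, the file can be opened with an explicit encoding so that unicode text is converted on write. The following is only a sketch under that assumption: io.open is in the standard library on both Python 2.6+ and Python 3, while the helper names, sample names, and temporary path are hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_names(path, names):
    """Write each unicode name on its own line, encoded as UTF-8."""
    with io.open(path, 'w', encoding='utf-8') as out:
        for name in names:
            out.write(name + u'\n')


def read_names(path):
    """Read the file back as unicode text, one name per line."""
    with io.open(path, 'r', encoding='utf-8') as src:
        return [line.rstrip(u'\n') for line in src]


# Hypothetical sample rows containing the two symbols from the question.
sample = [u'Smith \u2640', u'Jones \u2642']
demo_path = os.path.join(tempfile.mkdtemp(), 'myData.txt')
write_names(demo_path, sample)
round_trip = read_names(demo_path)
```

With this approach the write call receives unicode text only, so no implicit ASCII conversion takes place.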
Upvotes: 1