dangerChihuahua007

Reputation: 20895

How do I get around unsupported characters while web scraping?

I am scraping a web page with lxml. At one point, I get the content of a table cell.

# get last name
lastNameContainer = tableRow.xpath('./td[@class="lastName"]')
lastName = lastNameContainer[0].text

Unfortunately, one table cell has a character outside of ASCII's range, which produces this error.

UnicodeEncodeError: 'ascii' codec can't encode characters in position 5-7: ordinal not in range(128)

I tried adding this to the top of my Python file to no avail.

#!/usr/bin/python
# -*- coding: utf-8 -*-

How can I get around this problem? I still want to store this character. This character, by the way, is either ♀ or ♂ depending on the table row.


Update: I realized that the error is triggered when I write the data to a file:

with open('myData.txt', 'w') as myFile:
    myFile.write(lastName + '\n')

Oddly, this variant still produces the same error:

with open('myData.txt', 'w') as myFile:
    myFile.write(lastName.decode('utf-8') + '\n')

Upvotes: 1

Views: 834

Answers (1)

Diego Navarro

Reputation: 9704

lxml expects its strings as unicode. When I get that exception, I resolve it by calling decode('utf-8') on the byte string.

e.g.: E.doc('♀'.decode('utf-8'))
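To make the bytes-versus-unicode distinction concrete, here is a small sketch of the round trip between UTF-8 bytes and unicode text. The variable names are illustrative only and do not come from the question, and the literals are written so the snippet should behave the same on Python 2 and Python 3.

```python
# UTF-8 bytes for the female sign, as they might arrive from disk or the network.
raw_bytes = b'\xe2\x99\x80'

# decode: bytes -> unicode text
text = raw_bytes.decode('utf-8')

# encode: unicode text -> bytes
round_tripped = text.encode('utf-8')

# ASCII cannot represent this character, which is the failure
# reported by the UnicodeEncodeError in the question.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The rule of thumb this illustrates: decode bytes on the way in, work with unicode internally, and encode on the way out.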

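Updated: Regarding the two attempts in your update:

with open('myData.txt', 'w') as myFile:
    myFile.write(lastName + '\n')

and

with open('myData.txt', 'w') as myFile:
    myFile.write(lastName.decode('utf-8') + '\n')

If lastName is already a unicode object, both fail for the same reason. Writing it to a file opened this way makes Python 2 encode it implicitly with the ASCII codec, and calling decode on a unicode object triggers that same implicit ASCII encoding first. To write a UTF-8 encoded file, convert it back to bytes with lastName.encode('utf-8'):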

with open('myData.txt', 'w') as myFile:
    myFile.write(lastName.encode('utf-8') + '\n')
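As an alternative to encoding by hand, a file can be opened with an explicit encoding so that unicode text is encoded on write. The following is a minimal sketch using io.open from the standard library, not the approach shown above; the sample value and the temporary output path are hypothetical stand-ins for the scraped data and for myData.txt.

```python
import io
import os
import tempfile

# Hypothetical sample value standing in for the scraped cell text.
last_name = u'Smith \u2640'

# Hypothetical output path; a temp directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), 'myData.txt')

# io.open with an encoding accepts unicode text and encodes it to UTF-8 on write.
with io.open(path, 'w', encoding='utf-8') as my_file:
    my_file.write(last_name + u'\n')

# Read the file back, decoding from UTF-8, to check what was stored.
with io.open(path, 'r', encoding='utf-8') as my_file:
    stored = my_file.read()
```

With this approach the file object expects unicode text, so passing it an already-encoded byte string would raise a TypeError.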

Upvotes: 1
