Reputation: 2162
I made a script to download a few pages from a server using BeautifulSoup. I am writing the output to a .csv file. I am using Python 2.7.2.
I get the following error at some point:
Traceback (most recent call last):
File "parser.py", line 114, in <module>
c.writerow([title,description,price,weight,category,subcategory])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb7' in position 61: ordinal not in range(128)
The page I am downloading from (I checked the exact page) doesn't seem to have any weird characters.
I tried some of the solutions from the similar questions. I tried decoding like this:
content.decode('utf-8','ignore')
but it did not work.
As pointed out in Python and BeautifulSoup encoding issues, I checked the website source and it doesn't have any specified meta charset either. I also tried chardet, as suggested in How to download any(!) webpage with correct charset in python?, but the urlread() method doesn't seem to work. I tried urlopen() instead and it crashed.
How can I proceed with this?
Upvotes: 0
Views: 2912
Reputation: 1121406
BeautifulSoup gives you unicode, so to write this to the file you need to encode the data:
content.encode('utf8')
Do this before passing the data to the csv.writerow() method. There is no need to add 'ignore' here, because UTF-8 can encode all of Unicode. Your full line could be:
c.writerow([e.encode('utf8') for e in (title, description, price, weight, category, subcategory)])
using a list comprehension to encode each element in turn.
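A minimal sketch of that pattern, with hypothetical column values standing in for the question's variables (in Python 2 the unicode values become UTF-8 str byte strings):

```python
# -*- coding: utf-8 -*-
# Hypothetical column values standing in for title, price, etc.
title = u'Widget \xb7 deluxe'   # contains the middle dot from the traceback
price = u'9.99'

# Encode every unicode element to a UTF-8 byte string before the csv
# module sees it; Python 2's csv module only handles byte strings safely.
row = [e.encode('utf8') for e in (title, price)]
print(row)
```

Each element encodes cleanly because UTF-8 has a byte sequence for every code point, including u'\xb7'.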
If you need to manipulate the strings first, convert the NavigableString objects to unicode objects first:
unicode(description)
Alternatively, instead of encoding each column value yourself, use the UnicodeWriter class included in the csv module's examples section to have your data encoded automatically.
HTML can often use characters like em-dashes or non-breaking spaces that are not encodable to ASCII, and you won't pick those out with a quick visual scan of the page.
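For example, the middle dot u'\xb7' from the traceback is exactly such a character: it renders as an ordinary-looking dot, has no ASCII encoding, but encodes to UTF-8 without trouble. A small sketch:

```python
ch = u'\xb7'  # the middle dot from the UnicodeEncodeError

try:
    ch.encode('ascii')          # reproduces the error from the question
except UnicodeEncodeError as exc:
    print('ascii failed:', exc)

print('utf-8 ok:', repr(ch.encode('utf8')))  # two bytes, \xc2\xb7
```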
Upvotes: 3
Reputation: 348
It seems the contents of the page have been parsed successfully into a unicode object, but the CSV writer is implicitly converting it back to str and therefore raising the error above. Since UTF-8 can represent any character, you can hopefully use the following:
c.writerow([title.encode("UTF-8"),description.encode("UTF-8"),price.encode("UTF-8"),weight.encode("UTF-8"),category.encode("UTF-8"),subcategory.encode("UTF-8")])
If that doesn't work, then you could try to debug it further by finding out exactly what format the data is in at that point. You can do this by writing the string representations of each variable to the CSV file, rather than the strings themselves, as follows:
c.writerow([repr(title),repr(description),repr(price),repr(weight),repr(category),repr(subcategory)])
Then you can look in the CSV file, and you might see rows like:
"abc","def",u"\u00A0123","456","abc","def"
You can then paste any tricky-looking strings (such as u"\u00A0123") into a Python shell and experiment with them directly, trying different ways of encoding and decoding.
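As a quick sketch of that kind of interactive poking (the value here is hypothetical):

```python
value = u'\u00a0123'  # a non-breaking space followed by digits

# repr() shows the escape sequence instead of the invisible character,
# which is what makes it useful for spotting non-ASCII content.
print(repr(value))

# Once identified, encoding to UTF-8 resolves the write error:
print(repr(value.encode('utf8')))
```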
Upvotes: 1