Reputation: 81
I'm having trouble handling HTML containing escaped Unicode characters (in the Chinese range) in Python 3 with BeautifulSoup on Windows. BeautifulSoup seems to work correctly until I try to print an extracted tag or write it out to a file. I have my default encoding set to utf-8, yet the cp1252 codec seems to be getting selected...
To reproduce:
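from bs4 import BeautifulSoup

soup = BeautifulSoup("隱", "html.parser")
f = open("out.html", "w")  # no encoding given, so the locale default is used
f.write(soup.text)  # raises UnicodeEncodeError (see traceback)
f.close()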
Stack trace attached.
Traceback (most recent call last):
  File "scrape.py", line 143, in <module>
    test_uni()
  File "scrape.py", line 126, in test_uni
    f.write(soup.text)
  File "c:\venv\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u96b1' in position 0: character maps to <undefined>
Upvotes: 0
Views: 208
Reputation: 20553
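You are writing a string containing a non-ASCII character to a file opened in text mode without an explicit encoding. In Python 3, open() then falls back to the locale's preferred encoding, which is cp1252 according to your traceback, and cp1252 cannot represent '\u96b1'. So the Windows default does play a part here.
Encoding the text yourself before writing should work, and utf-8 is fine for Chinese characters. Note that in Python 3 the file must be opened in binary mode to accept bytes:
f = open("out.html", "wb")
f.write(soup.text.encode('utf-8'))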
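Another option that avoids encoding by hand is to keep the file in text mode and pass the encoding to open(). A minimal sketch, assuming the string would normally come from soup.text (a literal stands in for it here, and write_text_utf8 is a hypothetical helper name, not something from the question):

```python
import locale
import os
import tempfile

# The encoding open() falls back to when none is given
# (often cp1252 on Windows, which is what the traceback shows).
print(locale.getpreferredencoding(False))

text = "\u96b1"  # the character from the traceback


def write_text_utf8(path, text):
    """Write text to path as UTF-8, independent of the locale default."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


path = os.path.join(tempfile.gettempdir(), "out.html")
write_text_utf8(path, text)

# Read the raw bytes back to confirm what ended up on disk.
with open(path, "rb") as f:
    raw = f.read()
```

The same encoding="utf-8" argument is needed when reading the file back in text mode; otherwise the locale default is applied again on the way in.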
Upvotes: 1