lxml.html parsing and utf-8 with requests

Question

I used requests to retrieve a URL whose page contains some Unicode characters. I want to do some processing on it and then write it back out.

import lxml.html
import requests

r = requests.get(url)
with open('unicode_test_1.html', 'wb') as f:  # r.content is raw bytes
    f.write(r.content)
html = lxml.html.fromstring(r.content)
htmlOut = lxml.html.tostring(html)
with open('unicode_test_2.html', 'wb') as f:  # tostring() returns bytes here
    f.write(htmlOut)

In unicode_test_1.html all characters look fine, but in unicode_test_2.html some characters have turned into gibberish. Why is that?
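
One plausible cause, sketched here with the standard library only: when UTF-8 bytes are decoded with the wrong codec, every non-ASCII character turns into two or more unrelated characters. Latin-1 is assumed as the wrong codec purely for illustration, and the sample string is made up; whether this is what happens for a given page depends on whether the page declares its charset.

```python
# Hypothetical sample text; any non-ASCII string shows the same effect.
original = "café ünïcode"

# What a server would send: the text encoded as UTF-8 bytes.
raw = original.encode("utf-8")

# Decoding those bytes with the wrong codec produces mojibake.
wrong = raw.decode("latin-1")

# Decoding with the right codec round-trips cleanly.
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

The bytes written to unicode_test_1.html are untouched, which would explain why that file looks fine while the re-serialized one does not.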

I then tried:

html = lxml.html.fromstring(r.text)
htmlOut = lxml.html.tostring(html, encoding='latin1')
with open('unicode_test_2.html', 'wb') as f:  # encoded output is bytes
    f.write(htmlOut)

This seems to work now, but I don't understand why. Should I always use latin1? What is the difference between r.text and r.content, and why can't I write the HTML out using encoding='utf-8'?
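
For reference, r.content is the raw response body as bytes, while r.text is that body decoded to a string using r.encoding. The following is a minimal sketch of handling both forms with lxml, not a definitive fix: the HTML sample is made up, a bytes literal stands in for r.content so the snippet runs offline, and UTF-8 is assumed to be the page's real encoding.

```python
import lxml.html

# Stand-in for r.content: UTF-8 bytes with no charset declaration (made-up sample).
content = "<html><body><p>café</p></body></html>".encode("utf-8")

# Stand-in for r.text: the same body decoded to str with the right codec.
text = content.decode("utf-8")

# Option 1: parse the bytes and tell the parser the encoding explicitly.
parser = lxml.html.HTMLParser(encoding="utf-8")
doc_from_bytes = lxml.html.fromstring(content, parser=parser)

# Option 2: parse the already-decoded text.
doc_from_text = lxml.html.fromstring(text)

# Serialize to UTF-8 bytes; these should be written to a file opened in binary mode.
out_bytes = lxml.html.tostring(doc_from_bytes, encoding="utf-8")

# Or serialize to a str and let the file object handle the encoding.
out_text = lxml.html.tostring(doc_from_text, encoding="unicode")
```

With this approach, out_bytes would be written via open(path, 'wb') and out_text via open(path, 'w', encoding='utf-8'), so the encoding is stated once at each boundary instead of being guessed.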

