Reputation: 18166
I'm trying to figure out if I'm using lxml's xpath function correctly. Here's my current code, including all the workarounds that we've slowly amassed in a pretty sizable scraping library that deals with terrible, terrible input:
import certifi, requests
from lxml import html
s = requests.session()
r = s.get(
    url,
    verify=certifi.where(),
    **request_dict
)
# Throw an error if a bad status code is returned.
r.raise_for_status()
# If the encoding is iso-8859-1, switch it to cp1252 (a superset)
if r.encoding == 'ISO-8859-1':
r.encoding = 'cp1252'
# Grab the content
text = r.text
html_tree = html.fromstring(text)
So if this all works properly, requests uses r.encoding to decide how to create a unicode object when r.text is called. Great. We take that unicode object (text) and send it into lxml.html.fromstring(), which recognizes that it's unicode and gives us back an lxml element tree.
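To make that first step concrete, here's a quick sanity check (a sketch based on requests' documented behavior, not code from our library): r.text is just r.content decoded with r.encoding, so overriding the encoding before touching r.text changes the unicode object you get back.
# Sketch: r.text should equal r.content decoded with r.encoding.
# requests decodes with errors='replace' internally, so mirror that here.
r.encoding = 'cp1252'
assert r.text == r.content.decode('cp1252', errors='replace')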
That all seems to be working properly, but what's troubling is that when I then do:
html_tree.xpath('//text()')[0]
which should give me the first text node in the tree, I get back a byte string, not a unicode object, and I find myself having to write this instead:
html_tree.xpath('//text()')[0].decode('utf8')
This sucks.
The whole idea of all the work I did at the beginning was to create the Mythical Unicode Sandwich, but no matter what I do, I get back binary strings. What am I missing here?
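One stopgap we could use in the meantime (a sketch of my own; the to_unicode helper is invented, not anything lxml provides) is to coerce every string result back to unicode at the boundary, continuing from the html_tree built above:
# Hypothetical helper (not part of lxml): normalize xpath string
# results back to unicode so the pipeline stays unicode-only.
def to_unicode(value, encoding='utf-8'):
    if isinstance(value, str):  # Python 2 byte string: decode it
        return value.decode(encoding)
    return value  # already unicode

text_nodes = [to_unicode(t) for t in html_tree.xpath('//text()')]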
Here's a proof of concept for you:
import certifi, requests
from lxml import html
s = requests.session()
r = s.get('https://www.google.com', verify=certifi.where())
print type(r.text) # <type 'unicode'>, GREAT!
html_tree = html.fromstring(r.text)
first_node = html_tree.xpath('//text()', smart_strings=False)[0]
print type(first_node) # <type 'str'>, TERRIBLE!
Upvotes: 3
Views: 2038
Reputation: 18166
Well, as so often happens, I found the answer shortly after posting a long, detailed question. The reason lxml returns byte strings even when you carefully give it unicode is a performance optimization in lxml. From the FAQ:
In Python 2, lxml's API returns byte strings for plain ASCII text values, be it for tag names or text in Element content.
The reasoning is that ASCII encoded byte strings are compatible with Unicode strings in Python 2, but consume less memory (usually by a factor of 2 or 4) and are faster to create because they do not require decoding. Plain ASCII string values are very common in XML, so this optimisation is generally worth it.
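A quick way to see this optimization in action (a sketch of my own, using smart_strings=False as in the proof of concept above so plain string types come back):
# Python 2 sketch: ASCII-only text comes back as str, non-ASCII as unicode.
from lxml import html
tree = html.fromstring(u'<p>ascii<span>caf\xe9</span></p>')
ascii_text, accented_text = tree.xpath('//text()', smart_strings=False)
print type(ascii_text)     # <type 'str'> -- plain ASCII stays a byte string
print type(accented_text)  # <type 'unicode'> -- non-ASCII forces unicode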
However in Python 3:
lxml always returns Unicode strings for text and names, as does ElementTree. Since Python 3.3, Unicode strings containing only characters that can be encoded in ASCII or Latin-1 are generally as efficient as byte strings. In older versions of Python 3, the above mentioned drawbacks apply.
So there you have it. It's a performance optimization in lxml that adds to the confusion around byte and unicode strings.
At least it's fixed in Python 3! Time to upgrade.
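For the record, here's what the same check looks like under Python 3 (again a sketch of my own, not code from the FAQ):
# Python 3 sketch: every text node is a str (unicode), ASCII or not.
from lxml import html
tree = html.fromstring('<p>ascii<span>café</span></p>')
for node in tree.xpath('//text()', smart_strings=False):
    print(type(node))  # <class 'str'> both times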
Upvotes: 7