Reputation: 9603
Trying to parse HTML, I fail to loop through all li elements:
from lxml import html
page="<ul><li>one</li><li>two</li></ul>"
tree = html.fromstring(page)
for item in tree.xpath("//li"):
    print(html.tostring(item))
    print(item.xpath("//li/text()"))
I expect this output:
b'<li>one</li>'
['one']
b'<li>two</li>'
['two']
but I get this:
b'<li>one</li>'
['one', 'two']
b'<li>two</li>'
['one', 'two']
How is it possible that xpath can get both li elements' text from item in both iteration steps?
I can work around this by using a counter as an index, of course, but I would like to understand what is going on.
Upvotes: 0
Views: 210
Reputation: 9603
From "Lxml html xpath context":
The XPath expression //input will match all input elements anywhere in your document, while .//input will match only those inside the current context.
The solution is to use:
from lxml import html
page="<ul><li>one</li><li>two</li></ul>"
tree = html.fromstring(page)
for item in tree.xpath("//li"):
    print(html.tostring(item))
    print(item.xpath(".//text()"))  # only changed line
Adding . before // prevents matching against the entire document, and li/ needs to be removed since you are already "inside" the li element.
The output is:
b'<li>one</li>'
['one']
b'<li>two</li>'
['two']
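To see the context rule in isolation, the following sketch evaluates the same expressions from the root and from a single li element. It assumes lxml is installed; the variable names (first_li and so on) are only illustrative and not part of the original code.

```python
from lxml import html

page = "<ul><li>one</li><li>two</li></ul>"
tree = html.fromstring(page)
first_li = tree.xpath("//li")[0]  # illustrative name for the first <li>

# A path starting with // is absolute: it is evaluated from the document
# root, no matter which element xpath() is called on.
absolute_from_root = tree.xpath("//li/text()")
absolute_from_li = first_li.xpath("//li/text()")

# A path starting with . (or with no leading slash) is relative:
# it is evaluated from the context node, here the first <li>.
relative_from_li = first_li.xpath(".//text()")
direct_text = first_li.xpath("text()")

print(absolute_from_root)  # ['one', 'two']
print(absolute_from_li)    # ['one', 'two']
print(relative_from_li)    # ['one']
print(direct_text)         # ['one']
```

Calling the absolute expression on first_li gives the same result as calling it on the whole tree, which is why both loop iterations in the question printed ['one', 'two'].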
Upvotes: 1
Reputation: 473893
item.xpath("//li/text()") would search for all li elements in the entire tree. Since you want the text of the current node, you can just get its text(): item.xpath("text()").
Or, even better, just get the text content:
for item in tree.xpath("//li"):
    print(html.tostring(item))
    print(item.text_content())
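The options mentioned here give the same result for the page in the question, but they differ once an li contains nested markup. A small sketch, assuming lxml is installed and using a made-up page with a b element inside the first item (not from the question):

```python
from lxml import html

# Hypothetical input: the first <li> contains a nested <b> element.
page = "<ul><li>one <b>bold</b> tail</li><li>two</li></ul>"
tree = html.fromstring(page)
first_li = tree.xpath("//li")[0]

# text() selects only text nodes that are direct children of the <li>.
child_text = first_li.xpath("text()")

# .//text() selects every text node below the <li>, including nested ones.
all_text_nodes = first_li.xpath(".//text()")

# text_content() joins all nested text into a single string.
joined = first_li.text_content()

print(child_text)      # ['one ', ' tail']
print(all_text_nodes)  # ['one ', 'bold', ' tail']
print(joined)          # one bold tail
```

Which one fits depends on whether you want a list of text fragments or one string, and on whether text inside child elements should be included.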
Upvotes: 1