lxml HtmlElement xpath parses more than it should be able to

Question

Trying to parse HTML, I fail to loop through all li elements:

from lxml import html

page = "<li>one</li><li>two</li>"
tree = html.fromstring(page)

for item in tree.xpath("//li"):
  print(html.tostring(item))
  print(item.xpath("//li/text()"))

I expect this output:

b'<li>one</li>'
['one']
b'<li>two</li>'
['two']

but I get this:

b'<li>one</li>'
['one', 'two']
b'<li>two</li>'
['one', 'two']

How is it possible that xpath can get both li elements' text from item in both iteration steps?

I can of course work around this by using a counter as an index, but I would like to understand what's going on.

qubodup · Accepted Answer

From Lxml html xpath context:

XPath expression //input will match all input elements, anywhere in your document, while .//input will match all inside current context.

The solution is to use:

from lxml import html

page = "<li>one</li><li>two</li>"
tree = html.fromstring(page)

for item in tree.xpath("//li"):
  print(html.tostring(item))
  print(item.xpath(".//text()")) #only changed line

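Adding . before // makes the expression relative to the current element instead of matching against the entire document. The li/ step also needs to be removed, since item is already the li element: .//li would only look for li elements nested inside it.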

The output is:

b'<li>one</li>'
['one']
b'<li>two</li>'
['two']
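
To see the difference in isolation, here is a minimal sketch that evaluates three expressions against the same li element. The sample markup (including the wrapping ul) and the variable names are assumptions made for illustration, not part of the original question.

```python
from lxml import html

# Hypothetical sample markup; the wrapping <ul> is an assumption for illustration.
page = "<ul><li>one</li><li>two</li></ul>"
tree = html.fromstring(page)

# Take the first <li> as the context node for all three queries.
first = tree.xpath("//li")[0]

# A leading // is absolute: the search starts at the document root,
# no matter which element xpath() is called on.
absolute = first.xpath("//li/text()")

# A leading . makes the path relative: only descendants of `first` are searched.
relative = first.xpath(".//text()")

# No leading slash at all: only direct text children of `first`.
direct = first.xpath("text()")

print(absolute)  # ['one', 'two']
print(relative)  # ['one']
print(direct)    # ['one']
```

For flat list items like these, .//text() and text() give the same result. They would differ if an li contained nested tags, because .//text() also collects text from descendants.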
