qubodup

Reputation: 9603

lxml HtmlElement xpath parses more than it should be able to

While parsing HTML with lxml, I fail to get each li element's own text when looping through all li elements:

from lxml import html

page="<ul><li>one</li><li>two</li></ul>"
tree = html.fromstring(page)

for item in tree.xpath("//li"):
  print(html.tostring(item))
  print(item.xpath("//li/text()"))

I expect this output:

b'<li>one</li>'
['one']
b'<li>two</li>'
['two']

but I get this:

b'<li>one</li>'
['one', 'two']
b'<li>two</li>'
['one', 'two']

How is it possible that xpath can get both li elements' text from item in both iteration steps?

I can of course work around this by using a counter as an index, but I would like to understand what's going on.

Upvotes: 0

Views: 210

Answers (2)

qubodup

Reputation: 9603

From Lxml html xpath context:

XPath expression //input will match all input elements, anywhere in your document, while .//input will match all inside current context.

The solution is to use:

from lxml import html

page="<ul><li>one</li><li>two</li></ul>"
tree = html.fromstring(page)

for item in tree.xpath("//li"):
  print(html.tostring(item))
  print(item.xpath(".//text()")) #only changed line

Adding . before // makes the expression relative to the current element instead of searching the entire document, and li/ has to be removed because the context node is already the li element.

The output is:

b'<li>one</li>'
['one']
b'<li>two</li>'
['two']
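
To make the difference easier to see, here is a minimal sketch that runs the three expressions discussed in this thread against the first li only. It reuses the page string from the question; the variable names (first, absolute, relative, direct) are just for illustration.

```python
from lxml import html

page = "<ul><li>one</li><li>two</li></ul>"
tree = html.fromstring(page)

# Take only the first <li> as the context node.
first = tree.xpath("//li")[0]

# A leading // starts at the document root, regardless of the context node.
absolute = first.xpath("//li/text()")

# A leading . anchors the search at the context node and its descendants.
relative = first.xpath(".//text()")

# No axis prefix: only text nodes that are direct children of the context node.
direct = first.xpath("text()")

print(absolute)  # ['one', 'two']
print(relative)  # ['one']
print(direct)    # ['one']
```

Calling xpath() on an element only sets the context node; it does not cut the element out of its document, so an absolute path still sees everything.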

Upvotes: 1

alecxe

Reputation: 473893

item.xpath("//li/text()") would search for all li elements in the entire tree. Since you want the text of the current node, you can just get the text(): item.xpath("text()").

Or, even better, just get the text content:

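from lxml import html
page="<ul><li>one</li><li>two</li></ul>"
tree = html.fromstring(page)
for item in tree.xpath("//li"):
  print(html.tostring(item))
  print(item.text_content())  # text of this li only, as one string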
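
The options only differ once an li contains nested markup. The following is a small sketch with a made-up page string (the b element is not from the question) that compares text(), .//text() and text_content() on the same element:

```python
from lxml import html

# Hypothetical input: an <li> with a nested <b> element.
nested_page = "<ul><li>one <b>bold</b> tail</li></ul>"
nested_tree = html.fromstring(nested_page)
li = nested_tree.xpath(".//li")[0]

# Direct child text nodes only; the text inside <b> is skipped.
direct_text = li.xpath("text()")

# All descendant text nodes, in document order.
all_text = li.xpath(".//text()")

# All descendant text joined into a single string.
content = li.text_content()

print(direct_text)  # ['one ', ' tail']
print(all_text)     # ['one ', 'bold', ' tail']
print(content)      # one bold tail
```

So text() is enough for flat items like the ones in the question, while text_content() is the more convenient choice if the items may contain inline tags.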

Upvotes: 1
