Aufwind

Reputation: 26258

Correct XPath syntax with Python's lxml library for parsing all the text from arbitrarily nested HTML tags

Using lxml in Python, I wrote this XPath expression:

htmlPage.xpath("/html/body//a/text()")

It returns the text of all <a> tags within the parts of the HTML I am interested in. However, I have now found that an <a> tag can look like this:

<a>This is a sentence with some <italic>italic text</italic>-formatting I want to parse.</a>

The XPath query then returns a list with one more element than I expect. I checked and found that it splits the <a> tag shown above into two list elements instead of one. Instead of the string

"This is a sentence with some italic text-formatting I want to parse."

I get the two strings

"This is a sentence with some" # and
"-formatting I want to parse."

Is there a way to correct that?

Upvotes: 0

Views: 680

Answers (1)

Aufwind

Reputation: 26258

I solved my problem by first selecting the <a> elements themselves instead of their text nodes

results = htmlPage.xpath("/html/body//a")

and then iterating over the returned list and calling text_content() on each element

for a_tag in results:
    print(a_tag.text_content())  # prints the whole string: "This is a sentence with some italic text-formatting I want to parse."
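
For anyone who wants to reproduce this, here is a minimal self-contained sketch. It assumes the page was parsed with lxml.html, since text_content() is a method of lxml.html elements and is not available on plain lxml.etree elements. The SAMPLE string and the variable names are made up for illustration; only the <a> tag is taken from the question.

```python
from lxml import html

# Hypothetical sample page wrapped around the <a> tag from the question.
SAMPLE = (
    "<html><body><div>"
    "<a>This is a sentence with some <italic>italic text</italic>"
    "-formatting I want to parse.</a>"
    "</div></body></html>"
)

html_page = html.fromstring(SAMPLE)

# text() selects only the text nodes that are direct children of <a>,
# so the text inside <italic> is skipped and the rest comes back in pieces.
pieces = html_page.xpath("/html/body//a/text()")

# Selecting the elements and calling text_content() joins all nested text.
a_tags = html_page.xpath("/html/body//a")
full_texts = [a_tag.text_content() for a_tag in a_tags]

# Alternative that should also work on lxml.etree elements:
# the XPath string() function returns the element's full string value.
string_values = [a_tag.xpath("string()") for a_tag in a_tags]

print(pieces)
print(full_texts)
print(string_values)
```

Here pieces should contain the two fragments from the question, while full_texts and string_values should each contain the single complete sentence. If the tree comes from lxml.etree instead, "".join(a_tag.itertext()) is another option that probably fits.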

Upvotes: 2
