Correct XPath syntax with Python's lxml library for parsing all the text from arbitrarily nested HTML tags

Question

Using lxml in Python, I wrote this XPath expression:

htmlPage.xpath("/html/body//a/text()")

It gets me the text of all <a>-tags in the HTML scopes I want. However, I found that an <a>-tag can look like this:

<a>This is a sentence with some <i>italic text</i>-formatting I want to parse.</a>

XPath returns a list with one element more than I expect. I checked and found that it splits the <a>-tag mentioned above into two list elements instead of one. Instead of the string

"This is a sentence with some italic text-formatting I want to parse."

I get the two strings

"This is a sentence with some" # and
"-formatting I want to parse."

Is there a way to correct that?

Aufwind · Accepted Answer

I solved my problem by first getting all <a>-tags

results = htmlPage.xpath("/html/body//a")

and then iterating over the returned list and calling text_content() on each element

for a_tag in results:
    print(a_tag.text_content())  # prints the whole string: "This is a sentence with some italic text-formatting I want to parse."
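
For anyone who wants to reproduce this, here is a minimal self-contained sketch. The sample markup is made up for illustration (the <i> element stands in for any nested inline tag), and the variable names are my own. As far as I understand it, text() selects only the text nodes that are direct children of each <a>, which is why the string gets split around the nested element, while text_content() (or the XPath string() function evaluated on the element) concatenates the text of all descendants.

```python
from lxml import html

# Hypothetical sample markup; the <i> element stands in for any nested inline tag.
SAMPLE = (
    "<html><body><div>"
    "<a href='#'>This is a sentence with some <i>italic text</i>"
    "-formatting I want to parse.</a>"
    "</div></body></html>"
)

page = html.fromstring(SAMPLE)

# text() only matches text nodes that are direct children of <a>,
# so the text before and after the nested <i> comes back as separate items.
fragments = page.xpath("/html/body//a/text()")

# Selecting the elements and calling text_content() joins all descendant text.
joined = [a.text_content() for a in page.xpath("/html/body//a")]

# Alternative: evaluate XPath string() relative to each <a> element.
via_xpath = [a.xpath("string()") for a in page.xpath("/html/body//a")]

print(fragments)
print(joined)
print(via_xpath)
```

Both approaches return one string per <a>-tag, so the list length matches the number of links regardless of how deeply the formatting tags are nested.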
