How to select text without the HTML markup

I am working on a web scraper (using Python), so I have a chunk of HTML from which I am trying to extract text. One of the snippets looks something like this:

<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>

I want to extract the text from this paragraph. I could use something along the lines of

//p[@class='something']//text()

but this leads to each chunk of text ending up as a separate result element, like this:

(This class has some ,text, and a few ,links, in it.)

The desired output would contain all the text in one element, like this:

This class has some text and a few links in it.

Is there an easy or elegant way to achieve this?

Edit: Here's the code that produces the result given above.

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

xpath_query = "//p[@class='something']//text()"

tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item))

Upvotes: 2

Answers (3)

bjimba

Reputation: 928

An alternate one-liner on your original code: use a join with an empty string separator:

print("".join(query_results))
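
For reference, here is a minimal, self-contained sketch of the join approach applied to the snippet from the question. The helper name joined_text is made up for illustration; everything else comes from the question's code.

```python
from lxml import html

html_snippet = ('<p class="something">This class has some <strong>text</strong> '
                'and a few <a href="http://www.example.com">links</a> in it.</p>')


def joined_text(snippet, query):
    # Hypothetical helper: run an XPath query that returns text nodes
    # and glue the pieces back together in document order.
    tree = html.fromstring(snippet)
    return "".join(tree.xpath(query))


result = joined_text(html_snippet, "//p[@class='something']//text()")
print(result)
```

Note that if the query matches several paragraphs, the text of all of them ends up merged into one string, so this fits best when a single element is expected.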

Upvotes: 0

Robᵩ

Reputation: 168836

You could call .text_content() on the lxml Element, instead of fetching the text with XPath.

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

xpath_query = "//p[@class='something']"

tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item.text_content()))

Upvotes: 0

kjhughes

Reputation: 111726

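You can use normalize-space() in the XPath; it returns the element's text as a single string, with runs of whitespace collapsed and leading and trailing whitespace trimmed. Then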

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "normalize-space(//p[@class='something'])"

tree = html.fromstring(html_snippet)
print(tree.xpath(xpath_query))

will yield

This class has some text and a few links in it.
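
One thing to keep in mind: in XPath 1.0, a string function such as normalize-space() applied to a node-set only uses the first node. If several paragraphs match, a possible workaround (a sketch, not the only way) is to select the elements first and evaluate normalize-space(.) on each one. The two-paragraph snippet below is invented for illustration.

```python
from lxml import html

# Invented sample with two matching paragraphs and untidy whitespace.
html_snippet = (
    '<div>'
    '<p class="something">First   <strong>bold</strong>\n paragraph.</p>'
    '<p class="something">  Second <a href="http://www.example.com">link</a> here. </p>'
    '</div>'
)

tree = html.fromstring(html_snippet)

# Called on the whole node-set, normalize-space() uses only the first match.
first_only = tree.xpath("normalize-space(//p[@class='something'])")

# Evaluating it relative to each element yields one clean string per paragraph.
texts = [p.xpath("normalize-space(.)") for p in tree.xpath("//p[@class='something']")]

print(first_only)
print(texts)
```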

Upvotes: 2
