I am working on a web scraper (using Python), so I have a chunk of HTML from which I am trying to extract text. One of the snippets looks something like this:
<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>
I want to extract the text from this element. Now, I could use something along the lines of
//p[@class='something']//text()
but this leads to each chunk of text ending up as a separate result element, like this:
(This class has some ,text, and a few ,links, in it.)
The desired output would contain all the text in one element, like this:
This class has some text and a few links in it.
Is there an easy or elegant way to achieve this?
Edit: Here's the code that produces the result given above.
from lxml import html
html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "//p[@class='something']//text()"
tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item))
Upvotes: 2
Views: 398
Reputation: 928
An alternative one-liner based on your original code: join the results with an empty-string separator:
print("".join(query_results))
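To make the join idea concrete, here is a minimal sketch that does not depend on lxml. It assumes the snippet is well-formed XML, so the standard library's xml.etree.ElementTree can parse it, and it uses itertext() (which lxml elements also provide) in place of the //text() query. The variable names chunks and joined are only for illustration.

```python
import xml.etree.ElementTree as ET

html_snippet = (
    '<p class="something">This class has some <strong>text</strong> '
    'and a few <a href="http://www.example.com">links</a> in it.</p>'
)

# Assumption: the snippet is well-formed XML, so the stdlib parser accepts it.
root = ET.fromstring(html_snippet)

# itertext() yields every text chunk in document order, much like //text().
chunks = list(root.itertext())

# Joining with an empty separator glues the chunks back into one string.
joined = "".join(chunks)

print(chunks)
print(joined)
```

The chunks already carry their own leading and trailing spaces, which is why an empty separator is enough here; joining with " " would double the spaces.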
Upvotes: 0
Reputation: 168836
You could call .text_content() on the lxml Element instead of fetching the text with XPath.
from lxml import html
html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "//p[@class='something']"
tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item.text_content()))
Upvotes: 0
Reputation: 111726
You can use normalize-space() in the XPath. Then
from lxml import html
html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "normalize-space(//p[@class='something'])"
tree = html.fromstring(html_snippet)
print(tree.xpath(xpath_query))
will yield
This class has some text and a few links in it.
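One caveat worth keeping in mind: in XPath 1.0, normalize-space() applied to a node set only uses the first node, so a page with several matching paragraphs would yield just the first one. A rough sketch of handling each match separately follows. It uses the standard library's xml.etree.ElementTree on a made-up, well-formed two-paragraph document, and normalized_text is a hypothetical helper, not part of lxml or XPath. Its whitespace collapsing via str.split() only approximates normalize-space(), since Python treats more characters as whitespace than XPath does.

```python
import xml.etree.ElementTree as ET

# Made-up document with two matching paragraphs (assumed well-formed XML).
doc = ET.fromstring(
    '<div>'
    '<p class="something">First   <strong>bold</strong>\n paragraph.</p>'
    '<p class="something">Second <a href="http://www.example.com">link</a>.</p>'
    '</div>'
)


def normalized_text(element):
    """Hypothetical helper: join all descendant text, then collapse whitespace runs."""
    full_text = "".join(element.itertext())
    # split() with no arguments drops leading/trailing whitespace and
    # splits on runs of whitespace; rejoining with one space collapses them.
    return " ".join(full_text.split())


# One normalized string per matching element, instead of only the first match.
texts = [normalized_text(p) for p in doc.findall(".//p[@class='something']")]
print(texts)
```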
Upvotes: 2