Reputation: 22440
I've used a selector in my Python script to fetch the text from the HTML elements given below. I tried .text to get the "Shop here cheap" string from the elements, but it doesn't work at all. However, when I try .text_content(), it works as it should.
My question is:
What's wrong with .text? Why couldn't it get the text from the elements?
HTML elements:
<div class="Price__container">
<span class="ProductPrice" itemprop="price">$6.35</span>
<span class="ProductPrice_original">$6.70</span>
Shop here cheap
</div>
What I tried:
from lxml import html
# element is assumed to hold the HTML shown above
tree = html.fromstring(element)
for data in tree.cssselect(".Price__container"):
    print(data.text)  # It doesn't work at all
Btw, I don't want to use .text_content(), which is why I'm looking for an answer that scrapes the text using .text instead. Thanks in advance.
Upvotes: 1
Views: 124
Reputation: 22440
Another approach could be something like the one below:
content="""
<div class="Price__container">
<span class="ProductPrice" itemprop="price">$6.35</span>
<span class="ProductPrice_original">$6.70</span>
Shop here cheap
</div>
"""
from lxml import html
tree = html.fromstring(content)
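for data in tree.cssselect(".Price__container"):
    # drop_tree() removes each child but keeps its tail text,
    # merging it into the parent's .text
    for item in list(data):
        item.drop_tree()
    print(data.text.strip())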
Output:
Shop here cheap
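To see why this works, here is a minimal sketch, assuming a whitespace-free version of the same markup so the values are easy to read. In lxml.html, drop_tree() removes an element but keeps its tail, appending it to the previous sibling's tail or, if there is no previous sibling, to the parent's .text. The variable names here are just for illustration.

```python
from lxml import html

# Compact copy of the markup, with no whitespace between the tags
content = (
    '<div class="Price__container">'
    '<span class="ProductPrice" itemprop="price">$6.35</span>'
    '<span class="ProductPrice_original">$6.70</span>'
    'Shop here cheap'
    '</div>'
)

tree = html.fromstring(content)
# descendant-or-self matches the div even when it is the returned root
container = tree.xpath('descendant-or-self::div[@class="Price__container"]')[0]

text_before = container.text  # nothing precedes the first span

# Copy the children into a list first so removal cannot disturb iteration
for child in list(container):
    child.drop_tree()

text_after = container.text  # the last span's tail has been merged in

print(repr(text_before))
print(repr(text_after))
```

Once the children are gone, the trailing string is the only text left in the element, which is why .text can finally see it.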
Upvotes: 0
Reputation: 473753
I think the root cause of the confusion is that lxml has this .text and .tail concept for representing the content of nodes, which avoids the need for a special "text" node entity. To quote the documentation:
The two properties .text and .tail are enough to represent any text content in an XML document. This way, the ElementTree API does not require any special text nodes in addition to the Element class, that tend to get in the way fairly often (as you might know from classic DOM APIs).
In your case, "Shop here cheap" is the tail of the <span class="ProductPrice_original">$6.70</span> element and, hence, is not included in the .text value of the parent node.
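As a rough illustration of the .text/.tail split on your markup, here is a small sketch. It uses an XPath lookup instead of cssselect only to avoid the extra cssselect dependency, and the variable names are made up for the example. The exact whitespace in the values may vary with the parser version, so only the stripped result is worth relying on.

```python
from lxml import html

content = """
<div class="Price__container">
<span class="ProductPrice" itemprop="price">$6.35</span>
<span class="ProductPrice_original">$6.70</span>
Shop here cheap
</div>
"""

tree = html.fromstring(content)
# descendant-or-self matches the div even when it is the returned root
container = tree.xpath('descendant-or-self::div[@class="Price__container"]')[0]

# .text covers only what comes before the first child element,
# which here is nothing but whitespace
leading_text = (container.text or "").strip()

# Each child's .tail is the text between its closing tag and the next tag
tails = [child.tail for child in container]

# The wanted string hangs off the last <span> as its tail
last_tail = container[-1].tail.strip()

print(repr(leading_text))
print(tails)
print(last_tail)
```

So if you want to stay with plain attributes, reading .tail from the last child gets you the string without .text_content().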
Aside from other methods, like .text_content(), you can reach the tail by getting all the top-level text nodes non-recursively:
print(''.join(data.xpath("./text()")).strip())
Or, get the last top-level text node:
print(data.xpath("./text()[last()]")[0].strip())
Upvotes: 1