SIM

Reputation: 22440

Scraper giving blank output

I've used a selector in my Python script to fetch the text from the HTML elements given below. I tried .text to get the Shop here cheap string from the elements, but it doesn't work at all. However, when I try .text_content(), it works as it should.

My question is:

What's wrong with the .text attribute? Why couldn't it get the text from the elements?

HTML elements:

<div class="Price__container">
    <span class="ProductPrice" itemprop="price">$6.35</span>
    <span class="ProductPrice_original">$6.70</span>
    Shop here cheap
</div>

What I tried:

from lxml import html

tree = html.fromstring(element)
for data in tree.cssselect(".Price__container"):
    print(data.text)  # It doesn't work at all: gives blank output

By the way, I don't want to use .text_content(), so I'm hoping for an answer that scrapes the text using .text instead. Thanks in advance.

Upvotes: 1

Views: 124

Answers (2)

SIM

Reputation: 22440

Another approach could be something like the one below:

content="""
<div class="Price__container">
    <span class="ProductPrice" itemprop="price">$6.35</span>
    <span class="ProductPrice_original">$6.70</span>
    Shop here cheap
</div>
"""
from lxml import html

tree = html.fromstring(content)
for data in tree.cssselect(".Price__container"):
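    # drop_tree() removes each child but merges its tail text into the parent,
    # so the trailing string ends up in data.text.
    # Iterate over a copy because the loop removes children as it goes.
    for item in list(data):
        item.drop_tree()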
    print(data.text.strip())

Output:

Shop here cheap

Upvotes: 0

alecxe

Reputation: 473753

I think the root cause of the confusion is that lxml uses the .text and .tail concept to represent the content of nodes, which avoids the need for a special "text" node entity. To quote the documentation:

The two properties .text and .tail are enough to represent any text content in an XML document. This way, the ElementTree API does not require any special text nodes in addition to the Element class, that tend to get in the way fairly often (as you might know from classic DOM APIs).

In your case, Shop here cheap is the tail of the <span class="ProductPrice_original">$6.70</span> element and, hence, is not included in the .text value of the parent node.
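
To make the .text/.tail split concrete, here is a minimal sketch that prints where each piece of text from the question's markup ends up. It assumes the same HTML string as in the question. It selects the container with XPath rather than cssselect, only to avoid the extra dependency. The variable names are illustrative.

```python
from lxml import html

content = """
<div class="Price__container">
    <span class="ProductPrice" itemprop="price">$6.35</span>
    <span class="ProductPrice_original">$6.70</span>
    Shop here cheap
</div>
"""

tree = html.fromstring(content)
container = tree.xpath('//div[@class="Price__container"]')[0]

# Iterating an element yields its direct children: the two spans.
first_span, second_span = container

# .text of the parent: only what comes before the first child
# (at most whitespace here, which is why the output looks blank).
print(repr(container.text))

# .text of each span: the prices.
print(repr(first_span.text), repr(second_span.text))

# .tail of the last span: the text that follows its closing tag.
print(repr(second_span.tail))
```

The string the question is after lives in second_span.tail, surrounded by whitespace, so a .strip() is still needed before using it.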

Aside from other methods, like .text_content(), you can reach the tail by getting all the top-level text nodes non-recursively:

print(''.join(data.xpath("./text()")).strip())

Or, get the last top-level text node:

print(data.xpath("./text()[last()]")[0].strip())
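
If you would rather stay with plain attributes, as the question asks, the same non-recursive result can be assembled from .text and .tail directly. The sketch below uses a hypothetical helper, own_text, which is not part of lxml. It joins the parent's .text with the .tail of every direct child, which should match what "./text()" returns for this markup.

```python
from lxml import html


def own_text(element):
    """Return the text that belongs directly to element, ignoring child text.

    Hypothetical helper: concatenates element.text with the .tail of each
    direct child, then strips surrounding whitespace.
    """
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return "".join(parts).strip()


content = """
<div class="Price__container">
    <span class="ProductPrice" itemprop="price">$6.35</span>
    <span class="ProductPrice_original">$6.70</span>
    Shop here cheap
</div>
"""

tree = html.fromstring(content)
container = tree.xpath('//div[@class="Price__container"]')[0]

result = own_text(container)
via_xpath = "".join(container.xpath("./text()")).strip()
print(result)
```

Unlike .text_content(), this skips the text inside the spans, so the prices are left out.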

Upvotes: 1
