SIM

Reputation: 22440

Unable to create an appropriate selector to parse a certain string

I've created a selector to scrape a certain string from some HTML elements. There are two strings within the elements. With the selector in the script below I get both of them, whereas I only want the latter one, which in this case is "I wanna be scraped alone". How can I write a selector that prevents the first string from being parsed?

Here are the HTML elements:

html_elem="""
<a class="expected-content" href="/4570/I-wanna-be-scraped-alone">
    <span class="undesirable-content">I shouldn't be parsed</span>
    I wanna be scraped alone
</a>
"""

I tried with:

from lxml.html import fromstring

root = fromstring(html_elem)
for item in root.cssselect(".expected-content"):
    print(item.text_content())

Output I'm getting:

 I shouldn't be parsed
 I wanna be scraped alone

Expected output:

I wanna be scraped alone

By the way, I also tried root.cssselect(".expected-content:not(.undesirable-content)"), but that is clearly not the right approach. Any help would be highly appreciated.

Upvotes: 2

Views: 64

Answers (1)

Lukas Ansteeg

Reputation: 341

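For the specific example in this question, the simplest answer is:

for item in root.cssselect(".expected-content"):
    print(item[-1].tail)

Here item[-1] is the last child (the span), and an element's tail is the text that follows its closing tag, up to the next sibling or the end of the parent. Note that item.tail itself would be the text after the closing </a>, not the text inside it. This approach will not work, however, if the desired text comes before or between child nodes, or if the element has no children at all. A more robust solution is described below.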

According to the documentation, item.text_content():

Returns the text content of the element, including the text content of its children, with no markup.
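To see why text_content() also picks up the span's text, it may help to look at how lxml splits the text of an element into .text and .tail pieces. The following is only a small sketch; the one-line fragment is a made-up simplification of the markup in the question, not the original HTML.

```python
from lxml.html import fromstring

# Hypothetical one-line fragment, chosen so the whitespace is easy to see.
el = fromstring('<a class="expected-content">before <span>inner</span> after</a>')
span = el[0]

print(repr(el.text))            # 'before '  (text before the first child)
print(repr(span.text))          # 'inner'    (text inside the child)
print(repr(span.tail))          # ' after'   (text following the child's closing tag)
print(repr(el.tail))            # None       (nothing follows the <a> itself)
print(repr(el.text_content()))  # 'before inner after'
```

The text that follows a child element belongs to that child's tail rather than to the parent, while text_content() concatenates all of these pieces.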

So, if you don't want the text of children, remove those first:

from lxml.html import fromstring

html_elem="""
<a class="expected-content" href="/4570/I-wanna-be-scraped-alone">
    <span class="undesirable-content">I shouldn't be parsed</span>
    I wanna be scraped alone
</a>
"""

root = fromstring(html_elem)
for item in root.cssselect(".expected-content"):
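    # Copy the child list first: dropping children while iterating over
    # the live element can skip siblings.
    for child in list(item):
        # drop_tree() removes the child but keeps its tail text.
        child.drop_tree()
    print(item.text_content())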

Note that this example also returns some surrounding whitespace, which can be removed with item.text_content().strip().
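If modifying the tree is undesirable, a non-destructive variant is to collect only the element's own text: its .text plus the .tail of each child. The sketch below is one way to do that and also normalizes whitespace; the helper name own_text and the sample markup are made up for illustration and are not part of lxml.

```python
from lxml.html import fromstring


def own_text(element):
    """Return the text directly inside `element`, skipping descendants' text."""
    # Text before the first child lives in element.text; text after each
    # child lives in that child's .tail.
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    # Collapse runs of whitespace (newlines, indentation) into single spaces.
    return " ".join("".join(parts).split())


# Hypothetical markup with direct text before, between, and after children.
sample = fromstring(
    '<a class="expected-content">first <span>skip</span> second '
    '<b>skip too</b> third</a>'
)
print(own_text(sample))  # first second third
```

In the loop from the question, calling own_text(item) in place of item.text_content() should print only "I wanna be scraped alone" and leaves the parsed tree unchanged.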

Upvotes: 1
