Reputation: 22440
I've created a selector to scrape a certain string from some HTML elements. There are two strings within the elements. With the selector in the script below I get both of them, whereas I expect to get only the latter one, which in this case is I wanna be scraped alone. How can I use a selector that prevents the first string from being parsed?
Here are the HTML elements:
html_elem="""
<a class="expected-content" href="/4570/I-wanna-be-scraped-alone">
<span class="undesirable-content">I shouldn't be parsed</span>
I wanna be scraped alone
</a>
"""
I tried with:
from lxml.html import fromstring
root = fromstring(html_elem)
for item in root.cssselect(".expected-content"):
    print(item.text_content())
Output I'm getting:
I shouldn't be parsed
I wanna be scraped alone
Expected output:
I wanna be scraped alone
Btw, I also tried root.cssselect(".expected-content:not(.undesirable-content)"), but it is definitely not the right approach. Any help would be highly appreciated.
Upvotes: 2
Views: 64
Reputation: 341
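For the specific example in this question, the simplest answer is:
for item in root.cssselect(".expected-content"):
    print(item[-1].tail.strip())
because element.tail holds the text that directly follows an element's closing tag, so the tail of the last child (item[-1], here the span) is the text after it. Note that item.tail itself would be the text after the closing </a>, not the text inside it. This will not work, however, if the desired text comes before or between child nodes. A more robust solution builds on item.text_content(), which according to the documentation: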
Returns the text content of the element, including the text content of its children, with no markup.
So, if you don't want the text of children, remove those first:
from lxml.html import fromstring
html_elem="""
<a class="expected-content" href="/4570/I-wanna-be-scraped-alone">
<span class="undesirable-content">I shouldn't be parsed</span>
I wanna be scraped alone
</a>
"""
root = fromstring(html_elem)
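for item in root.cssselect(".expected-content"):
    # Copy the child list first, because drop_tree() modifies it during iteration.
    for child in list(item):
        # drop_tree() removes the child and its text but keeps its tail text.
        child.drop_tree()
    print(item.text_content())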
Note that this example also returns some surrounding whitespace, which can be removed with item.text_content().strip().
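If you would rather not modify the tree, here is a minimal sketch of an alternative that only reads the text nodes sitting directly inside the element: its own .text plus the .tail of each child. Note that own_text is a hypothetical helper name, not part of lxml, and find_class is used instead of cssselect only so that the snippet does not depend on the cssselect package.

```python
from lxml.html import fromstring

html_elem = """
<a class="expected-content" href="/4570/I-wanna-be-scraped-alone">
<span class="undesirable-content">I shouldn't be parsed</span>
I wanna be scraped alone
</a>
"""


def own_text(element):
    """Hypothetical helper: text directly inside `element`, skipping children's text."""
    # element.text is the text before the first child (may be None).
    parts = [element.text or ""]
    # Each child's tail is the text between that child and the next sibling.
    parts.extend(child.tail or "" for child in element)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(" ".join(parts).split())


root = fromstring(html_elem)
for item in root.find_class("expected-content"):
    print(own_text(item))  # prints: I wanna be scraped alone
```

Because nothing is removed, the span is still there afterwards if you need it for something else.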
Upvotes: 1