Reputation: 7930
I want to parse an HTML document like this with requests-html 0.9.0:
from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('.data', first=True)
print(data.html)
# <span class="data">important data</span> and some rubbish
print(data.text)
# important data and some rubbish
I need to distinguish the text inside the tag (enclosed by it) from the tag's tail (the text that follows the element up to the next tag). This is the behaviour I initially expected:
data.text == 'important data'
data.tail == ' and some rubbish'
But tail is not defined for Elements. Since requests-html provides access to inner lxml objects, we can try to get it from lxml.etree.Element.tail:
from lxml.etree import tostring
print(tostring(data.lxml))
# b'<html><span class="data">important data</span></html>'
print(data.lxml.tail is None)
# True
There's no tail in the lxml representation! The tag with its inner text is fine, but the tail seems to have been stripped away. How do I extract 'and some rubbish'?
Edit: I discovered that full_text provides the inner text only (so much for “full”). This enables a dirty hack of subtracting full_text from text, although I'm not positive it will work if there are any links.
print(data.full_text)
# important data
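For illustration, the subtraction hack could look roughly like the sketch below. It works on plain strings only; tail_by_subtraction is a hypothetical helper name, not part of requests-html, and it assumes text really begins with full_text character for character (which may not hold if the two properties normalise whitespace differently).

```python
def tail_by_subtraction(text, full_text):
    """Return what follows full_text in text, or None if text does not start with it.

    Hypothetical helper: 'text' and 'full_text' stand for the values of
    data.text and data.full_text shown in the question.
    """
    if not text.startswith(full_text):
        # The prefix assumption failed, so the subtraction is not meaningful.
        return None
    return text[len(full_text):]


print(repr(tail_by_subtraction('important data and some rubbish', 'important data')))
```

Returning None instead of guessing makes the failure visible when the two strings do not line up.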
Upvotes: 4
Views: 1984
Reputation: 65
The tail property exists on objects of type 'lxml.html.HtmlElement'.
I think what you are asking for is very easy to implement.
Here is a very simple example using requests_html and lxml:
from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('span')
print(data[0].text)  # important data and some rubbish
print(data[-1].text)  # important data
print(data[-1].element.tail)  # and some rubbish
The element attribute points to the 'lxml.html.HtmlElement' object.
Hope this helps.
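To make the text/tail split concrete without depending on requests_html, here is a minimal sketch using the standard library's xml.etree.ElementTree, whose element API (text, tail) lxml mirrors. Swapping in ElementTree is my assumption for the sake of a self-contained example; it parses the snippet as XML rather than HTML.

```python
import xml.etree.ElementTree as ET

# Same markup as in the question, parsed with the standard library instead of lxml.
root = ET.fromstring(
    '<span><span class="data">important data</span> and some rubbish</span>'
)

# Select the inner span by its class attribute.
inner = root.find("span[@class='data']")

print(repr(inner.text))  # text enclosed by the inner span
print(repr(inner.tail))  # text after the inner span's end tag, up to the next tag
```

The tail belongs to the inner element object, even though in the markup it sits inside the outer span.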
Upvotes: 0
Reputation: 52695
I'm not sure I've understood your problem, but if you just want to get 'and some rubbish', you can use the code below:
from requests_html import HTML
from lxml.html import fromstring
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = fromstring(html.html)
# or without using requests_html.HTML: data = fromstring('<span><span class="data">important data</span> and some rubbish</span>')
print(data.xpath('//span[span[@class="data"]]/text()')[-1]) # " and some rubbish"
Note that data = html.find('.data', first=True) returns the <span class="data">important data</span> node, which doesn't contain " and some rubbish": that text is a child text node of the parent span!
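As a rough illustration of that last point, the sketch below collects the text nodes that sit directly under a parent element, which is approximately what the /text() step in the XPath selects. It uses the standard library's xml.etree.ElementTree rather than lxml (my substitution, to keep it self-contained), and direct_text_nodes is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def direct_text_nodes(parent):
    """Return the text nodes that are direct children of parent, in document order."""
    # Text before the first child is stored on the parent itself.
    nodes = [parent.text] if parent.text else []
    # Text after each child is stored as that child's tail.
    for child in parent:
        if child.tail:
            nodes.append(child.tail)
    return nodes


outer = ET.fromstring(
    '<span><span class="data">important data</span> and some rubbish</span>'
)
print(direct_text_nodes(outer))
```

Taking the last item of the returned list corresponds to the [-1] index used above.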
Upvotes: 3