Reputation: 7789
I have HTML text like this
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
<othertag>
data
</othertag>
<moretag>
data
</moretag>
I'm querying it with the following XPath:
//p//text() | //othertag//text() | //moretag//text()
which gives me the text broken at each <br> tag, like this:
('This is some important data','Even this is data','this is useful too','othertag text data','moretag text data')
I'd like the <p> text as one complete string,
('This is some important data Even this is data this is useful too')
because I'll be querying other elements with the XPath union operator (|), and it's very important that each element's text content stays properly separated.
How can I do this?
If this is impossible, can I at least get the inner HTML of <p> somehow, so that I can store it as text like this:
This is some important data<br>Even this is data<br>this is useful too
I'm using lxml.html in Python 2.7.
Upvotes: 0
Views: 3004
Reputation: 54541
You can also expose your own functions in XPath:
import lxml.html, lxml.etree
raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
'''
doc = lxml.html.fromstring(raw_doc)
ns = lxml.etree.FunctionNamespace(None)
def cat(context, a):
    return [''.join(a)]
ns['cat'] = cat
print repr(doc.xpath('cat(//p/text())'))
which prints
'\nThis is some important data\n\nEven this is data\n\nthis is useful too\n'
You can perform the transformations however you like using this method.
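As one illustration of "however you like", here is a minimal sketch of a variant that also collapses whitespace, so the pieces come back as a single clean string. The function name squash is made up for this example, and it is written for Python 3 rather than the 2.7 used above.

```python
import lxml.etree
import lxml.html

doc = lxml.html.fromstring(
    "<p>\nThis is some important data\n<br>\n"
    "Even this is data\n<br>\nthis is useful too\n</p>"
)

def squash(context, nodes):
    # 'squash' is a hypothetical name; nodes is the list of text nodes
    # matched by the XPath expression passed as the argument.
    # Join the pieces, then collapse every run of whitespace to one space.
    return " ".join(" ".join(nodes).split())

# Register the function in the default (unprefixed) function namespace.
ns = lxml.etree.FunctionNamespace(None)
ns["squash"] = squash

result = doc.xpath("squash(//p/text())")
print(result)
```

Note that FunctionNamespace(None) registers the function globally, so it is visible to every XPath call in the process.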
Upvotes: 2
Reputation: 311238
Update
Based on your edit, maybe you can use the XPath string() function. For example:
>>> doc.xpath('string(//p)')
'\n This is some important data\n \n Even this is data\n \n this is useful too\n '
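If the leftover newlines and indentation are a problem, wrapping the call in the standard XPath 1.0 normalize-space() function should trim and collapse them. A small sketch, assuming Python 3 and an inline sample document instead of myfile.html:

```python
import lxml.html

# Inline sample standing in for the asker's document.
doc = lxml.html.fromstring(
    "<p>\n  This is some important data\n  <br>\n"
    "  Even this is data\n  <br>\n  this is useful too\n</p>"
)

# string() takes the text of the first matching <p>;
# normalize-space() strips the ends and collapses inner whitespace runs.
text = doc.xpath("normalize-space(string(//p))")
print(text)
```

Like string(), this only looks at the first node matched by //p, so it suits a single element rather than a union of several.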
(original answer follows)
If you're getting back the text you want in multiple pieces:
('This is some important data','Even this is data','this is useful too')
Why not just join those pieces?
>>> ' '.join(doc.xpath('//p/text()'))
'\n This is some important data\n  \n Even this is data\n  \n this is useful too\n '
You can even get rid of the line breaks:
>>> ' '.join(x.strip() for x in doc.xpath('//p/text()'))
'This is some important data Even this is data this is useful too'
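To keep the text properly divided when several elements are selected with the union operator, one option is to select the elements themselves and join the text per element. This is only a sketch: it assumes Python 3, and the wrapping <div> is added here just so the sample parses as a single fragment.

```python
import lxml.html

# The asker's sample, wrapped in a <div> (an assumption of this sketch).
raw = """
<div>
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
<othertag>
data
</othertag>
<moretag>
data
</moretag>
</div>
"""

doc = lxml.html.fromstring(raw)

# Select elements rather than text nodes, so each result maps to one element.
elements = doc.xpath("//p | //othertag | //moretag")

# text_content() gathers all text inside an element;
# split() + join collapses the whitespace left around each <br>.
texts = [" ".join(el.text_content().split()) for el in elements]
print(texts)
```

Each entry in texts then corresponds to exactly one matched element, in document order.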
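If you want the "inner HTML" of the p element, you can call lxml.etree.tostring on all of its children. The text before the first child is not part of any child, so prepend p.text or it will be missing:
>>> p = doc.xpath('//p')[0]
>>> (p.text or '') + ''.join(etree.tostring(x) for x in p)
'\n This is some important data\n <br/>\n Even this is data\n <br/>\n this is useful too\n '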
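For the exact form asked for in the question (with <br> rather than <br/>), a possible sketch uses lxml.html.tostring, which serializes as HTML by default. It assumes Python 3 and a compact sample without extra whitespace.

```python
import lxml.html

# Compact sample; a single top-level element is returned as-is.
p = lxml.html.fromstring(
    "<p>This is some important data<br>Even this is data<br>this is useful too</p>"
)

# p.text holds the text before the first child; each child is serialized
# together with its tail (the text that follows it inside <p>).
inner = (p.text or "") + "".join(
    lxml.html.tostring(child, encoding="unicode") for child in p
)
print(inner)
```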
NB: All of these examples assume:
>>> from lxml import etree
>>> doc = etree.parse(open('myfile.html'),
... parser=etree.HTMLParser())
Upvotes: 2