wolfgang

Reputation: 7789

XPATH - how to get inner text data littered with <br> tags?

I have HTML text like this

<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>


<othertag>
 data
</othertag>
<moretag>
 data
</moretag>

I'm trying to query the following with XPATH

//p//text() | //othertag//text() | //moretag//text()

which gives me text that is broken at each <br> tag, like this:

('This is some important data','Even this is data','this is useful too','othertag text data','moretag text data')

I'd like the <p> content as one complete string:

('This is some important data Even this is data this is useful too')

because I'll be querying other elements using the XPath union operator (|), and it's very important that this text content is divided properly.

How can I do this?

If this is impossible, can I at least get the inner HTML of <p> somehow, so that I can store it as text like this:

This is some important data<br>Even this is data<br>this is useful too

I'm using lxml.html in Python 2.7

Upvotes: 0

Views: 3004

Answers (2)

FatalError

Reputation: 54541

You can also expose your own functions in XPath:

import lxml.html, lxml.etree

raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
'''

doc = lxml.html.fromstring(raw_doc)
ns = lxml.etree.FunctionNamespace(None)

def cat(context, a):
    return [''.join(a)]
ns['cat'] = cat

print repr(doc.xpath('cat(//p/text())'))

which prints

'\nThis is some important data\n\nEven this is data\n\nthis is useful too\n'

You can perform the transformations however you like using this method.
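As a rough illustration of "however you like", here is a minimal sketch of an extension function that also collapses whitespace while joining. The function name `joined` and the wrapping `<div>` are made up for this example; note that `FunctionNamespace(None)` registers the function globally for every XPath evaluation in the process.

```python
from lxml import etree, html

def joined(context, nodes):
    # `nodes` is the list of text results selected by the XPath argument.
    # Collapse internal whitespace in each piece and drop empty pieces.
    parts = (' '.join(n.split()) for n in nodes)
    return ' '.join(p for p in parts if p)

# Hypothetical name; registered in the default (prefix-less) function namespace.
ns = etree.FunctionNamespace(None)
ns['joined'] = joined

doc = html.fromstring('''<div><p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p></div>''')

result = doc.xpath('joined(//p/text())')
print(result)
```

Because the function returns a plain string, the XPath call should hand back a single string rather than a list of text nodes.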

Upvotes: 2

larsks

Reputation: 311238

Update

Based on your edit, maybe you can use the XPath string() function. For example:

>>> doc.xpath('string(//p)')
'\n    This is some important data\n    \n    Even this is data\n    \n    this is useful too\n  '
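If the leftover newlines are a problem, the standard XPath 1.0 function normalize-space() can be wrapped around string(). A small self-contained sketch (the sample markup and the wrapping `<div>` are only for illustration); keep in mind that string() uses only the first node of a node-set, so this covers the first <p> only:

```python
from lxml import html

doc = html.fromstring('''<div><p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p></div>''')

# string() takes the string value of the first matching <p>;
# normalize-space() trims it and collapses whitespace runs to single spaces.
text = doc.xpath('normalize-space(string(//p))')
print(text)
```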

(original answer follows)

If you're getting back the text you want in multiple pieces:

('This is some important data','Even this is data','this is useful too')

Why not just join those pieces?

>>> ' '.join(doc.xpath('//p/text()'))
'\n    This is some important data\n     \n    Even this is data\n     \n    this is useful too\n  '

You can even get rid of the line breaks:

>>> ' '.join(x.strip() for x in doc.xpath('//p/text()'))
'This is some important data Even this is data this is useful too'
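To keep the text divided per element when several element types are queried together, one option is to put the union on the elements rather than on their text nodes, and then build one string per element in Python. This is only a sketch under the assumption that lxml.html is used (text_content() is a method of lxml.html elements); the wrapping `<div>` is added just to make the sample self-contained:

```python
from lxml import html

doc = html.fromstring('''<div>
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
<othertag>
 data
</othertag>
<moretag>
 data
</moretag>
</div>''')

# Select the elements themselves, in document order.
elements = doc.xpath('//p | //othertag | //moretag')

# One whitespace-normalized string per element.
texts = [' '.join(el.text_content().split()) for el in elements]
print(texts)
```

Each entry of `texts` then corresponds to exactly one matched element, so the <br> tags no longer split the <p> content into separate results.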

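If you wanted the "inner html" of the p element, you could call lxml.etree.tostring on all of its children (note that this skips any text before the first child, which lxml stores in the element's .text attribute):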

>>> ''.join(etree.tostring(x) for x in doc.xpath('//p')[0].getchildren())
'<br/>\n    Even this is data\n    <br/>\n    this is useful too\n  '
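To include the leading text as well, the element's .text can be prepended to the serialized children. A self-contained sketch that builds its own document, assuming lxml.html and its HTML serializer (which writes <br> rather than <br/>); the helper name `inner_html` is made up for this example:

```python
from lxml import html

doc = html.fromstring('''<div><p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p></div>''')

def inner_html(element):
    # Text before the first child lives in .text; each child is serialized
    # together with its tail (the text that follows it).
    children = ''.join(html.tostring(child, encoding='unicode') for child in element)
    return (element.text or '') + children

p = doc.xpath('//p')[0]
inner = inner_html(p)
print(repr(inner))
```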

NB: All of these examples assume:

>>> from lxml import etree
>>> doc = etree.parse(open('myfile.html'),
...    parser=etree.HTMLParser())

Upvotes: 2
