Reputation: 19645
I am trying to do this using lxml, but ultimately it is a question about the proper XPath.
I'd like to select from the <pgBreak> element until the end of its parent, in this case <p>.
<root>
<pgBreak pgId="1"/>
<p>
some text to fill out a para
<pgBreak pgId="2"/>
some more text
<quote> A quoted block </quote>
remainder of para
</p>
</root>
Desired output:
<root>
<pgBreak pgId="1"/>
<p>
some text to fill out a para
</p>
<pgBreak pgId="2"/>
<p>
some more text
<quote> A quoted block </quote>
remainder of para
</p>
</root>
Upvotes: 0
Views: 1054
Reputation: 19601
What you are trying to do is not trivial: not only do you want to match 'pgBreak' elements and all subsequent siblings, you then want to move them outside of the parent scope and wrap the siblings in a 'p' element. Fun stuff.
The following code should give you an idea of how to achieve that (DISCLAIMER: example only, needs clean-up, edge cases probably not handled). Code is deliberately uncommented so you have to figure it out :)
I've modified the input XML slightly to illustrate the functionality better.
import lxml.etree
text = """
<root>
<pgBreak pgId="1"/>
<p>
some text to fill out a para
<pgBreak pgId="2"/>
some more text
<quote> A quoted block </quote>
remainder of para
<pgBreak pgId="3"/>
<p>
blurb
</p>
</p>
</root>
"""
root = lxml.etree.fromstring(text)
for pgbreak in root.xpath('//pgBreak'):
    inner = pgbreak.getparent()
    if inner == root:
        continue
    outer = inner.getparent()
    pgbreak_index = inner.index(pgbreak)
    inner_index = outer.index(inner) + 1
    siblings = inner[pgbreak_index+1:]
    inner.remove(pgbreak)
    outer.insert(inner_index, pgbreak)
    if siblings[0].tag != 'p':
        p = lxml.etree.Element('p')
        p.text = pgbreak.tail
        pgbreak.tail = None
        for node in siblings:
            p.append(node)
        outer.insert(inner_index+1, p)
    else:
        for node in siblings:
            inner_index += 1
            outer.insert(inner_index, node)
Output is:
<root>
<pgBreak pgId="1"/>
<p>
some text to fill out a para
</p>
<pgBreak pgId="2"/>
<p>
some more text
<quote> A quoted block </quote>
remainder of para
</p>
<pgBreak pgId="3"/>
<p>
blurb
</p>
</root>
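If you only need the selection part of the question (everything from a pgBreak to the end of its parent) and not the restructuring, the XPath following-sibling axis may be enough. The snippet below is a minimal sketch, assuming the original input from the question and assuming that only pgBreak elements nested inside a p are of interest. The variable names are illustrative.

```python
import lxml.etree

# Sample input mirroring the question's original document.
text = """<root>
<pgBreak pgId="1"/>
<p>
some text to fill out a para
<pgBreak pgId="2"/>
some more text
<quote> A quoted block </quote>
remainder of para
</p>
</root>"""

root = lxml.etree.fromstring(text)

# following-sibling::node() selects every node after the pgBreak inside the
# same parent: elements as well as text nodes.
nodes = root.xpath('//p/pgBreak/following-sibling::node()')

for node in nodes:
    if isinstance(node, str):
        # lxml returns text nodes as string-like objects
        print('text:', repr(node.strip()))
    else:
        print('element:', node.tag)
```

Note that the XPath only selects the nodes. Moving them out of the parent still requires tree manipulation like the loop above, because in lxml the text after an element is stored as that element's tail rather than as a separate child.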
Upvotes: 1