Reputation: 540
I'm trying to remove everything in an XML document between two tags, using Python and lxml. The problem is that the tags can be in different branches of the tree (but always at the same depth). An example document might look like this:
<root>
<p> Hello world <start />this is a paragraph </p>
<p> Goodbye world. <end />I'm leaving now </p>
</root>
I'd like to remove everything between the start and end tags, which would result in a single p tag:
<root>
<p> Hello world I'm leaving now </p>
</root>
Does anyone have any idea how this might be accomplished using lxml and Python?
Upvotes: 1
Views: 1173
Reputation: 28676
You could try using the SAX-like target parser interface:
from lxml import etree

class SkipStartEndTarget:
    """Parser target that drops every event from <start/> through <end/>."""

    def __init__(self, *args, **kwargs):
        self.builder = etree.TreeBuilder()
        self.skip = False

    def start(self, tag, attrib, nsmap=None):
        # Begin skipping before <start/> itself is forwarded to the builder.
        if tag == 'start':
            self.skip = True
        if not self.skip:
            self.builder.start(tag, attrib, nsmap)

    def data(self, data):
        if not self.skip:
            self.builder.data(data)

    def comment(self, comment):
        if not self.skip:
            self.builder.comment(comment)

    def pi(self, target, data):
        if not self.skip:
            self.builder.pi(target, data)

    def end(self, tag):
        if not self.skip:
            self.builder.end(tag)
        # Stop skipping only after <end/> itself has been dropped.
        if tag == 'end':
            self.skip = False

    def close(self):
        self.skip = False
        return self.builder.close()
You can then use the SkipStartEndTarget class to make a parser target, and create a custom XMLParser with that target, like this:
parser = etree.XMLParser(target=SkipStartEndTarget())
You can still provide other parser options to the parser if you need them. Then you can provide this parser to the parser function you are using, for example:
elem = etree.fromstring(xml_str, parser=parser)
This also works with etree.XML() and etree.parse(), and you can even set the parser as the default parser with etree.set_default_parser() (which is probably not a good idea). One thing that might trip you up: even with etree.parse(), this will not return an elementtree, but always an element (as etree.XML() and etree.fromstring() do). I don't think this can be done (yet), so if this is an issue for you, you will have to work around it somehow.
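As a quick end-to-end check of the target-parser pattern above, here is a minimal sketch that runs on the document from the question. It deliberately swaps lxml for the standard library's xml.etree.ElementTree, which accepts a target object in the same way, so that it runs without lxml installed. The class name SkipTarget is made up for this sketch, and comment/PI handling is left out for brevity.

```python
import xml.etree.ElementTree as ET

class SkipTarget:
    """Stdlib variant of the target: drops events from <start/> through <end/>."""

    def __init__(self):
        self.builder = ET.TreeBuilder()
        self.skip = False

    def start(self, tag, attrib):
        if tag == "start":
            self.skip = True
        if not self.skip:
            self.builder.start(tag, attrib)

    def data(self, data):
        if not self.skip:
            self.builder.data(data)

    def end(self, tag):
        if not self.skip:
            self.builder.end(tag)
        if tag == "end":
            self.skip = False

    def close(self):
        return self.builder.close()

doc = (
    "<root><p> Hello world <start />this is a paragraph </p>"
    "<p> Goodbye world. <end />I'm leaving now </p></root>"
)
parser = ET.XMLParser(target=SkipTarget())
root = ET.fromstring(doc, parser=parser)  # returns whatever close() returns
print(ET.tostring(root, encoding="unicode"))
# <root><p> Hello world I'm leaving now </p></root>
```

Note that the closing </p> of the first paragraph and the opening <p> of the second are both dropped along with the text, which is why the two paragraphs merge into one.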
Note that it is also possible to create an elementtree from SAX events with lxml.sax, which is probably somewhat more difficult and slower. Contrary to the above example, it will return an elementtree, but I think it doesn't provide the .docinfo you would get when using etree.parse() normally. I also believe it (currently) doesn't support comments and PIs. (I haven't used it yet, so I can't be more precise at the moment.)
Also note that any SAX-like approach to parsing the document requires that skipping everything between <start/> and <end/> still results in a well-formed document. That is the case in your example, but it would not be if the second <p> were a <p2>, for example, as you'd end up with <p>....</p2>.
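One way to test that precondition up front is to replay the filtered event stream and check that every end tag still matches the most recent open tag. The following is only a sketch using the standard library's XMLPullParser; the helper names filtered_events and is_balanced are invented for illustration and are not part of lxml or ElementTree.

```python
import xml.etree.ElementTree as ET

def filtered_events(xml_text):
    """Yield (event, tag) pairs, dropping everything from <start/> through <end/>."""
    parser = ET.XMLPullParser(events=("start", "end"))
    parser.feed(xml_text)
    parser.close()
    skip = False
    for event, elem in parser.read_events():
        if event == "start" and elem.tag == "start":
            skip = True
        if not skip:
            yield event, elem.tag
        if event == "end" and elem.tag == "end":
            skip = False

def is_balanced(events):
    """Return True if every 'end' closes the most recently opened tag."""
    stack = []
    for event, tag in events:
        if event == "start":
            stack.append(tag)
        elif not stack or stack.pop() != tag:
            return False
    return not stack

ok_doc = "<root><p> Hello <start />x</p><p> Bye <end />y</p></root>"
bad_doc = "<root><p> Hello <start />x</p><p2> Bye <end />y</p2></root>"

print(is_balanced(filtered_events(ok_doc)))   # True
print(is_balanced(filtered_events(bad_doc)))  # False: <p> would be closed by </p2>
```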
Upvotes: 0
Reputation: 85468
I know there are some people who'll want to stone me for this, but you could just use regex:
import re
new_string = re.sub(r'<start />(.*?)<end />', '', your_string, flags=re.S)
You can't use an XML parser when it's not valid XML.
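Here is a small runnable sketch of the regex approach applied to the document from the question. One detail worth knowing: the fourth positional parameter of re.sub is count, not flags, so re.S has to be passed as flags=re.S for the dot to match across the line break between the two paragraphs. The \s* in the pattern is an addition in this sketch to tolerate both <start/> and <start />.

```python
import re

doc = (
    "<root>\n"
    "<p> Hello world <start />this is a paragraph </p>\n"
    "<p> Goodbye world. <end />I'm leaving now </p>\n"
    "</root>"
)

# Non-greedy match from <start/> through <end/>, spanning newlines via re.S.
cleaned = re.sub(r"<start\s*/>.*?<end\s*/>", "", doc, flags=re.S)
print(cleaned)
# <root>
# <p> Hello world I'm leaving now </p>
# </root>
```

This treats the markup as plain text, so it will also match <start /> inside comments or CDATA sections.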
Upvotes: 1
Reputation: 43497
You've got a mess on your hands and should slap the person who wrote an intentional perversion of the XML nesting rule.
You are probably best off using something like SAX to recognize the <start/> tag and begin discarding input until you hit an <end/>. SAX has the advantage over lxml here because it allows you to take arbitrary actions per lexeme, while lxml will have already divorced start and end before you get to touch them.
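A minimal sketch of that idea with the standard library's xml.sax might look like the following. The handler forwards events to an XMLGenerator, which writes the surviving markup back out as text, and stays silent between <start/> and <end/>. The class name SkipBetween is made up for this example, and it assumes the remaining events still form well-formed XML.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator

class SkipBetween(xml.sax.ContentHandler):
    """Re-emit SAX events, discarding everything from <start/> through <end/>."""

    def __init__(self, out):
        super().__init__()
        self.gen = XMLGenerator(out, encoding="utf-8")
        self.skip = False

    def startElement(self, name, attrs):
        if name == "start":
            self.skip = True
        if not self.skip:
            self.gen.startElement(name, attrs)

    def characters(self, content):
        if not self.skip:
            self.gen.characters(content)

    def endElement(self, name):
        if not self.skip:
            self.gen.endElement(name)
        if name == "end":
            self.skip = False

doc = (
    "<root><p> Hello world <start />this is a paragraph </p>"
    "<p> Goodbye world. <end />I'm leaving now </p></root>"
)
out = io.StringIO()
xml.sax.parseString(doc.encode("utf-8"), SkipBetween(out))
result = out.getvalue()
print(result)
```

Because the output is written as a stream, this variant never builds a tree in memory, which may matter for large documents.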
While you're at it, you might want to convert those documents to usable XML.
Upvotes: 1