Dov Grobgeld
Dov Grobgeld

Reputation: 4983

How to support recursive include when parsing xml

I'm defining an xml schema of my own which supports the additional tag "insert_tag", which when reached should insert the text file at that point in the stream and then continue the parsing:

Here is an example:

my.xml:

<xml> Something <insert_file name="foo.html"/> or another </xml>

I'm using xmlreader as follows:

 class HtmlHandler(xml.sax.handler.ContentHandler):

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)

 parser = xml.sax.make_parser()
 parser.setContentHandle(HtmlHandler())

 parser.parse(StringIO(html))

The question is how do I insert the included contents directly into the parsing stream? Of course I could recursively build up the non-interpolated text by repeatedly inserting included text, but that means that I have to parse the xml multiple times.

I tried replacing StringIO(html) with my own stream that allows inserting contents mid stream, but it doesn't work because the sax parser reads the stream buffered.

Update:

I did find a solution that is hackish at the best. It is built on the following stream class:

class InsertReader():
    """A reader class that supports the concept of pushing another
    reader in the middle of the use of a first reader. This may
    be used for supporting insertion commands."""
    def __init__(self):
        self.reader_stack = []

    def push(self,reader):
        self.reader_stack += [reader]

    def pop(self):
        self.reader_stack.pop()

    def __iter__(self):
        return self

    def read(self,n=-1):
        """Read from the top most stack element. Never trancends elements.
        Should it?

        The code below is a hack. It feeds only a single token back to
        the reader.
        """
        while len(self.reader_stack)>0:
            # Return a single token
            ret_text = StringIO()
            state = 0
            while 1:
                c = self.reader_stack[-1].read(1)
                if c=='':
                    break

                ret_text.write(c)
                if c=='>':
                    break

            ret_text = ret_text.getvalue()
            if ret_text == '':
                self.reader_stack.pop()
                continue
            return ret_text
        return ''

    def next(self):
        while len(self.reader_stack)>0:
            try:
                v = self.reader_stack[-1].next()
            except StopIteration:
                self.reader_stack.pop()
                continue
            return v
        raise StopIteration

This class creates a stream structure that restricts the amount of characters that are returned to the user of the stream. I.e. even if the xml parser does read(16386) the class will only return bytes up to the next '>' character. Since the '>' character also signifies the end of tags, we have the opportunity to inject our recursive include into the stream at this point.

What is hackish about this solution is the following:

This solves the problem for me, but I'm still interested in a more beautiful solution.

Upvotes: 3

Views: 1024

Answers (1)

Steven
Steven

Reputation: 28696

Have you considered using xinclude? The lxml library has builtin support for it.

Upvotes: 0

Related Questions