Reputation: 529

Parsing a partial XML with python lxml

I'm trying to parse a large XML file which is being received from the network in Python.

In order to do that, I get the data and pass it to lxml.etree.iterparse

However, if the XML has yet to fully be sent, like so:

<MyXML>
    <MyNode foo="bar">
    <MyNode foo="ba

If I run etree.iterparse(f, tag='MyNode').next(), I get an XMLSyntaxError at the point where the data was cut off.

Is there any way I can make it so I can receive the first tag (i.e. the first MyNode) and only get an exception when I reach that part of the document? (To make lxml really 'stream' the contents and not read the whole thing in the beginning).

Upvotes: 5

Answers (3)

Mikhail T.

Reputation: 3997

Ten years later, my "solution" is still to reparse the XML file (or blob) from the beginning every time new data arrives (in the example below, select() reports the arrival). To avoid reacting again to "events" that were already processed, I keep a count of the ones already handled.

In the code below, processing of a new event consists of simply logging it. But it could be anything else, of course.

My only justification for the reparsing is that the files I'm dealing with are small -- no more than 100 elements, usually under 10. But the XML-text arrives sporadically and I want the new arrivals reported immediately, without waiting for the sending process to finish.

I wish there were a way to tell xml.etree.ElementTree to resume parsing a file on which it has thrown an error earlier, but there is not...

Maybe we ought to use something under xml.sax.* instead...

The code below should work with Python 2.x and 3.x:

reported = 0    # count of the already-reported events
while True:
    ...
    r, w, x = select.select([reader], writers, [reader])
    if x:
        log.warning('Exceptions %s', x)
    if w:
        ...
    if not r:
        continue

    chunk = os.read(reader, 251)
    if not chunk:
        log.debug('There was nothing to read')
        continue
    ...
    # XXX: Here we repeatedly reparse the XML-text collected so
    # XXX: far -- cannot find another way to reliably report new entries.
    # XXX: To avoid reprinting the earlier entries, we keep count...
    count = 0
    try:
        # The logfile is not complete yet, so parsing will
        # eventually fail:
        for event, element in ET.iterparse(logfile):
            if event != 'end' or not element.text:
                continue
            count += 1
            if count > reported:
                log.info('%s: %s', element.tag, element.text)
    except SyntaxError:
        # In Python 2.6 this is a syntax-error
        pass
    except ET.ParseError:
        # In later Pythons this is a ParseError
        pass
    reported = count
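
On the xml.sax thought above: here is a minimal sketch, assuming Python 3 and the expat-based parser that xml.sax.make_parser() returns, which can be fed chunk by chunk. The TextLogger class and the log/entry element names are made up for illustration. It reports an element as soon as its closing tag arrives, roughly like the 'end' filter in the loop above, without reparsing earlier data.

```python
import xml.sax


class TextLogger(xml.sax.ContentHandler):
    """Hypothetical handler: records (tag, text) when an element closes."""

    def __init__(self):
        super().__init__()
        self.entries = []   # completed elements with non-blank text
        self._stack = []    # one text buffer per currently open element

    def startElement(self, name, attrs):
        self._stack.append([])

    def characters(self, content):
        if self._stack:
            self._stack[-1].append(content)

    def endElement(self, name):
        text = "".join(self._stack.pop()).strip()
        if text:
            self.entries.append((name, text))


handler = TextLogger()
parser = xml.sax.make_parser()      # incremental parser with feed()
parser.setContentHandler(handler)

# Two chunks cut at arbitrary points; the document is never completed,
# so close() is deliberately not called (it would report the truncation).
for chunk in (b"<log><entry>first</ent", b"ry><entry>sec"):
    parser.feed(chunk)

print(handler.entries)
```

Only the first entry is reported here, because the second one has not been closed yet. Calling parser.close() at this point would raise an error for the unfinished document.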

Upvotes: 0

Sascha Gottfried

Reputation: 3329

Try to learn from the answers to two questions related to your problem. You can find more wisdom in other related answers. Your problem is very common; maybe you need to tweak it a bit to fit a proven solution. Prefer that route to get a stable solution.

Upvotes: -3

tdelaney

Reputation: 77347

XMLPullParser and HTMLPullParser may better suit your needs. They get their data through repeated calls to parser.feed(data). You still have to wait until all of the data comes in before the tree is usable.
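
A minimal sketch of the feed() pattern, using the standard library's xml.etree.ElementTree.XMLPullParser so that it runs without extra packages. As far as I know, lxml.etree.XMLPullParser offers the same feed() and read_events() methods. The feed_chunks helper is a made-up name, and the sample input assumes the first MyNode from the question is already closed.

```python
import xml.etree.ElementTree as ET


def feed_chunks(chunks, tag="MyNode"):
    """Feed chunks to a pull parser and collect the attributes of every
    element named `tag` whose end tag has been seen so far."""
    parser = ET.XMLPullParser(events=("end",))
    seen = []
    for chunk in chunks:
        parser.feed(chunk)
        # read_events() yields only the events that are ready right now
        for event, elem in parser.read_events():
            if elem.tag == tag:
                seen.append(dict(elem.attrib))
    # parser.close() is not called: the document is still incomplete
    return seen


# The second chunk stops in the middle of an attribute value.
chunks = ['<MyXML>\n    <MyNode foo="bar"/>\n', '    <MyNode foo="ba']
completed = feed_chunks(chunks)
print(completed)
```

The truncated second MyNode raises nothing at this stage; an error would only show up once the parser can tell that the input is malformed, or when close() is called on an unfinished document.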

Upvotes: 2
