Reputation: 529
I'm trying to parse a large XML file which is being received from the network in Python.
To do that, I read the data and pass it to lxml.etree.iterparse.
However, if the XML has not been fully sent yet, like so:
<MyXML>
<MyNode foo="bar">
<MyNode foo="ba
If I run etree.iterparse(f, tag='MyNode').next(), I get an XMLSyntaxError at the point where the data was cut off.
Is there any way I can make it so I can receive the first tag (i.e. the first MyNode) and only get an exception when I reach that part of the document? (To make lxml really 'stream' the contents and not read the whole thing in the beginning).
Upvotes: 5
Views: 2756
Reputation: 3997
Ten years later, my "solution" is still to reparse the XML file (or blob) from the beginning any time new data arrives (which in the example below is reported by select()). To avoid reacting to already-processed "events", I keep count of the ones I have already reacted to.
In the code below, processing of a new event consists of simply logging it. But it could be anything else, of course.
My only justification for the reparsing is that the files I'm dealing with are small -- no more than 100 elements, usually under 10. But the XML-text arrives sporadically and I want the new arrivals reported immediately, without waiting for the sending process to finish.
I wish there was a way to tell xml.etree.ElementTree to resume parsing a file for which it has thrown an error earlier, but there is not... Maybe we ought to use something under xml.sax.* instead...
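A minimal sketch of what the xml.sax route could look like, assuming the default Expat-based reader returned by xml.sax.make_parser(), which supports incremental feed() calls. The NodeLogger handler and the sample log/entry document are hypothetical and only serve as illustration:

```python
import xml.sax


class NodeLogger(xml.sax.ContentHandler):
    """Hypothetical handler: records (tag, text) for every element that closes."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.finished = []  # (tag, text) pairs in document order
        self._text = []     # one text buffer per currently open element

    def startElement(self, name, attrs):
        self._text.append([])

    def characters(self, content):
        if self._text:
            self._text[-1].append(content)

    def endElement(self, name):
        text = ''.join(self._text.pop()).strip()
        if text:
            self.finished.append((name, text))


handler = NodeLogger()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

# The parser keeps its state between feed() calls, so nothing is reparsed.
parser.feed('<log><entry>first</entry><ent')
after_first_chunk = list(handler.finished)

parser.feed('ry>second</entry></log>')
parser.close()  # signals end of input; raises if the document is still incomplete
after_close = list(handler.finished)
```

Elements that were fully received are reported as soon as their chunk is fed. Depending on the Expat version, an element split across chunks may only be reported after more data arrives or after close().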
The code below should work with both Python 2.x and 3.x:
    reported = 0  # count of the already-reported events
    while True:
        ...
        r, w, x = select.select([reader], writers, [reader])
        if x:
            log.warning('Exceptions %s', x)
        if w:
            ...
        if not r:
            continue
        chunk = os.read(reader, 251)
        if not chunk:
            log.debug('There was nothing to read')
            continue
        ...
        # XXX: Here we repeatedly reparse the XML-text collected so
        # XXX: far -- cannot find another way to reliably report new entries.
        # XXX: To avoid reprinting the earlier entries, we keep count...
        count = 0
        try:
            # The logfile is not complete yet, so parsing will
            # eventually fail:
            for event, element in ET.iterparse(logfile):
                if event != 'end' or not element.text:
                    continue
                count += 1
                if count > reported:
                    log.info('%s: %s', element.tag, element.text)
        except SyntaxError:
            # In Python 2.6 this is a syntax-error
            pass
        except ET.ParseError:
            # In later Pythons this is a ParseError
            pass
        reported = count
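The counting logic can also be isolated into a small function that is easy to try out on its own. This is only a sketch for Python 3: the name report_new, the in-memory buffer, and the sample log/entry document are assumptions made for illustration and are not part of the code above.

```python
import io
import xml.etree.ElementTree as ET


def report_new(xml_so_far, reported):
    """Reparse all bytes received so far.

    Returns (entries not reported before, total number of entries seen).
    """
    count = 0
    fresh = []
    try:
        for event, element in ET.iterparse(io.BytesIO(xml_so_far)):
            if event != 'end' or not element.text:
                continue
            count += 1
            if count > reported:
                fresh.append((element.tag, element.text))
    except ET.ParseError:
        # Expected for as long as the document is incomplete.
        pass
    return fresh, count


received = b'<log><entry>a</entry><ent'
first, reported = report_new(received, 0)

received += b'ry>b</entry></log>'
second, reported = report_new(received, reported)
```

Each call costs time proportional to everything received so far, which is why this only makes sense for small documents.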
Upvotes: 0
Reputation: 3329
Try to learn from the answers to two questions related to your problem. You can find more wisdom in other related answers. Your problem is very common; maybe you need to tweak it a bit to fit a proven solution. Prefer that route to build a stable solution.
Upvotes: -3
Reputation: 77347
XMLPullParser and HTMLPullParser may better suit your needs. They get their data through repeated calls to parser.feed(data). You still have to wait until all of the data comes in before the tree is usable.
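A rough sketch of the feed()/read_events() pattern, using the document from the question. To keep it free of third-party dependencies it uses the standard library's xml.etree.ElementTree.XMLPullParser; the assumption is that lxml.etree.XMLPullParser can be driven the same way. The feed_chunk helper is hypothetical.

```python
import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(events=('end',))


def feed_chunk(chunk):
    """Feed one chunk and return the MyNode elements completed so far."""
    parser.feed(chunk)
    return [elem for event, elem in parser.read_events() if elem.tag == 'MyNode']


# The first MyNode is complete, the second one is cut off mid-attribute.
early = feed_chunk('<MyXML><MyNode foo="bar"/><MyNode foo="ba')
early_foo = [node.get('foo') for node in early]

# Depending on the Expat version, the second node may be reported right after
# this feed() or only once close() marks the end of the input.
late = feed_chunk('z"/></MyXML>')
parser.close()
late += [elem for event, elem in parser.read_events() if elem.tag == 'MyNode']
late_foo = [node.get('foo') for node in late]
```

Individual elements become available as soon as their closing tag has been parsed, and no exception is raised for the truncated part until close() is called on incomplete input.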
Upvotes: 2