Jannat Arora

Reputation: 2989

Efficient XML parsing for 25GB data

My aim is to parse 25 GB of XML data. An example of such data is given below:

<Document>
<Data Id='12' category='1'  Body="abc"/>
<Data Id='13' category='1'  Body="zwq"/>
...
<Data Id='82018030' category='2' CorrespondingCategory1Id='13' Body="pqr"/>
</Document>

However, considering that I have 25 GB of data, my approach is quite inefficient. Please suggest some way to improve my code or an alternative approach. Also, kindly include a small example to make things clearer.

Upvotes: 2

Views: 1059

Answers (6)

Jannat Arora

Reputation: 2989

Try using iterparse from lxml. I think it will suit the problem you wish to handle.
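
A minimal sketch of what that could look like for the structure in the question (the filename is an assumption, and this is untested):

from lxml import etree

cat_one_bodies = {}

# iterparse yields each element as its end tag is seen, so the whole
# 25 GB file never has to be in memory at once.
for event, elem in etree.iterparse('myfile25GB', tag='Data'):
    if elem.get('category') == '1':
        cat_one_bodies[elem.get('Id')] = elem.get('Body')
    else:
        cat1 = cat_one_bodies.get(elem.get('CorrespondingCategory1Id'))
        print("Cat1 Body: '%s' Cat2 Body: '%s'" % (cat1, elem.get('Body')))
    # Clear processed elements and their already-handled siblings,
    # otherwise lxml still builds up the full tree behind you.
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]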

Upvotes: 0

user1096188

Reputation: 1839

If the IDs are in ascending order, then you can roll your own function that reads an element at any byte position in the file. You can then scan the entire file once and, for each element, find its corresponding element with a binary search over the file. This runs in O(n log n) while still using a negligible amount of memory; a sketch follows.
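
A rough sketch of the offset-based lookup this describes, assuming each element sits on its own line as in the question's sample (the helper names are hypothetical, and edge cases such as the file's very first line are glossed over):

import re

ID_RE = re.compile(rb"Id='(\d+)'")

def element_at(f, pos):
    """Seek to an arbitrary byte offset and return the Id and text of
    the first complete line starting at or after that offset."""
    f.seek(pos)
    f.readline()                   # discard the (likely partial) current line
    line = f.readline()
    m = ID_RE.search(line)
    return (int(m.group(1)) if m else None), line

def find_by_id(f, target_id, file_size):
    """Binary search over byte offsets for the element with target_id."""
    lo, hi = 0, file_size
    while lo < hi:
        mid = (lo + hi) // 2
        elem_id, line = element_at(f, mid)
        if elem_id == target_id:
            return line
        if elem_id is not None and elem_id < target_id:
            lo = mid + 1
        else:
            hi = mid
    return None

# Usage (binary mode, so seeking to arbitrary offsets is allowed):
#   import os
#   with open('myfile25GB', 'rb') as f:
#       line = find_by_id(f, 13, os.path.getsize('myfile25GB'))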

Upvotes: 0

Michael Kay

Reputation: 163625

In XSLT 3.0 (draft) as currently implemented in Saxon-EE, you can write a streaming transformation that solves this problem as follows:

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:map="http://www.w3.org/2005/xpath-functions/map">
<xsl:mode streamable="yes"/>
<xsl:template match="/">
  <xsl:iterate select="Document/Data">
    <xsl:param name="map" select="map{}"/>
    <xsl:choose>
      <xsl:when test="@category='1'">
        <xsl:next-iteration>
          <xsl:with-param name="map" select="map:put($map, string(@Id), string(@Body))"/>
        </xsl:next-iteration>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="'Cat1 Body: ',
                              $map(string(@CorrespondingCategory1Id)), 'Cat2 Body: ', @Body"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:iterate>
</xsl:template>
</xsl:stylesheet>

I haven't tested this (it's late at night on the eve of a four-day holiday...) but if you are interested in pursuing this approach I'll be happy to help. XSLT 3.0 is still a draft spec and fairly fluid. Its focus is on solving problems like this one using a streaming approach that handles very large documents using bounded memory. Saxon-EE 9.4 implements a snapshot of the spec.

Upvotes: 0

jagill

Reputation: 692

Your initial algorithm runs in O(n^2), which will be very slow for 25GB of data. Ideally you will get it down to O(n) or O(n log n). In the absence of any other information about the data (like whether category 1 or 2 is smaller, etc), you can do something like this (which is O(n)):

from lxml import objectify

# Note: this reads the whole file into memory, so it illustrates the
# O(n) algorithm rather than something to run on the 25 GB file itself.
with open('myfile25GB', 'r') as f:
    text = f.read()
root = objectify.fromstring(text)

cat_one_bodies = {}
for e in root.Data:
    category = int(e.attrib['category'])
    body = e.attrib['Body']
    if category == 1:
        e_id = int(e.attrib['Id'])
        cat_one_bodies[e_id] = body
    else:  # assuming there are only 2 categories
        cat_one_id = int(e.attrib['CorrespondingCategory1Id'])
        print("Cat1 Body: '%s' Cat2 Body: '%s'" % (body, cat_one_bodies[cat_one_id]))

Although this doesn't parse your file, hopefully it shows you the idea. It potentially uses quite a bit of memory (since it's maintaining all the category1 bodies in the dictionary), so that may be a consideration.

Upvotes: 0

kindall

Reputation: 184395

You might find that a SAX parser works better for this task. Rather than building a DOM, a SAX parser turns the XML file into a stream of elements and calls functions you provide to let you handle each element.

The good thing is that SAX parsers can be very fast and memory-efficient compared to DOM parsers, and some don't even need to be given all the XML at once, which would be ideal when you have 25 GB of it.

Unfortunately, if you need any context information, like "I want tag <B> but only if it's inside tag <A>," you must maintain it yourself, since all the parser gives you is "start tag <A>, start tag <B>, end tag <B>, end tag <A>." It never explicitly tells you that tag <B> is inside tag <A>; you have to figure that out from what you saw. And once you have seen an element, it's gone unless you remembered it yourself.

This gets very hairy for complex parsing jobs, but yours is probably manageable.

It happens that Python's standard library has a SAX parser in xml.sax. You probably want something like xml.sax.xmlreader.IncrementalParser.
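
A minimal sketch of such a handler for the structure in the question (the filename is an assumption):

import xml.sax

class DataHandler(xml.sax.ContentHandler):
    """Remembers category-1 bodies; prints a match for each category-2 element."""
    def __init__(self):
        super().__init__()
        self.cat_one_bodies = {}

    def startElement(self, name, attrs):
        if name != 'Data':
            return
        if attrs['category'] == '1':
            self.cat_one_bodies[attrs['Id']] = attrs['Body']
        else:
            cat1 = self.cat_one_bodies.get(attrs['CorrespondingCategory1Id'])
            print("Cat1 Body: '%s' Cat2 Body: '%s'" % (cat1, attrs['Body']))

# xml.sax.parse streams the file through the handler.
xml.sax.parse('myfile25GB', DataHandler())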

Upvotes: 4

RotaJota

Reputation: 841

My first suggestion from looking at your question would be to use a relational database such as MySQL or SQLite. It wouldn't be hard to get your XML data into such a database, and querying that data would then be both more straightforward and faster. A sketch of the import step follows.
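
A rough sketch of the import, streaming the file with the standard library's iterparse into sqlite3 (the table and file names are assumptions):

import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect('data.db')
conn.execute("""CREATE TABLE IF NOT EXISTS data
                (id INTEGER PRIMARY KEY, category INTEGER,
                 corresponding_id INTEGER, body TEXT)""")

# Stream the document element by element so the 25 GB file
# never has to fit in memory.
for event, elem in ET.iterparse('myfile25GB'):
    if elem.tag == 'Data':
        conn.execute("INSERT INTO data VALUES (?, ?, ?, ?)",
                     (int(elem.get('Id')), int(elem.get('category')),
                      elem.get('CorrespondingCategory1Id'), elem.get('Body')))
        elem.clear()
conn.commit()

# Matching then becomes a simple self-join:
#   SELECT a.body, b.body FROM data a JOIN data b
#   ON a.corresponding_id = b.id WHERE a.category = 2;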

Upvotes: 0
