user496934

Reputation: 4020

XML tool design problem

I was asked this question at an interview. Of course there are many approaches to the solution, but I just wanted to know if there is some really good approach that stands out. There is a huge XML file of 2 GB stored on the hard disk of a low-end PC that has 512 MB of RAM. The XML file stores timestamps and corresponding string values. I have to design a tool that parses the XML file to get specific information, such as the string for a particular timestamp. The interviewer was not concerned about the searching technique in the tool; he wanted a high-level approach to the design of the tool, considering only the 512 MB of RAM and the 2 GB size of the file. Are there any interesting design approaches to this?

Upvotes: 2

Views: 237

Answers (3)

bdoughan

Reputation: 149037

Instead of SAX, I would use the StAX APIs in Java SE 6 for this use case. The code below is from an answer of mine to a similar question. StAX is used to split a large XML file into several smaller files:

import java.io.*;
import javax.xml.stream.*;
import javax.xml.transform.*;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

public class Demo {

    public static void main(String[] args) throws Exception  {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        // Use an InputStream so the parser can honour the encoding declared in the XML prolog.
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileInputStream("input.xml"));
        xsr.nextTag(); // Advance to the root (statements) element

        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
        while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
            // Copy each child element to its own file, named after its "account" attribute.
            File file = new File("out/" + xsr.getAttributeValue(null, "account") + ".xml");
            t.transform(new StAXSource(xsr), new StreamResult(file));
        }
        xsr.close();
    }

}

In a similar answer, skaffman describes how StAX can be used to process an XML document in chunks; in his answer, JAXB is used to unmarshal each chunk.
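As a rough illustration of that chunk-by-chunk approach (not the linked answer's exact code), here is a minimal sketch. It assumes, purely for the example, that the file looks like <entries><entry timestamp="...">value</entry>...</entries>; the Entry class, element names, file name and timestamp literal are all made up:

import java.io.FileInputStream;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.*;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxJaxbLookup {

    // Hypothetical chunk class for one <entry timestamp="...">value</entry> element.
    @XmlAccessorType(XmlAccessType.FIELD)
    public static class Entry {
        @XmlAttribute public String timestamp;
        @XmlValue public String value;
    }

    public static void main(String[] args) throws Exception {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileInputStream("input.xml"));
        xsr.nextTag(); // position on the root element
        xsr.nextTag(); // position on the first <entry>

        Unmarshaller um = JAXBContext.newInstance(Entry.class).createUnmarshaller();
        while (xsr.getEventType() == XMLStreamConstants.START_ELEMENT) {
            // Unmarshal only the current element; memory use stays bounded by one chunk.
            Entry entry = um.unmarshal(xsr, Entry.class).getValue();
            if ("2011-01-01T00:00:00".equals(entry.timestamp)) { // example timestamp
                System.out.println(entry.value);
                break; // stop streaming once the wanted entry is found
            }
            // Skip whitespace between elements so the loop test sees the next start tag.
            while (xsr.getEventType() == XMLStreamConstants.CHARACTERS) {
                xsr.next();
            }
        }
        xsr.close();
    }
}

Because only one Entry object exists at a time, memory use stays flat no matter how large the file is, which is what matters on a 512 MB machine.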

Upvotes: 1

RHT

Reputation: 5054

There are two approaches to XML parsing: 1) using a DOM parser, and 2) using a SAX parser. Trying to parse a 2 GB file with 512 MB of RAM using a DOM parser is practically guaranteed to end in an OutOfMemoryError, so go with a SAX parser, which will also be faster since you already know what you are looking for.
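A minimal SAX sketch of that idea, again assuming a hypothetical <entries><entry timestamp="...">value</entry>...</entries> layout (the element names, file name and timestamp are placeholders for this example):

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TimestampLookup extends DefaultHandler {

    private final String wantedTimestamp;
    private boolean inWantedEntry;
    private final StringBuilder text = new StringBuilder();
    private String result;

    TimestampLookup(String wantedTimestamp) {
        this.wantedTimestamp = wantedTimestamp;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Only the current element's attributes are in memory, never the whole document.
        if ("entry".equals(qName) && wantedTimestamp.equals(attrs.getValue("timestamp"))) {
            inWantedEntry = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inWantedEntry) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (inWantedEntry && "entry".equals(qName)) {
            result = text.toString();
            // Abort the parse early; throwing from the handler is the usual SAX idiom for "stop".
            throw new SAXException("found");
        }
    }

    public static void main(String[] args) throws Exception {
        TimestampLookup handler = new TimestampLookup("2011-01-01T00:00:00"); // example timestamp
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try {
            parser.parse(new File("input.xml"), handler);
        } catch (SAXException expected) {
            // thrown on purpose once the wanted entry has been read
        }
        System.out.println(handler.result);
    }
}

Aborting the parse as soon as the entry is found means that, on average, only part of the 2 GB file has to be read.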

Upvotes: 1

Patrick Bédert

Reputation: 33

Maybe the parsing should be done with SAX instead of DOM, because with a DOM parser you have the complete document in memory before you can access the data. If I understand you correctly, you already know from the beginning which timestamps you are interested in, so you could use a SAX parser to pick up the corresponding string values, which should be faster and should not consume nearly as much memory.
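If several timestamps are known up front, a variation on the same sketch (same assumed <entry timestamp="...">value</entry> layout, made-up names) keeps them in a set and collects every match in one streaming pass:

import java.io.File;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class MultiTimestampLookup extends DefaultHandler {

    private final Set<String> wanted;
    private final Map<String, String> results = new HashMap<String, String>();
    private String currentTimestamp;            // timestamp of the <entry> being read, if wanted
    private final StringBuilder text = new StringBuilder();

    MultiTimestampLookup(Set<String> wanted) {
        this.wanted = wanted;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("entry".equals(qName)) {
            String ts = attrs.getValue("timestamp");
            currentTimestamp = wanted.contains(ts) ? ts : null; // remember only requested entries
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (currentTimestamp != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (currentTimestamp != null && "entry".equals(qName)) {
            results.put(currentTimestamp, text.toString());
            currentTimestamp = null;
        }
    }

    public static void main(String[] args) throws Exception {
        Set<String> wanted = new HashSet<String>(
                Arrays.asList("2011-01-01T00:00:00", "2011-01-02T00:00:00")); // example timestamps
        MultiTimestampLookup handler = new MultiTimestampLookup(wanted);
        SAXParserFactory.newInstance().newSAXParser().parse(new File("input.xml"), handler);
        System.out.println(handler.results); // timestamp -> string value for every match found
    }
}

Memory use here grows with the number of requested timestamps, not with the size of the file.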

Upvotes: 1
