hinsbergen

Reputation: 147

Optimizing speed of parsing XML file using VTD-XML

I am parsing a large number of XML files using VTD-XML. I am not sure whether I am using the tool correctly (I think I am), but parsing the files is taking too long.

The XML files (in DATEX II format) are stored zipped on disk. Unpacked, each file is about 31 MB and contains just over 850,000 lines of text. I need to extract only a few fields and store them in a database.

import com.ximpleware.*;
import org.apache.commons.lang3.math.NumberUtils;
...

private static void test(File zipFile) throws XPathEvalException, NavException, XPathParseException {
    // init timer
    long step1=System.currentTimeMillis();

    // parse the zipped XML file directly with VTD-XML
    VTDGen vg = new VTDGen();
    vg.parseZIPFile(zipFile.getAbsolutePath(), zipFile.getName().replace(".zip",".xml"),true);

    VTDNav vn = vg.getNav();

    AutoPilot apSites = new AutoPilot();
    apSites.declareXPathNameSpace("ns1", "http://schemas.xmlsoap.org/soap/envelope/");
    apSites.selectXPath("/ns1:Envelope/ns1:Body/d2LogicalModel/payloadPublication/siteMeasurements");
    apSites.bind(vn);

    long step2=System.currentTimeMillis();
    System.out.println("Prep took "+(step2-step1)+"ms; ");

    // init variables
    String siteID, timeStr;
    boolean reliable;
    int index, flow, ctr=0;
    short speed;
    while(apSites.evalXPath()!=-1) {

        vn.toElement(VTDNav.FIRST_CHILD, "measurementSiteReference");
        siteID = vn.toString(vn.getText());

        // loop all measured values of this measurement site
        while(vn.toElement(VTDNav.NEXT_SIBLING, "measuredValue")) {
            ctr++;

            // extract index attribute
            index = NumberUtils.toInt(vn.toString(vn.getAttrVal("index")));

            // go one level deeper into basicDataValue
            vn.toElement(VTDNav.FIRST_CHILD, "basicDataValue");

            // we need either FIRST_CHILD or NEXT_SIBLING depending on whether we find something
            int next = VTDNav.FIRST_CHILD;
            if(vn.toElement(next, "time")) {
                timeStr = vn.toString(vn.getText());
                next = VTDNav.NEXT_SIBLING;
            }

            if(vn.toElement(next, "averageVehicleSpeed")) {
                speed = NumberUtils.toShort(vn.toString(vn.getText()));
                next = VTDNav.NEXT_SIBLING;
            }

            if(vn.toElement(next, "vehicleFlow")) {
                flow = NumberUtils.toInt(vn.toString(vn.getText()));
                next = VTDNav.NEXT_SIBLING;
            }

            if(vn.toElement(next, "fault")) { 
                reliable = vn.toString(vn.getText()).equals("0");
            }

            // insert into database here...

            if(next==VTDNav.NEXT_SIBLING) {
                vn.toElement(VTDNav.PARENT);
            }
            vn.toElement(VTDNav.PARENT);
        }

    }
    System.out.println("Loop took "+(System.currentTimeMillis()-step2)+"ms; ");
    System.out.println("Total number of measured values: "+ctr);
}

The output of the exact function above for my XML files is:

Prep took 25756ms; 
Loop took 26889ms; 
Total number of measured values: 112611

No data is actually inserted into the database yet. The problem is that I receive one of these files every minute. The total parsing time is already close to a minute, and since downloading a file takes about 10 seconds and I still need to write the results to the database, I am falling behind real time.
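To find out whether the 25 seconds of prep time go into decompressing the zip or into the VTD parse itself, I am considering timing the two steps separately. An untested sketch of what I have in mind (it assumes each archive contains exactly one XML entry):

import com.ximpleware.VTDGen;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.zip.ZipInputStream;

private static void timeUnzipAndParse(File zipFile) throws Exception {
    long t0 = System.currentTimeMillis();

    // step 1: decompress the single XML entry into memory
    byte[] xml;
    try (ZipInputStream zis = new ZipInputStream(new FileInputStream(zipFile))) {
        zis.getNextEntry(); // position at the XML entry (assumes one entry per archive)
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = zis.read(buf)) != -1) {
            bos.write(buf, 0, n);
        }
        xml = bos.toByteArray();
    }
    long t1 = System.currentTimeMillis();
    System.out.println("Unzip took " + (t1 - t0) + "ms");

    // step 2: parse the in-memory document with VTD-XML
    VTDGen vg = new VTDGen();
    vg.setDoc(xml);
    vg.parse(true); // namespace-aware, like the parseZIPFile call above
    long t2 = System.currentTimeMillis();
    System.out.println("VTD parse took " + (t2 - t1) + "ms");
}

If most of the prep time turns out to be decompression rather than parsing, that would change where I should be optimizing.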

Is there any way to speed this up? Things I've tried that didn't help:

Does anybody see a way to speed things up, or do I need to start thinking about a heavier machine or multithreading? Of course, 850,000 lines per minute (roughly 1.2 billion lines per day) is a lot, but I still feel it shouldn't take a minute to parse 31 MB of data...
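If a single file really cannot be parsed much faster, the multithreading route would roughly mean handing each downloaded archive to a small thread pool, so that one slow parse does not delay the next minute's file. An untested sketch (it would live in the same class as test() above; the pool size of 4 is just a guess, and it assumes the files can be parsed independently):

import java.io.File;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

private static void parseAll(List<File> downloadedZips) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(4); // pool size is a guess

    for (File zip : downloadedZips) {
        pool.submit(() -> {
            try {
                test(zip); // the method from the question
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
    }

    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS); // wait for the current batch to finish
}

That only hides the latency behind more cores, though; it does not make an individual file any faster.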

Upvotes: 4

Views: 1181

Answers (1)

Cam Connor

Reputation: 1231

You could try unzipping the folder right away and storing a reference to every XML file in an array with

File[] files = new File("foldername").listFiles();

and then make a loop that goes through every file, as in the sketch below. I'm not sure if this would speed it up, but it's worth a shot.
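Roughly something like this (untested; it assumes the archives are already extracted into foldername, and parseXmlFile is a hypothetical variant of your test() method that calls vg.parseFile(...) on a plain XML file instead of vg.parseZIPFile(...)):

// list all already-unzipped XML files in the folder
File[] files = new File("foldername").listFiles();

if (files != null) {
    for (File f : files) {
        if (f.getName().endsWith(".xml")) {
            parseXmlFile(f); // hypothetical: your test() logic, but using vg.parseFile(...)
        }
    }
}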

Upvotes: 1
