Reputation: 147
I am parsing a large number of XML files using VTD-XML. I think I am using the tool correctly, but parsing the files takes too long.
The XML files (in DATEX II format) are stored zipped on the hard disk. Unpacked, each is about 31 MB and contains just over 850.000 lines of text. I need to extract only a few fields and store them in a database.
import java.io.File;
import com.ximpleware.AutoPilot;
import com.ximpleware.NavException;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;
import com.ximpleware.XPathEvalException;
import com.ximpleware.XPathParseException;
import org.apache.commons.lang3.math.NumberUtils;
...
private static void test(File zipFile) throws XPathEvalException, NavException, XPathParseException {
    // init timer
    long step1 = System.currentTimeMillis();
    // unzip and parse the XML entry inside the ZIP file
    VTDGen vg = new VTDGen();
    vg.parseZIPFile(zipFile.getAbsolutePath(), zipFile.getName().replace(".zip", ".xml"), true);
    VTDNav vn = vg.getNav();
    AutoPilot apSites = new AutoPilot();
    apSites.declareXPathNameSpace("ns1", "http://schemas.xmlsoap.org/soap/envelope/");
    apSites.selectXPath("/ns1:Envelope/ns1:Body/d2LogicalModel/payloadPublication/siteMeasurements");
    apSites.bind(vn);
    long step2 = System.currentTimeMillis();
    System.out.println("Prep took " + (step2 - step1) + "ms; ");
    // init variables
    String siteID, timeStr;
    boolean reliable;
    int index, flow, ctr = 0;
    short speed;
    while (apSites.evalXPath() != -1) {
        vn.toElement(VTDNav.FIRST_CHILD, "measurementSiteReference");
        siteID = vn.toString(vn.getText());
        // loop all measured values of this measurement site
        while (vn.toElement(VTDNav.NEXT_SIBLING, "measuredValue")) {
            ctr++;
            // extract index attribute
            index = NumberUtils.toInt(vn.toString(vn.getAttrVal("index")));
            // go one level deeper into basicDataValue
            vn.toElement(VTDNav.FIRST_CHILD, "basicDataValue");
            // we need either FIRST_CHILD or NEXT_SIBLING depending on whether we find something
            int next = VTDNav.FIRST_CHILD;
            if (vn.toElement(next, "time")) {
                timeStr = vn.toString(vn.getText());
                next = VTDNav.NEXT_SIBLING;
            }
            if (vn.toElement(next, "averageVehicleSpeed")) {
                speed = NumberUtils.toShort(vn.toString(vn.getText()));
                next = VTDNav.NEXT_SIBLING;
            }
            if (vn.toElement(next, "vehicleFlow")) {
                flow = NumberUtils.toInt(vn.toString(vn.getText()));
                next = VTDNav.NEXT_SIBLING;
            }
            if (vn.toElement(next, "fault")) {
                reliable = vn.toString(vn.getText()).equals("0");
            }
            // insert into database here...
            // if we moved into a child of basicDataValue, step back up to it first
            if (next == VTDNav.NEXT_SIBLING) {
                vn.toElement(VTDNav.PARENT);
            }
            // step back up from basicDataValue to measuredValue
            vn.toElement(VTDNav.PARENT);
        }
    }
    System.out.println("Loop took " + (System.currentTimeMillis() - step2) + "ms; ");
    System.out.println("Total number of measured values: " + ctr);
}
The output of exactly the function above for my XML files is:
Prep took 25756ms;
Loop took 26889ms;
Total number of measured values: 112611
No data is actually inserted into the database right now. Now the problem is that I receive one of these files every minute. The total parsing time is nearly 1 minute now, and because downloading the files takes about 10 seconds and I need to store stuff away in the database, I'm running behind real-time now.
Is there any way to speed this up? Things I've tried that didn't help:
Does anybody see a possibility to speed things up, or do I need to start thinking about a heavier machine / multi threading? Of course, 850.000 lines per minute (1.2 billion lines per day) is a lot, but I still do feel that it shouldn't take a minute to parse 31MB of data...
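Since the "Prep" figure above covers both decompression and VTD parsing, one way to narrow things down is to time the unzip step on its own. The following is only a minimal sketch using java.util.zip from the standard library; the class name UnzipTiming and the helper readEntry are hypothetical names made up for illustration, not part of VTD-XML.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class UnzipTiming {
    /** Reads one named entry of a ZIP archive fully into memory (hypothetical helper). */
    static byte[] readEntry(String zipPath, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath)) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IOException("No entry " + entryName + " in " + zipPath);
            }
            try (InputStream in = zip.getInputStream(entry);
                 ByteArrayOutputStream out = new ByteArrayOutputStream()) {
                byte[] buf = new byte[64 * 1024];
                int n;
                // copy the decompressed stream into the in-memory buffer
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] = path to the ZIP file, args[1] = name of the XML entry inside it
        long t0 = System.currentTimeMillis();
        byte[] xml = readEntry(args[0], args[1]);
        long t1 = System.currentTimeMillis();
        System.out.println("Unzip took " + (t1 - t0) + "ms for " + xml.length + " bytes");
        // The byte array could then be handed to the parser and timed separately,
        // for example via VTDGen.setDoc(xml) followed by VTDGen.parse(true).
    }
}
```

If the unzip time turns out to be small, the remaining "Prep" time is spent in the parser itself; if it is large, the bottleneck is disk access or decompression rather than VTD-XML.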
Upvotes: 4
Views: 1181
Reputation: 1231
You could try unzipping the folder up front and collecting every XML file in an array with
File[] files = new File("foldername").listFiles();
and then loop over every file. I'm not sure whether this would speed things up, but it's worth a shot.
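A rough sketch of that loop, assuming the archives have already been unpacked into a single folder; the class name FolderLoop, the helper listXmlFiles, and the ".xml" filter are assumptions for illustration, and the per-file parsing is left as a placeholder:

```java
import java.io.File;
import java.util.Arrays;

public class FolderLoop {
    /** Returns the .xml files directly inside the folder, sorted by path (hypothetical helper). */
    static File[] listXmlFiles(File folder) {
        File[] files = folder.listFiles((dir, name) -> name.toLowerCase().endsWith(".xml"));
        if (files == null) {
            // listFiles returns null if the path is not a directory or cannot be read
            return new File[0];
        }
        Arrays.sort(files);
        return files;
    }

    public static void main(String[] args) {
        // "foldername" is the placeholder folder from the line above
        for (File xmlFile : listXmlFiles(new File("foldername"))) {
            // parse xmlFile here, e.g. with the VTD-XML code from the question
            System.out.println("Would parse " + xmlFile.getName());
        }
    }
}
```

Note that listFiles returns null rather than an empty array when the folder does not exist, so the null check matters.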
Upvotes: 1