Creating large CSV files in Java is getting really slow

Question

I have a performance problem when creating a CSV file from another CSV file. This is what the original file looks like:

country,state,co,olt,olu,splitter,ont,cpe,cpe.latitude,cpe.longitude,cpe.customer_class,cpe.phone,cpe.ip,cpe.subscriber_id
COUNTRY-0001,STATE-0001,CO-0001,OLT-0001,OLU0001,SPLITTER-0001,ONT-0001,CPE-0001,28.21487,77.451775,ALL,SIP:+674100002743@IMS.COMCAST.NET,SIP:E28EDADA06B2@IMS.COMCAST.NET,CPE_SUBSCRIBER_ID-QHLHW4
COUNTRY-0001,STATE-0002,CO-0002,OLT-0002,OLU0002,SPLITTER-0002,ONT-0002,CPE-0002,28.294018,77.068924,ALL,SIP:+796107443092@IMS.COMCAST.NET,SIP:58DD999D6466@IMS.COMCAST.NET,CPE_SUBSCRIBER_ID-AH8NJQ

There could potentially be millions of lines like this; I detected the problem with 1,280,000 lines.

This is the algorithm:

File csvInputFile = new File(csv_path);
FileReader frCsvInputFile = new FileReader(csvInputFile);
int blockSize = 409600;
BufferedReader brCsvInputFile = new BufferedReader(frCsvInputFile, blockSize);

String line = null;
StringBuilder sbIntermediate = new StringBuilder();
skipFirstLine(brCsvInputFile);
while ((line = brCsvInputFile.readLine()) != null) {
    createIntermediateStringBuffer(sbIntermediate, line.split(REGEX_COMMA));
}


private static void skipFirstLine(BufferedReader br) throws IOException {
    String line = br.readLine();
    String[] splitLine = line.split(REGEX_COMMA);
    LOGGER.debug("First line detected! ");
    createIndex(splitLine);
    createIntermediateIndex(splitLine);
}

private static void createIndex(String[] splitLine) {
    LOGGER.debug("START method createIndex.");
    for (int i = 0; i < splitLine.length; i++)
        headerIndex.put(splitLine[i], i);
    printMap(headerIndex);
    LOGGER.debug("COMPLETED method createIndex.");
}

private static void createIntermediateIndex(String[] splitLine) {

    LOGGER.debug("START method createIntermediateIndex.");
    com.tekcomms.c2d.xml.model.v2.Metadata_element[] metadata_element = null;
    String[] servicePath = newTopology.getElement().getEntity().getService_path().getLevel();

    if (newTopology.getElement().getMetadata() != null)
        metadata_element = newTopology.getElement().getMetadata().getMetadata_element();

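    // java.util.Arrays.toString prints the elements and tolerates a null array
    LOGGER.debug(java.util.Arrays.toString(servicePath));
    LOGGER.debug(java.util.Arrays.toString(metadata_element));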

    headerIntermediateIndex.clear();
    int indexIntermediateId = 0;
    for (int i = 0; i < servicePath.length; i++) {
        String level = servicePath[i];
        LOGGER.debug("level is: " + level);
        headerIntermediateIndex.put(level, indexIntermediateId);
        indexIntermediateId++;
        // its identifier is stored at the next index
        headerIntermediateIndex.put(level + "ID", indexIntermediateId);
        indexIntermediateId++;
    }
    // adding cpe.latitude,cpe.longitude,cpe.customer_class, it could be
    // better if it would be metadata as well.
    String labelLatitude = newTopology.getElement().getEntity().getLatitude();
    // indexIntermediateId++;
    headerIntermediateIndex.put(labelLatitude, indexIntermediateId);
    String labelLongitude = newTopology.getElement().getEntity().getLongitude();
    indexIntermediateId++;
    headerIntermediateIndex.put(labelLongitude, indexIntermediateId);
    String labelCustomerClass = newTopology.getElement().getCustomer_class();
    indexIntermediateId++;
    headerIntermediateIndex.put(labelCustomerClass, indexIntermediateId);

    // adding metadata
    // cpe.phone,cpe.ip,cpe.subscriber_id,cpe.vendor,cpe.model,cpe.customer_status,cpe.contact_telephone,cpe.address,
    // cpe.city,cpe.state,cpe.zip,cpe.bootfile,cpe.software_version,cpe.hardware_version
    // now i need to iterate over each Metadata_element belonging to
    // topology.element.metadata
    // are there any metadata?
    if (metadata_element != null && metadata_element.length != 0)
        for (int j = 0; j < metadata_element.length; j++) {
            String label = metadata_element[j].getLabel();
            label = label.toLowerCase();
            LOGGER.debug(" ==label: " + label + " index_pos: " + j);
            indexIntermediateId++;
            headerIntermediateIndex.put(label, indexIntermediateId);
        }

    printMap(headerIntermediateIndex);
    LOGGER.debug("COMPLETED method createIntermediateIndex.");
}

Reading the entire dataset of 1,280,000 lines takes 800 ms, so the problem is in this method:

private static void createIntermediateStringBuffer(StringBuilder sbIntermediate, String[] splitLine) throws ClassCastException,
        NullPointerException {

    LOGGER.debug("START method createIntermediateStringBuffer.");
    long start, end;
    start = System.currentTimeMillis();
    ArrayList hashes = new ArrayList();
    com.tekcomms.c2d.xml.model.v2.Metadata_element[] metadata_element = null;

    String[] servicePath = newTopology.getElement().getEntity().getService_path().getLevel();
    LOGGER.debug(servicePath.toString());

    if (newTopology.getElement().getMetadata() != null) {
        metadata_element = newTopology.getElement().getMetadata().getMetadata_element();
        LOGGER.debug(metadata_element.toString());
    }

    for (int i = 0; i < servicePath.length; i++) {
        String level = servicePath[i];
        LOGGER.debug("level is: " + level);
        if (splitLine.length > getPositionFromIndex(level)) {
            String name = splitLine[getPositionFromIndex(level)];
            sbIntermediate.append(name);
            hashes.add(name);
            sbIntermediate.append(REGEX_COMMA).append(HashUtils.calculateHash(hashes)).append(REGEX_COMMA);
            LOGGER.debug(" ==sbIntermediate: " + sbIntermediate.toString());
        }
    }

    //      end=System.currentTimeMillis();
    //      LOGGER.info("COMPLETED adding name hash. " + (end - start) + " ms. " + (end - start) / 1000 + " seg.");
    // adding cpe.latitude,cpe.longitude,cpe.customer_class, it should be
    // better if it would be metadata as well.
    String labelLatitude = newTopology.getElement().getEntity().getLatitude();
    if (splitLine.length > getPositionFromIndex(labelLatitude)) {
        String lat = splitLine[getPositionFromIndex(labelLatitude)];
        sbIntermediate.append(lat).append(REGEX_COMMA);
    }

    String labelLongitude = newTopology.getElement().getEntity().getLongitude();
    if (splitLine.length > getPositionFromIndex(labelLongitude)) {
        String lon = splitLine[getPositionFromIndex(labelLongitude)];
        sbIntermediate.append(lon).append(REGEX_COMMA);
    }
    String labelCustomerClass = newTopology.getElement().getCustomer_class();
    if (splitLine.length > getPositionFromIndex(labelCustomerClass)) {
        String customerClass = splitLine[getPositionFromIndex(labelCustomerClass)];
        sbIntermediate.append(customerClass).append(REGEX_COMMA);
    }
    //      end=System.currentTimeMillis();
    //      LOGGER.info("COMPLETED adding lat,lon,customer. " + (end - start) + " ms. " + (end - start) / 1000 + " seg.");
    // watch out: metadata is optional, it can appear as an empty string!
    if (metadata_element != null && metadata_element.length != 0)
        for (int j = 0; j < metadata_element.length; j++) {
            String label = metadata_element[j].getLabel();
            LOGGER.debug(" ==label: " + label + " index_pos: " + j);
            if (splitLine.length > getPositionFromIndex(label)) {
                String actualValue = splitLine[getPositionFromIndex(label)];
                if (!"".equals(actualValue))
                    sbIntermediate.append(actualValue).append(REGEX_COMMA);
                else
                    sbIntermediate.append("").append(REGEX_COMMA);
            } else
                sbIntermediate.append("").append(REGEX_COMMA);
            LOGGER.debug(" ==sbIntermediate: " + sbIntermediate.toString());
        }//for
    sbIntermediate.append("\n");
    end = System.currentTimeMillis();
    LOGGER.info("COMPLETED method createIntermediateStringBuffer. " + (end - start) + " ms. ");
}

As you can see, this method reads each line of the input CSV file, calculates new data from it, and appends the generated line to the StringBuilder, so that at the end I can create the output file from that buffer.

I have run jconsole and I can see no memory leaks; I can see the sawtooth pattern of objects being created and the GC collecting garbage. Memory usage never exceeds the heap threshold.

One thing I have noticed is that adding a new line to the StringBuilder initially takes only a few ms (5, 6, 10), but the time rises as the run goes on, to 100-200 ms, and I suspect it will keep growing, so this is probably the bottleneck.

I have tried to analyze the code. I know there are three for loops, but they are very short; the first loop iterates over only 8 elements:

for (int i = 0; i < servicePath.length; i++) {
    String level = servicePath[i];
    LOGGER.debug("level is: " + level);
    if (splitLine.length > getPositionFromIndex(level)) {
        String name = splitLine[getPositionFromIndex(level)];
        sbIntermediate.append(name);
        hashes.add(name);
        sbIntermediate.append(REGEX_COMMA).append(HashUtils.calculateHash(hashes)).append(REGEX_COMMA);
        LOGGER.debug(" ==sbIntermediate: " + sbIntermediate.toString());
    }
}

I have measured the time needed to get the name from splitLine and it is negligible, 0 ms; the same goes for the calculateHash method, 0 ms.

The other loops are practically the same: they iterate from 0 to n, where n is a very small int (3 to 10, for example), so I do not understand why the method takes longer and longer to finish. The only thing I can find is that adding a new line to the buffer is what slows the process down.

I am thinking about a multi-threaded producer-consumer strategy: a reader thread reads every line and puts it into a circular buffer; other threads take the lines one by one, process them, and add the precalculated line to a StringBuffer, which is thread-safe. When the file has been fully read, the reader thread sends a message to the other threads telling them to stop. Finally, I have to save this buffer to a file. What do you think? Is this a good idea?

maaartinus · Accepted Answer

I am thinking about a multi-threaded producer-consumer strategy: a reader thread reads every line and puts it into a circular buffer; other threads take the lines one by one, process them, and add the precalculated line to a StringBuffer, which is thread-safe. When the file has been fully read, the reader thread sends a message to the other threads telling them to stop. Finally, I have to save this buffer to a file. What do you think? Is this a good idea?

Maybe, but it's quite a lot of work; I'd try something simpler first.

line.split(REGEX_COMMA)

Your REGEX_COMMA is a string which gets compiled into a regex a million times. It's trivial, but I'd try to use a Pattern instead.
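
A minimal sketch of the precompiled Pattern idea, assuming REGEX_COMMA is the plain string ",". The class name CommaSplitter is hypothetical and not part of the question's code. Since Java 7, String.split already takes a fast path for a single-character separator that is not a regex metacharacter, so the gain may be small on a current JVM; it matters more on older JVMs or with a more complex separator.

```java
import java.util.regex.Pattern;

// Hypothetical helper: the pattern is compiled once and reused for every line.
final class CommaSplitter {
    private static final Pattern COMMA = Pattern.compile(",");

    // Same semantics as line.split(","): trailing empty fields are dropped.
    static String[] split(String line) {
        return COMMA.split(line);
    }
}
```

In the read loop this would replace line.split(REGEX_COMMA) with CommaSplitter.split(line).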

You're producing a lot of garbage with your split. Maybe you should avoid it by manually splitting the input into a reused ArrayList (it's just a few lines).
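
The manual split could look roughly like this. It is a sketch, not the asker's code: ManualSplit and splitInto are hypothetical names, and it assumes fields never contain quoted commas, as in the sample data. It avoids the regex machinery and the new String[] per line, although each field is still a fresh substring.

```java
import java.util.List;

// Hypothetical helper: splits on ',' into a caller-supplied list that is reused per line.
final class ManualSplit {
    static void splitInto(String line, List<String> out) {
        out.clear(); // reuse the same list instead of allocating a new array
        int start = 0;
        int comma;
        while ((comma = line.indexOf(',', start)) >= 0) {
            out.add(line.substring(start, comma));
            start = comma + 1;
        }
        out.add(line.substring(start)); // last field, possibly empty
    }
}
```

Unlike String.split, this keeps trailing empty fields, so a check such as splitLine.length > position behaves slightly differently for lines that end in commas.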

If all you need is writing the result into a file, it might be better to avoid building one huge String. Maybe a List<String> or even a List<StringBuilder> would be better; maybe writing directly to a buffered stream would do.
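
Writing directly to a buffered stream might be sketched as follows. The class StreamingCsvCopy is hypothetical, transform is only a stand-in for the question's per-line processing, and UTF-8 is an assumed encoding.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: each output line is written as soon as it is produced,
// so no buffer ever holds the whole file.
final class StreamingCsvCopy {
    // Stand-in for the real per-line work (an assumption for illustration).
    static String transform(String line) {
        return line.toUpperCase();
    }

    // Returns the number of data lines written (the header line is skipped).
    static long copy(Path in, Path out) throws IOException {
        long lines = 0;
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            reader.readLine(); // skip the header line
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(transform(line));
                writer.write('\n');
                lines++;
            }
        }
        return lines;
    }
}
```

Memory use then stays roughly constant regardless of the input size, because only the writer's internal buffer is held at any time.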

You seem to be working with ASCII only. Your encoding is platform dependent which may mean you're using UTF-8, which is possibly slow. Switching to a simpler encoding could help.
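
One way to pin the encoding explicitly, assuming the data really is ASCII only. Latin1Reader is a hypothetical name; the buffer size is the one from the question. ISO-8859-1 maps each byte to exactly one char and agrees with ASCII for bytes 0 to 127. Whether this is measurably faster than UTF-8 depends on the JVM, so it is worth measuring.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: opens a reader with a fixed single-byte charset
// instead of relying on the platform default.
final class Latin1Reader {
    static BufferedReader open(InputStream in) {
        return new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1), 409600);
    }
}
```

For a file, the stream could come from new FileInputStream(csv_path).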

Working with byte[] instead of String would most probably help. Bytes are half as big as chars and there's no conversion needed when reading a file. All the operations you do can be done with bytes equally easily.
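
As an illustration of the byte-level approach, the sketch below copies one field of a line straight from a byte[] to an OutputStream without creating any String. ByteFields and writeField are hypothetical names, and the code assumes the caller has already located the line as the range buf[from, to), excluding the line terminator.

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical helper working on raw bytes instead of Strings.
final class ByteFields {
    // Writes the field with the given zero-based index of the line buf[from, to)
    // to out. Returns false if the line has fewer fields than index + 1.
    static boolean writeField(byte[] buf, int from, int to, int index, OutputStream out)
            throws IOException {
        int start = from;
        for (int i = from; i <= to; i++) {
            // A field ends at a comma or at the end of the line.
            if (i == to || buf[i] == ',') {
                if (index == 0) {
                    out.write(buf, start, i - start);
                    return true;
                }
                index--;
                start = i + 1;
            }
        }
        return false;
    }
}
```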

One thing I have noticed is that adding a new line to the StringBuilder initially takes only a few ms (5, 6, 10), but the time rises as the run goes on, to 100-200 ms, and I suspect it will keep growing, so this is probably the bottleneck.

That's resizing, which could be sped up by using the suggested ArrayList, as the amount of data to be copied is much lower. Writing the data out when the buffer gets big would do as well.
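
Writing the data out when the buffer gets big could be sketched like this. FlushingBuffer is a hypothetical class and the 64 Ki threshold is an arbitrary assumption; the point is only that the StringBuilder is emptied regularly, so it never needs to grow.

```java
import java.io.IOException;
import java.io.Writer;

// Hypothetical helper: a bounded StringBuilder that is written out once it passes a threshold.
final class FlushingBuffer {
    private static final int FLUSH_THRESHOLD = 1 << 16; // arbitrary choice
    private final StringBuilder sb = new StringBuilder(FLUSH_THRESHOLD + 1024);
    private final Writer out;

    FlushingBuffer(Writer out) {
        this.out = out;
    }

    void appendLine(CharSequence line) throws IOException {
        sb.append(line).append('\n');
        if (sb.length() >= FLUSH_THRESHOLD) {
            flush();
        }
    }

    // Must also be called once after the last line.
    void flush() throws IOException {
        out.append(sb);
        sb.setLength(0); // empties the builder but keeps its capacity
    }
}
```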

I have measured the time needed to get the name from splitLine and it is negligible, 0 ms; the same goes for the calculateHash method, 0 ms.

Never use currentTimeMillis for this as nanoTime is strictly better. Use a profiler. The problem with a profiler is that it changes what it should measure. As a poor man's profiler, you can compute the sum of all the times spent inside of the suspect method and compare it with the total time.
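
The poor man's profiler could be sketched as below. PoorMansProfiler is a hypothetical class; it accumulates System.nanoTime differences around the suspect code so that the sum can be compared with the total run time.

```java
// Hypothetical helper: sums the time spent in a suspect piece of code.
final class PoorMansProfiler {
    private long suspectNanos;
    private long calls;

    // Runs the suspect code and adds its elapsed time to the running total.
    void time(Runnable suspect) {
        long t0 = System.nanoTime();
        try {
            suspect.run();
        } finally {
            suspectNanos += System.nanoTime() - t0;
            calls++;
        }
    }

    long totalNanos() {
        return suspectNanos;
    }

    long calls() {
        return calls;
    }

    // Fraction of the whole run spent in the suspect code.
    double shareOf(long totalRunNanos) {
        return totalRunNanos == 0 ? 0.0 : (double) suspectNanos / totalRunNanos;
    }
}
```

A single call that reports 0 ms can still add up to most of the run time over a million lines, which is why the sum is more informative than individual measurements.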

What's the CPU load and what does GC do when running the program?
