Reputation: 422
I'm trying to load a CSV file with a huge number of lines (>5 million), but it slows down massively when processing them all into an ArrayList of the individual values.
I've tried a few different variations of reading from and removing from the input list I loaded from the file, but it still ends up running out of heap space, even when I allocate 14GB to the process, while the file is only 2GB.
I know I need to remove values as I go so that I don't end up with duplicate references in memory, i.e. with both an ArrayList of lines and an ArrayList of the individual comma-separated values, but I have no idea how to do something like that.
Edit: For reference, in this particular situation, data should end up containing 16 * 5 million values.
If there's a more elegant solution, I'm all for it.
The intention when loading this file is to process it as a database, with the appropriate methods like select and select where, all handled by a sheet class. It worked just fine with my smaller sample file of 36k lines, but I guess it doesn't scale very well.
Current code:
// Load method to load it from file
private static CSV loadCSV(String filename, boolean absolute)
{
    String fullname = "";
    if (!absolute)
    {
        fullname = baseDirectory + filename;
        if (!Load.exists(fullname, false))
            return null;
    }
    else if (absolute)
    {
        fullname = filename;
        if (!Load.exists(fullname, false))
            return null;
    }

    ArrayList<String> output = new ArrayList<String>();
    AtomicInteger atomicInteger = new AtomicInteger(0);
    try (Stream<String> stream = Files.lines(Paths.get(fullname)))
    {
        stream.forEach(t -> {
            output.add(t);
            atomicInteger.getAndIncrement();
            if (atomicInteger.get() % 10000 == 0)
            {
                Log.log("Lines done " + output.size());
            }
        });
        CSV c = new CSV(output);
        return c;
    }
    catch (IOException e)
    {
        Log.log("Error reading file " + fullname, 3, "FileIO");
        e.printStackTrace();
    }
    return null;
}
// Process method inside CSV class
public CSV(List<String> output)
{
    Log.log("Inside csv " + output.size());
    ListIterator<String> iterator = output.listIterator();
    while (iterator.hasNext())
    {
        ArrayList<String> d = new ArrayList<String>(Arrays.asList(iterator.next().split(splitter, -1)));
        data.add(d);
        iterator.remove();
    }
}
Upvotes: 1
Views: 2142
Reputation: 9303
I think some key concepts are missing here:
You said the file size is 2GB. That does not mean that when you load that file's data into an ArrayList, the size in memory will also be 2GB. Why? Files usually store data using the UTF-8 character encoding, whereas the JVM internally stores String values using UTF-16. So, assuming your file contains only ASCII characters, each character occupies 1 byte in the filesystem but 2 bytes in memory. Assuming (for the sake of simplicity) that all String values are unique, there is also the space required to store the String references, which are 32 bits each (assuming a 64-bit system with compressed oops). How much is your heap (excluding other memory areas)? How much is your eden space and old space? I'll come back to this again shortly.
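As a rough back-of-the-envelope illustration (estimates only; the exact overheads vary with JVM version and settings, and newer JVMs can store ASCII strings more compactly): 5 million lines split into 16 values each is about 80 million String objects. On a 64-bit JVM with compressed oops, each String costs roughly 40 bytes of object and char[] header overhead before a single character of payload, so the headers alone approach 3GB; add the UTF-16 character data (roughly twice the 2GB of the file) plus the reference arrays of 5 million inner ArrayLists, and you are already far beyond the size of the file on disk.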
In your code, you don't specify an ArrayList size. This is a blunder in this case. Why? The JVM creates a small ArrayList. After some time it sees that you keep pumping in data, so it creates a bigger ArrayList and copies the data of the old ArrayList into the new one. This has deeper implications when you are dealing with such a huge volume of data: firstly, note that both the old and the new array (with millions of entries) are in memory simultaneously, occupying space; secondly, the data is copied from one array to another unnecessarily, not once or twice but repeatedly, every time the array runs out of space. What happens to the old array? It is discarded and needs to be garbage collected. So these repeated array copies and garbage collections slow the process down; the CPU is working really hard here. And what happens when your data no longer fits into the young generation (which is smaller than the heap)? You may need to observe the behaviour with something like JVisualVM.
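A minimal sketch of that fix (assuming you know, or can estimate, the number of lines up front; the 5-million figure here is just the estimate from the question, not a measured value):

// Pre-sizing the list allocates the backing array once, so it is never
// grown and copied while the file is being read.
int expectedLines = 5_000_000; // rough estimate for this particular file
ArrayList<String> output = new ArrayList<>(expectedLines);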
All in all, what I mean to say is that there are a good number of reasons why a 2GB file fills up your much larger heap and why your process performance is poor.
Upvotes: 1
Reputation: 52088
I would have a method that takes a line read from the file as a parameter, splits it into a list of strings and returns that list. I would then add that list to the CSV object inside the file-reading loop. That would mean only one large collection instead of two, and the read lines could be freed from memory sooner. Something like this:
CSV csv = new CSV();
try (Stream<String> stream = Files.lines(Paths.get(fullname))) {
    stream.forEach(t -> {
        List<String> splittedString = splitFileRow(t);
        csv.add(splittedString);
    });
} catch (IOException e) {
    e.printStackTrace();
}
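The snippet assumes a CSV class with a no-argument constructor and an add method, plus a splitFileRow helper. A possible sketch of that helper (the "," delimiter is an assumption; the question's code uses a splitter field instead):

// Hypothetical helper: splits one CSV line into its individual values.
private static List<String> splitFileRow(String line) {
    return Arrays.asList(line.split(",", -1));
}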
Upvotes: 0
Reputation: 1381
You should use a database, which provides the functionality your task requires (select, group by). Any database can efficiently read and aggregate 5 million rows. Don't try to do this with operations on an ArrayList; that only works well on small datasets.
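As one possible sketch (the embedded H2 database is my suggestion, not something this answer prescribes, and the file path is hypothetical; it needs the h2 jar on the classpath), H2 can query a CSV file directly with its built-in CSVREAD function, so SELECT ... WHERE and GROUP BY come for free:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class CsvAsDatabase {
    public static void main(String[] args) throws SQLException {
        // In-memory H2 database; CSVREAD reads and parses the file for us.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:csvdb");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT COUNT(*) FROM CSVREAD('/path/to/data.csv')")) {
            if (rs.next()) {
                System.out.println("Rows: " + rs.getLong(1));
            }
        }
    }
}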
Upvotes: 3
Reputation: 2108
Trying to solve this problem in pure Java can be overwhelming. I suggest using a processing engine like Apache Spark, which can process the file in a distributed way by increasing the level of parallelism. Apache Spark has specific APIs to load CSV files:
spark.read.format("csv").option("header", "true").load("../Downloads/*.csv")
You can transform it into an RDD or a DataFrame and perform operations on it. You can find more online.
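For the Java API, a minimal sketch might look like this (the spark-sql dependency, the local master, the file path and the column name are all assumptions for illustration):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkCsvExample {
    public static void main(String[] args) {
        // Local session just for illustration; on a cluster the master differs.
        SparkSession spark = SparkSession.builder()
                .appName("csv-example")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> df = spark.read()
                .format("csv")
                .option("header", "true")
                .load("/path/to/data.csv"); // hypothetical path

        // "select ... where ..." maps directly onto the DataFrame API.
        df.filter(df.col("someColumn").equalTo("someValue")).show();

        spark.stop();
    }
}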
Upvotes: 0