Reputation: 438

Reducing memory churn when processing large data set

Java has a tendency to create a large number objects that needs to be garbage collected when processing large data set. This happens fairly frequently when streaming a amounts of data from the database, creating reports, etc. Is there a strategy to reduce the memory churn.

In this example, the object based version spends significant amount of times (2+ seconds) generating objects and performing garbage collection whereas the boolean array version completes in a fraction of a section without any garbages collection whatsoever.

How do I reduce the memory churn (the need for large number of garbage collections) when processing large data sets?

java -verbose:gc -Xmx500M UniqChars
...
----------------
[GC 495441K->444241K(505600K), 0.0019288 secs] x 45 times
70000007
================
70000007



import java.util.HashSet;
import java.util.Set;
public class UniqChars {
    static String a=null;
    public static void main(String [] args) {
            //Generate data set
            StringBuffer sb=new StringBuffer("sfdisdf");
            for (int i =0; i< 10000000; i++) {
                    sb.append("sfdisdf");
            }
            a=sb.toString();
            sb=null;  //free sb
            System.out.println("----------------");
            compareAsSet();
            System.out.println("================");
            compareAsAry();
    }

    public static void compareAsSet() {
            Set<String> uniqSet = new HashSet<String>();
            int n=0;
            for(int i=0; i<a.length(); i++) {
                    String chr = a.substring(i,i);
                    uniqSet.add(chr);
                    n++;
            }
            System.out.println(n);
    }

    public static void compareAsAry() {
            boolean uniqSet[] = new boolean[65536];
            int n=0;
            for(int i=0; i<a.length(); i++) {
                    int chr = (int) a.charAt(i);
                    uniqSet[chr]=true;
                    n++;
            }
            System.out.println(n);
    }
}

Upvotes: 1

Answers (3)

Peter Lawrey

Reputation: 533510

For comparison, if you wrote this it would do the same thing.

public static void compareLength() {
    // All the loop does is count the length in a complex way.
    System.out.println(a.length());
}

// I assume you intended to write this.
public static void compareAsBitSet() {
    BitSet uniqSet = new BitSet();
    for(int i=0; i<a.length(); i++)
        uniqSet.set(a.charAt(i));
    System.out.println(uniqSet.size());
}

Note: the BitSet uses 1 bit per element, rather than 1 byte per element. It also expands as required so say you have ASCII text, the BitSet might use 128-bits or 16 bytes (plus 32-byte overhead) The boolean[] uses 64 KB which is much higher. Ironically, using a boolean[] can be faster as it involves less bit shifting and only the portion of the array used needs to be in memory.

As you can see, with either solution, you get a much more efficient result because you use a better algorithm for what needs to be done.

Upvotes: 1

chubbsondubs

Reputation: 38706

Well as pointed out by one of the comments it's your code, not Java at fault for memory churn. So let's see you've written this code that builds an insanely large String from a StringBuffer. Calls toString() on it. Then calls substring() on that insanely large string which is in a loop and creating new a.length() Strings. Then does some in place junk on an array that really will perform pretty damn fast since there is no object creation, but ultimately writes to true to the same 5-6 locations in a huge array. Waste much? So what did you think would happen? Ditch StringBuffer and use StringBuilder since it's not fully synchronized which will be a little faster.

Ok so here's where your algorithm is probably spending its time. See the StringBuffer is allocating an internal character array to store stuff in each time you call append(). When that character array fills entirely up, it has to allocate a larger character array, copy all that junk you just wrote to it into the new array, then append what you originally called it with. So your code is allocating filling up, allocating a bigger chunk, copying that junk to the new array, then repeating that process until it does that 1000000 times. You can speed that up by pre-allocating the character array for the StringBuffer. Roughly that's 10000000 * "sfdisdf".length(). That will keep Java from creating tons of memory that it just dumps over and over.

Next is the compareAsSet() mess. Your line String chr = a.substring(i,i); is creating NEW strings a.length() times. Well since you're doing a.substring(i,i) is only a character you could just charAt(i) then there's no allocating happen. There's also an option of CharSequence which doesn't create a new String with it's own character array but simply points to the original underlying char[] with an offset and length. String.subSequence()

You plug this same code in any other language and it'll suck there too. In fact I'd say far far worse. Just try this is C++ and watch it be significantly worse than Java should you allocate and deallocate this much. See Java memory allocation is way way way faster than C++ because everything in Java is allocated from a memory pool so creating objects is magnitudes faster. But, there are limits. Furthermore, Java compresses its memory should it become too fragmented, C++ doesn't. So as you allocate memory and dump it, just in the same way, you'll probably run the risk of fragmenting the memory in C++. That could mean your StringBuffer might run out of the ability to grow large enough to finish and would crash.

In fact that might also explain some of the performance issues with GC because it's having to make room more a continuous block big enough after lots of trash has been taken out. So Java is not only cleaning up the memory its also having to compress the memory address space so it can get a block big enough for your StringBuffer.

Anyway, I'm sure your just testing the tires, but testing with code like this isn't really smart because it'll never perform well because it's unrealistic memory allocation. You know the old adage Garbage In Garbage Out. And that's what you got Garbage.

Upvotes: 4

Dev

Reputation: 12196

In your example your two methods are doing very different things.

In compareAsSet() you are generating the same 4 Strings ("s", "d", "f" and "i") and calling String.hashCode() and String.equals(String) (HashSet does this when you try to add them) 70000007 times. What you end up with is a HashSet of size 4. While you are doing this you are allocating String objects each time String.substring(int, int) returns which will force a minor collection every time the 'new' generation of the garbage collector gets filled.

In compareAsAry() you've allocated a single array 65536 elements wide changed some values in it and and then it goes out of scope when the method returns. This is a single heap memory operation vs 70000007 done in compareAsSet. You do have a local int variable being changed 70000007 times but this happens in stack memory not in heap memory. This method does not really generate that much garbage in the heap compared to the other method (basically just the array).

Regarding churn your options are recycle objects or tuning the garbage collector.

Recycling is not really possible with Strings in general as they are immutable, though the VM may perform interning operations this only reduces total memory footprint not garbage churn. A solution targeted for the above scenario that recycles could be generated but the implementation would be brittle and inflexible.

Tuning the garbage collector so that the 'new' generation is larger could reduce the total number of collections that has to be performed during your method call and thus increase the throughput of the call, you could also just increase the heap size in general which would accomplish the same thing.

For futher reading on garbage collector tuning in Java 6 I recommend the Oracle white paper linked below.

http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html

Upvotes: 4

Reducing memory churn when processing large data set

Answers (3)

Related Questions