Reputation: 180
I have a program that reads a large list of sequences from a file and runs a calculation on every pair in that list. It then stores all of these calculations in a HashMap. About halfway through a run, I get a "GC overhead limit exceeded" error.
I realize this means the garbage collector is using up more than 98% of the computation time while recovering less than 2% of the heap. Here is the code I have:
ArrayList<String> c = loadSequences("file.txt"); // Loads 60-char DNA sequences
HashMap<DNAPair,Double> LSA = new HashMap<DNAPair,Double>();
for(int i = 0; i < c.size(); i++) {
    for(int j = i+1; j < c.size(); j++) {
        LSA.put(new DNAPair(c.get(i),c.get(j)), localSeqAlignmentSimilarity(c.get(i),c.get(j)));
    }
}
And here's the code for the actual method:
public static double localSeqAlignmentSimilarity(String s1, String s2) {
    s1 = " " + s1;
    s2 = " " + s2;
    int max = 0, h = 0, maxI = 0, maxJ = 0;
    int[][] score = new int[61][61];
    int[][] pointers = new int[61][61];
    for(int i = 1; i < s1.length(); i++) {
        pointers[i][0] = 2;
    }
    for(int i = 1; i < s2.length(); i++) {
        pointers[0][i] = 1;
    }
    boolean inGap = false;
    for(int i = 1; i < s1.length(); i++) {
        for(int j = 1; j < s2.length(); j++) {
            h = -99;
            if(score[i-1][j-1] + match(s1.charAt(i), s2.charAt(j)) > h) {
                h = score[i-1][j-1] + match(s1.charAt(i), s2.charAt(j));
                pointers[i][j] = 3;
                inGap = false;
            }
            if(!inGap) {
                if(score[i-1][j] + GAPPENALTY > h) {
                    h = score[i-1][j] + GAPPENALTY;
                    pointers[i][j] = 2;
                    inGap = true;
                }
                if(score[i][j-1] + GAPPENALTY > h) {
                    h = score[i][j-1] + GAPPENALTY;
                    pointers[i][j] = 1;
                    inGap = true;
                }
            } else {
                if(score[i-1][j] + GAPEXTENSION > h) {
                    h = score[i-1][j] + GAPEXTENSION;
                    pointers[i][j] = 2;
                    inGap = true;
                }
                if(score[i][j-1] + GAPEXTENSION > h) {
                    h = score[i][j-1] + GAPEXTENSION;
                    pointers[i][j] = 1;
                    inGap = true;
                }
            }
            if(0 > h) h = 0;
            score[i][j] = h;
            if(h >= max) {
                max = h;
                maxI = i;
                maxJ = j;
            }
        }
    }
    double matches = 0;
    String o1 = "", o2 = "";
    while(!(maxI == 0 && maxJ == 0)) {
        if(pointers[maxI][maxJ] == 3) {
            o1 += s1.charAt(maxI);
            o2 += s2.charAt(maxJ);
            maxI--;
            maxJ--;
        } else if(pointers[maxI][maxJ] == 2) {
            o1 += s1.charAt(maxI);
            o2 += "_";
            maxI--;
        } else if(pointers[maxI][maxJ] == 1) {
            o1 += "_";
            o2 += s2.charAt(maxJ);
            maxJ--;
        }
    }
    StringBuilder a = new StringBuilder(o1);
    StringBuilder b = new StringBuilder(o2);
    o1 = a.reverse().toString();
    o2 = b.reverse().toString();
    a.setLength(0);
    b.setLength(0);
    for(int i = 0; i < Math.min(o1.length(), o2.length()); i++) {
        if(o1.charAt(i) == o2.charAt(i)) matches++;
    }
    return matches / Math.min(o1.length(), o2.length());
}
I thought this was because of all the objects the method allocates on every call (the two int arrays, the StringBuilders, etc.), so I changed them all to static fields and cleared them every time (e.g. Arrays.fill(score, 0)) instead of creating new objects.
However, this didn't help at all and I still got the same error.
Could it be that the map storing all of the calculations is simply getting too big for Java to hold? I'm not getting an out-of-heap-space error, so it seems kind of strange.
I also passed a command-line argument to give more heap space to the JVM, but that didn't seem to help.
Any insight on this problem would be helpful. Thanks!
Upvotes: 1
Views: 561
Reputation: 8586
This is the problem, if c.size() is 73657 and the sequences are unique:
HashMap<DNAPair,Double> LSA = new HashMap<DNAPair,Double>();
for(int i = 0; i < c.size(); i++) {
    for(int j = i+1; j < c.size(); j++) {
        LSA.put(...);
    }
}
Assuming these are unique sequences, you're adding an entry to LSA for each pair. You mention you have ~70k sequences, so you are going to have roughly 70k * 70k / 2 ≈ 2.5 billion pairs, each of which will take at a MINIMUM 4 bytes to store, meaning you'd need 10+ GB allocated at an absolute minimum for this to be feasible. In practice it's far worse, since every entry also carries a DNAPair key, a boxed Double, and HashMap node overhead.
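A quick back-of-the-envelope check of that arithmetic (a sketch; the 4-bytes-per-entry figure is a deliberately optimistic lower bound, ignoring all per-object overhead):

```java
// Rough lower bound on the memory needed to hold every pair in a map.
public class PairMemoryEstimate {
    public static void main(String[] args) {
        long n = 73_657;                  // number of sequences
        long pairs = n * (n - 1) / 2;     // unordered pairs with i < j
        long minBytes = pairs * 4L;       // wildly optimistic: 4 bytes per entry
        System.out.println(pairs + " pairs");                    // 2712639996 pairs
        System.out.println("at least " + (minBytes >> 30) + " GiB"); // at least 10 GiB
    }
}
```

So even under absurdly generous assumptions the map cannot fit in a default-sized heap.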
Upvotes: 1
Reputation: 21815
Yes, it could indeed be that the amount of data is simply too big to hold in memory. I would start by actually profiling the program's memory usage while it is running, either with a tool such as JConsole or by querying the MemoryMXBean from within your program.
In case it's useful, I wrote the small Classmexer agent, which lets you query the actual memory usage of a Java object (and its sub-objects) from within your Java program.
Incidentally, it's usually not beneficial to try to 'fool' or pre-empt the JVM's memory management system by doing things like making objects static that really shouldn't naturally be static.
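As a minimal sketch of the MemoryMXBean approach (class and method names are mine; call the probe every few thousand iterations of your pair loop to watch the heap grow):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapProbe {
    // Prints current heap usage; call periodically from the main loop.
    public static void printHeapUsage() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = bean.getHeapMemoryUsage();
        // getMax() can be -1 if the max is undefined; for the heap it is
        // normally the -Xmx value.
        System.out.printf("heap used: %d MiB of %d MiB max%n",
                heap.getUsed() >> 20, heap.getMax() >> 20);
    }

    public static void main(String[] args) {
        printHeapUsage();
    }
}
```

If the used figure climbs steadily toward the max as pairs are added, that confirms the map itself is what's exhausting the heap.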
Upvotes: 0