Reputation: 369
I'm reading a large tsv file (~40G) and trying to prune it by reading it line by line and printing only certain lines to a new file. However, I keep getting the following exception:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2894)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:532)
at java.lang.StringBuffer.append(StringBuffer.java:323)
at java.io.BufferedReader.readLine(BufferedReader.java:362)
at java.io.BufferedReader.readLine(BufferedReader.java:379)
Below is the main part of the code. I specified the buffer size to be 8192 just in case. Doesn't Java clear the buffer once the buffer size limit is reached? I don't see what may cause the large memory usage here. I tried increasing the heap size, but it didn't make any difference (machine with 4 GB of RAM). I also tried flushing the output file every X lines, but it didn't help either. I'm thinking maybe I need to make explicit calls to the GC, but that doesn't sound right.
Any thoughts? Thanks a lot. BTW - I know I should call trim() only once, store it, and then use it.
Set<String> set = new HashSet<String>();
set.add("A-B");
...
...
static public void main(String[] args) throws Exception
{
    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile), "UTF-8"), 8192);
    PrintStream output = new PrintStream(outputFile, "UTF-8");
    String line = reader.readLine();
    while (line != null) {
        String[] fields = line.split("\t");
        if (set.contains(fields[0].trim() + "-" + fields[1].trim()))
            output.println(fields[0].trim() + "-" + fields[1].trim());
        line = reader.readLine();
    }
    output.close();
}
Upvotes: 7
Views: 18025
Reputation: 13041
One possibility is that you are running out of heap space during a garbage collection. The Hotspot JVM uses a parallel collector by default, which means that your application can possibly allocate objects faster than the collector can reclaim them. I have been able to cause an OutOfMemoryError with supposedly only 10K live (small) objects, by rapidly allocating and discarding.
You can try instead using the old (pre-1.5) serial collector with the option -XX:+UseSerialGC. There are several other "extended" options that you can use to tune collection.
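For example (the class name and heap size below are placeholders, not something from the original post), launching the program with

    java -XX:+UseSerialGC -Xmx2g TsvPruner

runs it with the serial collector and an explicit 2 GB heap, which makes it easier to see whether the collector choice is actually the problem.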
Upvotes: 1
Reputation: 719576
I have 3 theories:
1. The input file is not UTF-8 but some indeterminate binary format that results in extremely long lines when read as UTF-8.
2. The file contains some extremely long "lines" ... or no line breaks at all.
3. Something else is happening in code that you are not showing us; e.g. you are adding new elements to set.
To help diagnose this, you could use od (on UNIX / LINUX) to confirm that the input file really contains valid line terminators; i.e. CR, NL, or CR NL.
For the record, your slightly suboptimal use of trim will have no bearing on this issue.
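If od is not to hand, a quick self-contained Java check along these lines (the class name and the 8 MB sample size are just illustrative) would confirm the same thing: whether the start of the file contains any CR or LF bytes at all.

import java.io.FileInputStream;
import java.io.IOException;

public class LineTerminatorCheck {
    public static void main(String[] args) throws IOException {
        // Read a sample from the start of the file and count CR / LF bytes.
        byte[] buf = new byte[8 * 1024 * 1024];
        FileInputStream in = new FileInputStream(args[0]);
        int read = in.read(buf);
        in.close();
        long cr = 0, lf = 0;
        for (int i = 0; i < read; i++) {
            if (buf[i] == '\r') cr++;
            else if (buf[i] == '\n') lf++;
        }
        System.out.println("bytes sampled=" + read + ", CR=" + cr + ", LF=" + lf);
        // Zero for both counts means readLine() will try to buffer the
        // whole file as a single "line".
    }
}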
Upvotes: 2
Reputation: 18026
You might want to try moving the String[] fields declaration out of the loop, since you are creating a new array on every iteration. You can just reuse the old one, right?
Upvotes: -1
Reputation: 7832
Are you certain the "lines" in the file are separated by newlines?
Upvotes: 3
Reputation: 667
Most likely, what's going on is that the file does not have line terminators, so the reader just keeps growing its StringBuffer unbounded until it runs out of memory.
The solution would be to read a fixed number of bytes at a time, using the 'read' method of the reader, and then look for newlines (or other parsing tokens) within the smaller buffer(s).
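A rough sketch of that idea (the class name, the 1M-character cap, and handleLine are placeholders, not from the original code): read fixed-size chunks with Reader.read(char[]), assemble lines yourself, and discard anything that grows past a sanity limit instead of letting it fill the heap.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

public class ChunkedLineScanner {
    // Lines longer than this are treated as bogus and skipped instead of
    // being buffered in full.
    private static final int MAX_LINE = 1 << 20; // 1M chars

    public static void main(String[] args) throws IOException {
        Reader reader = new InputStreamReader(new FileInputStream(args[0]), "UTF-8");
        char[] buf = new char[8192];
        StringBuilder line = new StringBuilder();
        boolean skipping = false; // true while discarding an over-long "line"
        int n;
        while ((n = reader.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                char c = buf[i];
                if (c == '\n') {
                    if (!skipping) {
                        handleLine(line.toString());
                    }
                    line.setLength(0);
                    skipping = false;
                } else if (c != '\r') {
                    if (skipping) {
                        // inside an over-long line: keep discarding
                    } else if (line.length() < MAX_LINE) {
                        line.append(c);
                    } else {
                        // Cap exceeded: drop what we have and skip to the next newline.
                        line.setLength(0);
                        skipping = true;
                    }
                }
            }
        }
        if (!skipping && line.length() > 0) {
            handleLine(line.toString()); // last line without a trailing newline
        }
        reader.close();
    }

    private static void handleLine(String line) {
        // Placeholder for the original filtering: split on '\t', check the
        // set, and write matching keys to the output file.
        System.out.println(line);
    }
}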
Upvotes: 17