erotsppa

Reputation: 15061

Why does reading a file into memory take 4x the memory in Java?

I have the following code, which reads in a file, appends \r\n to the end of each line, and puts the result in a string buffer:

public InputStream getInputStream() throws Exception {
    StringBuffer holder = new StringBuffer();
    try {
        FileInputStream reader = new FileInputStream(inputPath);
        BufferedReader br = new BufferedReader(new InputStreamReader(reader));
        String strLine;
        // Read the file line by line
        boolean start = true;
        while ((strLine = br.readLine()) != null) {
            if (!start)
                holder.append("\r\n");

            holder.append(strLine);
            start = false;
        }
        // Close the input stream
        reader.close();
    } catch (Throwable e) { // this is where the heap error is caught, up to 2GB
        System.err.println("Error: " + e.getMessage());
    }

    return new StringBufferInputStream(holder.toString());
}

I tried reading in a 400MB file. I changed the max heap space to 2GB, yet it still throws an out-of-memory heap error. Any ideas?

Upvotes: 13

Views: 8465

Answers (9)

joeslice

Reputation: 3464

I also recommend checking out the Commons IO FileUtils class for this. Specifically: org.apache.commons.io.FileUtils#readFileToString. You can also specify the encoding if you know you are only using ASCII.
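
For instance, a minimal sketch (assuming commons-io is on the classpath; the file name is hypothetical):

import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;

public class ReadWholeFile {
    public static void main(String[] args) throws IOException {
        // Reads the entire file into a String using the given encoding.
        String contents = FileUtils.readFileToString(new File("input.txt"), "US-ASCII");
        System.out.println(contents.length());
    }
}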

Upvotes: 0

David

Reputation: 1251

Others have explained why you're running out of memory. As to how to solve this problem, I'd suggest writing a custom FilterInputStream subclass. This class would read one line at a time, append the "\r\n" characters and buffer the result. Once the line has been read by the consumer of your FilterInputStream, you'd read another line. This way you'd only ever have one line in memory at a time.
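
A minimal sketch of that idea (here as a plain InputStream wrapping a BufferedReader rather than a literal FilterInputStream subclass; the class name is made up):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;

public class LineAppendingInputStream extends InputStream {
    private final BufferedReader reader;
    private byte[] current = new byte[0]; // bytes of the line being served
    private int pos = 0;                  // read position within current
    private boolean first = true;

    public LineAppendingInputStream(BufferedReader reader) {
        this.reader = reader;
    }

    @Override
    public int read() throws IOException {
        while (pos >= current.length) {   // current line exhausted: fetch the next
            String line = reader.readLine();
            if (line == null) {
                return -1;                // end of the underlying file
            }
            // Prefix every line but the first with the separator, mirroring the
            // question's loop. getBytes() uses the platform charset, which is
            // fine for the single-byte/ASCII case discussed here.
            current = ((first ? "" : "\r\n") + line).getBytes();
            first = false;
            pos = 0;
        }
        return current[pos++] & 0xFF;
    }
}

Only the current line is ever held in memory; the consumer pulls the rest on demand.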

Upvotes: 1

Andreas Dolk

Reputation: 114817

It's the StringBuffer. The empty constructor creates a StringBuffer with an initial capacity of 16 characters. Now if you append something and the capacity is not sufficient, it allocates a new internal array and copies the old contents across.

So in fact, with each line appended, the StringBuffer has to create a copy of its complete internal array, which nearly doubles the required memory when appending the last line. Together with the UTF-16 representation, this results in the observed memory demand.

Edit

Michael is right when he says that the internal buffer is not incremented in small portions - it roughly doubles in size each time you need more memory. But still, in the worst case - say the buffer needs to expand capacity just with the very last append - it creates a new array twice the size of the current one, so for a moment you need roughly three times the amount of memory.
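
That growth is easy to observe (the exact capacities are implementation details; (oldCapacity + 1) * 2 is what the classic JDK code does):

public class BufferGrowth {
    public static void main(String[] args) {
        StringBuffer sb = new StringBuffer(); // default capacity: 16 chars
        System.out.println(sb.capacity());    // 16
        sb.append("0123456789ABCDEF");        // exactly fills the capacity
        sb.append('!');                       // one more char forces an expansion
        System.out.println(sb.capacity());    // 34, i.e. (16 + 1) * 2
    }
}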

Anyway, I've learned the lesson: StringBuffer (and Builder) may cause unexpected OutOfMemoryErrors, and I'll always initialize them with a size, at least when I have to store large Strings. Thanks for the question :)

Upvotes: 4

Peter Lawrey

Reputation: 533820

I would suggest you use the OS file cache instead of copying the data into Java memory via characters and back to bytes again. If you re-read the file as required (perhaps transforming it as you go), it will be faster and very likely simpler.

You need over 2 GB because 1-byte letters use a char (2 bytes) in memory, and when your StringBuffer resizes you need double that (to copy the old array to the larger new array). The new array is typically double the size, so you can need up to 6x the original file size. As if the performance weren't bad enough, you are using StringBuffer instead of StringBuilder, which synchronizes every call when it is clearly not needed. (This only slows you down; it uses the same amount of memory.)

Upvotes: 1

Yishai

Reputation: 91931

At the last insert into the StringBuffer, you need three times the memory allocated, because the StringBuffer always expands by (size + 1) * 2 (which is already double because of Unicode). So a 400MB file could require an allocation of 800MB * 3 == 2.4GB at the end of the inserts. It may be somewhat less; that depends on exactly when the expansion threshold is reached.

The suggestion to concatenate Strings rather than using a Buffer or Builder is in order here. There will be a lot of garbage collection and object creation (so it will be slow), but a much lower memory footprint.

[At Michael's prompting, I investigated this further; concat wouldn't help here, as it copies the char buffer, so while it wouldn't require triple the memory, it would require double at the end.]

You could continue to use the Buffer (or better yet, Builder in this case) if you know the maximum size of the file, initialize the Buffer's capacity on creation, and are sure this method will only be called from one thread at a time.
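
A minimal sketch of that pre-sizing (the headroom factor is a guess; assumes a single-byte encoding on disk):

import java.io.File;

public class PresizedRead {
    public static void main(String[] args) {
        File file = new File("input.txt"); // hypothetical path
        // One char per byte for a single-byte encoding, plus headroom for the
        // re-added "\r\n" (one extra char per line if the file originally used
        // bare "\n" endings). No append will ever trigger an array copy.
        StringBuilder holder = new StringBuilder((int) (file.length() * 5 / 4));
        // ... then append lines exactly as in the question's loop ...
    }
}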

But really, loading such a large file into memory all at once should only be done as a last resort.

Upvotes: 1

Michael Borgwardt

Reputation: 346476

You have a number of problems here:

  • Unicode: characters take twice as much space in memory as on disk (assuming a 1-byte encoding)
  • StringBuffer resizing: can double (permanently) and triple (temporarily) the occupied memory, though this is the worst case
  • StringBuffer.toString() temporarily doubles the occupied memory, since it makes a copy

All of these combined mean that you could temporarily require up to 8 times your file's size in RAM (2x for UTF-16, times 2x for the buffer's over-allocation, times 2x while toString() holds both the buffer and the copy), i.e. 3.2GB for a 400MB file. Even if your machine physically has that much RAM, it has to be running a 64-bit OS and JVM to actually get that much heap.

All in all, it's simply a horrible idea to keep such a huge String in memory - and it's totally unnecessary as well. Since your method returns an InputStream, all you really need is a FilterInputStream that adds the line breaks on the fly.

Upvotes: 12

Adamski

Reputation: 54725

It may be to do with how the StringBuffer resizes when it reaches capacity - this involves creating a new char[] double the size of the previous one and then copying the contents across into the new array. Together with the points already made about characters in Java being stored as 2 bytes, this will definitely add to your memory usage.

To resolve this, you could create a StringBuffer with sufficient capacity to begin with, given that you know the file size (and hence the approximate number of characters to read in). However, be warned that a large array allocation will also occur if you then attempt to convert this large StringBuffer into a String.

Another point: You should typically favour StringBuilder over StringBuffer as the operations on it are faster.

You could consider implementing your own "CharBuffer", using for example a LinkedList of char[] to avoid expensive array allocation / copy operations. You could make this class implement CharSequence and perhaps avoid converting to a String altogether. Another suggestion for a more compact representation: if you're reading in English text containing large numbers of repeated words, you could read and store each word, using String.intern() to significantly reduce storage.
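
A bare-bones sketch of that chunked buffer (the class name and chunk size are invented for illustration): appending only ever allocates a fresh fixed-size chunk, so existing data is never copied, and the whole thing can be passed around as a CharSequence:

import java.util.ArrayList;
import java.util.List;

public class ChunkedCharBuffer implements CharSequence {
    private static final int CHUNK_SIZE = 1 << 16; // 64K chars per chunk
    private final List<char[]> chunks = new ArrayList<char[]>();
    private int length = 0;

    public void append(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            if (length % CHUNK_SIZE == 0) {
                chunks.add(new char[CHUNK_SIZE]); // grow by one chunk, never copy
            }
            chunks.get(chunks.size() - 1)[length % CHUNK_SIZE] = s.charAt(i);
            length++;
        }
    }

    public int length() {
        return length;
    }

    public char charAt(int index) {
        return chunks.get(index / CHUNK_SIZE)[index % CHUNK_SIZE];
    }

    public CharSequence subSequence(int start, int end) {
        StringBuilder sb = new StringBuilder(end - start);
        for (int i = start; i < end; i++) {
            sb.append(charAt(i));
        }
        return sb;
    }
}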

Upvotes: 24

DaveR

Reputation: 9640

To begin with, Java strings are UTF-16 (i.e. 2 bytes per character), so assuming your input file is ASCII or a similar one-byte-per-character format, holder will be ~2x the size of the input data, plus the extra \r\n per line and any additional overhead. That's ~800MB straight away, assuming a very low storage overhead in StringBuffer.

I could also believe that the contents of your file are buffered twice - once at the I/O level and once in the BufferedReader.

However, to know for sure, it's probably best to look at what's actually on the heap - use a tool like HPROF to see exactly where your memory has gone.

In terms of solving this, I suggest you process a line at a time, writing out each line after you have added the line termination. That way your memory usage should be proportional to the length of a line, instead of the entire file.
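
A minimal sketch of that shape (hypothetical file names): each line is written out as soon as its terminator is re-added, so memory use is bounded by the longest line, not the file:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class LineCopy {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader("input.txt"));
        BufferedWriter out = new BufferedWriter(new FileWriter("output.txt"));
        try {
            String line;
            boolean first = true;
            while ((line = in.readLine()) != null) {
                if (!first) {
                    out.write("\r\n"); // re-add the separator between lines
                }
                out.write(line);
                first = false;
            }
        } finally {
            out.close();
            in.close();
        }
    }
}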

Upvotes: 13

Chris W. Rea

Reputation: 5501

It's an interesting question, but rather than stress over why Java is using so much memory, why not try a design that doesn't require your program to load the entire file into memory?

Upvotes: 12
