Streetboy

Reputation: 4401

Reading big files and performing some operations in Java

First of all, let me explain what I need to do. I need to read a file (anywhere from 1 byte to 2 GB in size); 2 GB is the maximum because I use a MappedByteBuffer for fast reading. Later I may try to read the file in chunks so that files of arbitrary size can be handled (a rough sketch of that idea follows my code below).

When I read the file, I convert its bytes (using ASCII encoding) to chars, which I append to a StringBuilder; I then put this StringBuilder into an ArrayList.

However, I also need to do the following:

  1. The user can specify a blockSize, which is the number of chars I have to read into the StringBuilder (basically the number of file bytes converted to chars).

  2. Once I have collected blockSize chars, I create a copy of the StringBuilder and put it into the ArrayList.

All steps are performed for every char read. The problem is the StringBuilder: if the file is big (> 500 MB), I get an OutOfMemoryError:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
    at java.lang.StringBuilder.<init>(StringBuilder.java:80)
    at java.lang.StringBuilder.<init>(StringBuilder.java:106)
    at borrows.wheeler.ReadFile.readFile(ReadFile.java:43)
Java Result: 1 

I am posting my code below; maybe someone can suggest improvements or alternatives.

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;

public class ReadFile {

    // matrix block size
    public int blockSize = 100;

    public int charCounter = 0;

    public ArrayList readFile(File file) throws FileNotFoundException, IOException {

        FileChannel fc = new FileInputStream(file).getChannel();
        MappedByteBuffer mbb = fc.map(FileChannel.MapMode.READ_ONLY, 0, (int) fc.size());

        ArrayList characters = new ArrayList();
        int counter = 0;
        StringBuilder sb = new StringBuilder(); // blockSize - 1

        while (mbb.hasRemaining()) {

            char charAscii = (char) mbb.get();

            counter++;
            charCounter++;

            if (counter == blockSize) {
                sb.append(charAscii);
                characters.add(new StringBuilder(sb)); // copy the current block
                sb.delete(0, sb.length());
                counter = 0;
            } else {
                sb.append(charAscii);
            }

            if (!mbb.hasRemaining()) {
                characters.add(sb); // add the final, partial block
            }
        }
        fc.close();
        return characters;
    }
}
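Here is a rough, untested sketch of the chunked idea I mentioned: read the file through a fixed-size ByteBuffer instead of mapping it all at once, so the 2 GB limit disappears. The class name and chunk size are just placeholders.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChunkedReader {

    // Reads the file through a fixed-size buffer instead of mapping it whole.
    public void readInChunks(String path, int chunkSize) throws IOException {
        FileChannel fc = FileChannel.open(Paths.get(path), StandardOpenOption.READ);
        try {
            ByteBuffer buf = ByteBuffer.allocate(chunkSize);
            while (fc.read(buf) != -1) {
                buf.flip(); // switch the buffer from filling to draining
                while (buf.hasRemaining()) {
                    char charAscii = (char) buf.get(); // same byte-to-char conversion as in readFile()
                    // ... process charAscii exactly as in readFile() ...
                }
                buf.clear(); // ready to be filled with the next chunk
            }
        } finally {
            fc.close();
        }
    }
}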

EDIT: I am doing the Burrows-Wheeler transform. I have to read the whole file and then, based on the block size, create as many matrices as needed. I believe the wiki explains it better than I can:

http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform

Upvotes: 2

Views: 1101

Answers (2)

user207421

Reputation: 310915

I use a MappedByteBuffer for fast reading. Later I may try to read the file in chunks so that files of arbitrary size can be handled.

When I read the file, I convert its bytes (using ASCII encoding) to chars, which I append to a StringBuilder; I then put this StringBuilder into an ArrayList.

This sounds more like a problem than a solution. I suggest that the file is already ASCII, or character data; that it could be read quite efficiently using a BufferedReader; and that it can be processed one line at a time.

So do that. You won't get even double the speed by using a MappedByteBuffer, and everything you're doing, including the MappedByteBuffer, is consuming memory on a truly heroic scale.

If the file isn't such that it can be processed line by line, or record by record, there is something badly wrong upstream.
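A minimal sketch of that approach (the per-line handling is up to you):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class LineReader {

    // Reads character data one line at a time; only the current line
    // is held in memory, so the file size no longer matters.
    public static void readLines(File file) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(file));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // process the line here, e.g. feed it to the transform
            }
        } finally {
            reader.close();
        }
    }
}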

Upvotes: 1

DNA

Reputation: 42607

If you load large files, it's not entirely surprising that you run out of memory.

How much memory do you have? Are you on a 64-bit system with 64-bit Java? How much heap memory have you allocated (e.g. using the -Xmx setting)?

Bear in mind that you will need at least twice as much memory as the file size, because Java uses UTF-16 internally, which takes at least 2 bytes per character, while your input is one byte per character. So to load a 2 GB file you will need at least 4 GB of heap just for storing this text data.
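For example, to run with a 4 GB heap (the main class name here is just a placeholder):

java -Xmx4g borrows.wheeler.Main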

Also, you need to sort out the logic in your code: you do the same sb.append(charAscii) in both the if and the else, and you test !mbb.hasRemaining() on every iteration of a while (mbb.hasRemaining()) loop.

As I asked in your previous question: do you need to store StringBuilders, or would the resulting Strings be OK? Storing Strings would save space, because a StringBuilder allocates memory in big chunks (I think it roughly doubles its capacity every time it runs out of space!) and so may waste a lot.

If you do have to use StringBuilders, then pre-sizing them to blockSize would make the code more memory-efficient (and faster). A sketch combining these fixes follows.
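Something along these lines, keeping your block logic but with the duplicated append removed, the end-of-input check moved out of the loop, the builder pre-sized, and Strings stored instead of StringBuilders (mbb and blockSize as in your code):

ArrayList<String> blocks = new ArrayList<String>();
StringBuilder sb = new StringBuilder(blockSize); // pre-sized, so no internal resizing

while (mbb.hasRemaining()) {
    sb.append((char) mbb.get());   // append once, not in both branches
    if (sb.length() == blockSize) {
        blocks.add(sb.toString()); // store the String, not the StringBuilder
        sb.setLength(0);           // reuse the builder for the next block
    }
}
if (sb.length() > 0) {             // flush the final partial block once, after the loop
    blocks.add(sb.toString());
}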

Upvotes: 1
