LeAdErQ

Reputation: 45

Java: reading a big file causes a Java heap space error

I have written this code:

try (BufferedReader file = new BufferedReader(new FileReader("C:\\Users\\User\\Desktop\\big50m.txt"))) {
    String line;
    StringTokenizer st;

    while ((line = file.readLine()) != null) {
        st = new StringTokenizer(line); // separate the integers on the file line
        while (st.hasMoreTokens()) {
            numbers.add(Integer.parseInt(st.nextToken())); // convert and add to the list of numbers
        }
    }
} catch (Exception e) {
    System.out.println("Can't read the file...");
}

The big50m.txt file has 50,000,000 integers, and I get this runtime error:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
    at java.lang.StringBuffer.append(StringBuffer.java:367)
    at java.io.BufferedReader.readLine(BufferedReader.java:370)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at unsortedfilesapp.UnsortedFilesApp.main(UnsortedFilesApp.java:37)
C:\Users\User\AppData\Local\NetBeans\Cache\8.2\executor-snippets\run.xml:53: Java returned: 1
BUILD FAILED (total time: 5 seconds)

I think the problem is the String variable named line. Can you tell me how to fix it? I use StringTokenizer because I want fast reading.

Upvotes: 0

Views: 1827

Answers (5)

wumpz

Reputation: 9131

Since all the numbers are on one line, the BufferedReader approach does not work or scale well: the complete line, and therefore effectively the whole file, is read into memory. The streaming approach (e.g. from @whbogado) is indeed the way to go.

StreamTokenizer tokenizer = new StreamTokenizer(new FileReader("bigfile.txt"));
tokenizer.parseNumbers(); // default behaviour
while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
    if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
        numbers.add((int)Math.round(tokenizer.nval));
    }
}

Since you write that you are still getting a heap space error, I assume streaming is no longer the problem. You are, however, storing all values in a List, and I think that is the problem now. You say in a comment that you do not know the actual count of numbers in advance, so you should avoid storing them in a list and apply some kind of streaming there as well.
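To illustrate what "streaming there as well" could look like, here is a minimal sketch of mine (not from the original answer) that aggregates the numbers on the fly, so no value is ever stored. It computes a sum; the same pattern works for min, max, count, etc. The `StringReader` stands in for the real `FileReader`:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class StreamingStats {

    // Sums every number from the reader without keeping any of them in memory.
    static long sum(Reader reader) throws IOException {
        StreamTokenizer tokenizer = new StreamTokenizer(reader); // number parsing is on by default
        long total = 0;
        while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
            if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                total += (long) tokenizer.nval;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In the real program this would be new FileReader("big50m.txt").
        System.out.println(sum(new StringReader("10 20 30"))); // prints 60
    }
}
```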

For anyone interested, here is my little test code (Java 8) that produces a test file of the needed size (USED_INT_VALUES). I limited it to 5,000,000 integers for now. As you can see when running it, memory usage increases steadily while reading through the file. The only place holding that much memory is the numbers List.

Be aware that initializing an ArrayList with an initial capacity only allocates the backing array, not the memory needed for the stored objects, in your case the Integers.

import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.StreamTokenizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.logging.Level;
import java.util.logging.Logger;

public class TestBigFiles {

    public static void main(String args[]) throws IOException {
        heapStatistics("program start");
        final int USED_INT_VALUES = 5000000;
        File tempFile = File.createTempFile("testdata_big_50m", ".txt");
        System.out.println("using file " + tempFile.getAbsolutePath());
        tempFile.deleteOnExit();

        Random rand = new Random();
        FileWriter writer = new FileWriter(tempFile);
        rand.ints(USED_INT_VALUES).forEach(i -> {
            try {
                writer.write(i + " ");
            } catch (IOException ex) {
                Logger.getLogger(TestBigFiles.class.getName()).log(Level.SEVERE, null, ex);
            }
        });
        writer.close();
        heapStatistics("large file generated - size=" + tempFile.length() + "Bytes");
        List<Integer> numbers = new ArrayList<>(USED_INT_VALUES);

        heapStatistics("large array allocated (to avoid array copy)");

        int c = 0;
        try (FileReader fileReader = new FileReader(tempFile)) {
            StreamTokenizer tokenizer = new StreamTokenizer(fileReader);

            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                    numbers.add((int) tokenizer.nval);
                    c++;
                    if (c % 100000 == 0) {
                        heapStatistics("within loop count " + c);
                    }
                }
            }
        }

        heapStatistics("large file parsed, number list size is " + numbers.size());
    }

    private static void heapStatistics(String message) {
        int MEGABYTE = 1024 * 1024;
        //clean up unused stuff
        System.gc();
        Runtime runtime = Runtime.getRuntime();
        System.out.println("##### " + message + " #####");

        System.out.println("Used Memory:" + (runtime.totalMemory() - runtime.freeMemory()) / MEGABYTE + "MB"
                + " Free Memory:" + runtime.freeMemory() / MEGABYTE + "MB"
                + " Total Memory:" + runtime.totalMemory() / MEGABYTE + "MB"
                + " Max Memory:" + runtime.maxMemory() / MEGABYTE + "MB");
    }
}
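If the values do have to be kept, one way to shrink the footprint the statistics above reveal is a primitive int[] instead of a List<Integer>, since that avoids one boxed Integer object per value. A sketch of mine (not part of the original answer), again with a StringReader standing in for the file:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.Arrays;

public class PrimitiveStorage {

    // Reads all numbers into a primitive array, growing it if the estimate was too low.
    static int[] readInts(Reader reader, int expectedCount) throws IOException {
        int[] values = new int[expectedCount];
        int count = 0;
        StreamTokenizer tokenizer = new StreamTokenizer(reader);
        while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
            if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                if (count == values.length) { // grow when the estimate was too small
                    values = Arrays.copyOf(values, Math.max(1, values.length * 2));
                }
                values[count++] = (int) tokenizer.nval;
            }
        }
        return Arrays.copyOf(values, count); // trim to the actual size
    }

    public static void main(String[] args) throws IOException {
        // In the real program this would be new FileReader(tempFile).
        System.out.println(Arrays.toString(readInts(new StringReader("7 8 9"), 2))); // prints [7, 8, 9]
    }
}
```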

Upvotes: 0

C Prasoon

Reputation: 61

Running the program with -Xmx2048m, the provided snippet worked (with one adjustment: numbers declared as List<Integer> numbers = new ArrayList<>(50000000);).

Upvotes: 0

SkateScout

Reputation: 870

Here is a version that minimizes memory usage: no byte-to-char conversion and no String operations. This version does not handle negative numbers, though.

public static void main(final String[] a) {
    final Set<Integer> number = new HashSet<>();
    int v = 0;
    boolean use = false;
    int c;
    // InputStream avoids char conversion
    try (InputStream s = new FileInputStream("C:\\Users\\User\\Desktop\\big50m.txt")) {
        // no allocation in the loop
        do {
            if ((c = s.read()) == -1) break;
            if (c >= '0' && c <= '9') { v = v * 10 + c - '0'; use = true; continue; }
            if (use) number.add(v);
            use = false;
            v = 0;
        } while (true);
        if (use) number.add(v);
    } catch (final Exception e) {
        System.out.println("Can't read the file...");
    }
}

Upvotes: 0

whbogado

Reputation: 947

The readLine() method reads the whole line at once, which eats up a lot of memory. This is highly inefficient and does not scale to arbitrarily big files.

You can use a StreamTokenizer

like this:

StreamTokenizer tokenizer = new StreamTokenizer(new FileReader("bigfile.txt"));
tokenizer.parseNumbers(); // default behaviour
while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
    if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
        numbers.add((int)Math.round(tokenizer.nval));
    }
}

I have not tested this code but it gives you the general idea.

Upvotes: 0

PeterMmm

Reputation: 24630

Create a BufferedReader from the file and read() char by char. Append digit chars to a String, then Integer.parseInt() it; skip any non-digit char and continue parsing at the next digit, and so on.
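A minimal sketch of that approach (mine, not tested against the original file; the StringReader stands in for the real FileReader):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class CharByCharParser {

    // Reads char by char, collects runs of digits, and parses each run.
    static List<Integer> parse(Reader source) throws IOException {
        List<Integer> numbers = new ArrayList<>();
        StringBuilder digits = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            int c;
            while ((c = reader.read()) != -1) {
                if (c >= '0' && c <= '9') {
                    digits.append((char) c);              // still inside a number
                } else if (digits.length() > 0) {
                    numbers.add(Integer.parseInt(digits.toString()));
                    digits.setLength(0);                  // reset for the next number
                }
            }
            if (digits.length() > 0) {                    // flush a trailing number
                numbers.add(Integer.parseInt(digits.toString()));
            }
        }
        return numbers;
    }

    public static void main(String[] args) throws IOException {
        // In the real program this would be new FileReader("big50m.txt").
        System.out.println(parse(new StringReader("12 345  6"))); // prints [12, 345, 6]
    }
}
```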

Upvotes: 1
