JohnDoe

Reputation: 89

What should I choose to read from a large file?

Could you tell me, please: what should I choose if I need to read from a very big (~1 GB) .txt file that contains unformatted data (mostly String text) in UTF-8? Scanner, BufferedReader, or maybe something else even better (perhaps from NIO or a third-party library)?

Upvotes: 4

Views: 95

Answers (2)

Stephen C

Reputation: 718758

It depends on what you are trying to do with the file.

For example, ask yourself:

  • do I need to tokenize it; i.e. treat it as a stream of "words" or "symbols"?
  • do I need to split it into lines and process it a line at a time?
  • do I need to load the entire file into memory? As a big array of characters, lines, tokens?

Once you have figured out that side of things, one of the alternatives you are considering for reading the file is likely to come out as a better match than the others.

(And we certainly can't give you sound / balanced advice on the best way to read the data if we don't understand what you are intending to do with it.)
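To make the three styles above concrete, here is a minimal sketch of how each maps onto a different standard API (the class name and the args[0] file path are placeholders, not anything from the question):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.Scanner;

    public class ReadingStyles {
        public static void main(String[] args) throws IOException {
            Path file = Paths.get(args[0]);

            // 1. Tokenizing: Scanner yields a stream of whitespace-separated "words".
            try (Scanner sc = new Scanner(file, "UTF-8")) {
                while (sc.hasNext()) {
                    String word = sc.next();
                    // process word
                }
            }

            // 2. Line at a time: BufferedReader streams the file; memory use stays flat.
            try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                String line;
                while ((line = br.readLine()) != null) {
                    // process line
                }
            }

            // 3. Whole file in memory: only sensible if the heap can hold ~1 GB of text.
            List<String> allLines = Files.readAllLines(file, StandardCharsets.UTF_8);
        }
    }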


My advice is to think about how you are going to process the data before you spend your time on efficiency concerns. There is a good chance that the choice of technique / API for reading the file won't be what limits your application's overall performance.

Upvotes: 3

Domi

Reputation: 24508

The size of the file does not matter for correctness (as long as you have enough RAM to hold whatever intermediate data you keep), but it does matter in terms of performance. This website explains how to read UTF-8 in Java. It uses an InputStreamReader:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    // try-with-resources closes the reader even if an exception is thrown
    try (BufferedReader fin = new BufferedReader(
            new InputStreamReader(
                    new FileInputStream(args[0]), StandardCharsets.UTF_8))) {

        String line;
        while ((line = fin.readLine()) != null) {
            // do something with line
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

Note that he reads line by line. For large files, I/O performance matters, so you might instead want to read the data in chunks of 4k or 8k bytes. Note though that this might break up characters: since a UTF-8 character can be one to four bytes long, there is no way of telling in advance whether a character ends exactly on a chunk boundary.

In that case, you either want to treat the text as raw bytes until you have finished reading, or you must inspect the end of each chunk to decide whether its trailing bytes belong to an incomplete character that has to be carried over and prepended to the next chunk before decoding.
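For the chunked approach, java.nio's CharsetDecoder does that boundary bookkeeping for you: fed raw byte chunks, it buffers an incomplete trailing sequence until the next chunk arrives. A sketch under that assumption (the 8k buffer size and the REPLACE error policy are illustrative choices, not something prescribed here):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
    ByteBuffer bytes = ByteBuffer.allocate(8192);   // 8k chunks
    CharBuffer chars = CharBuffer.allocate(8192);   // UTF-8 never yields more chars than bytes

    try (FileChannel in = new FileInputStream(args[0]).getChannel()) {
        boolean eof = false;
        while (!eof) {
            eof = (in.read(bytes) == -1);
            bytes.flip();
            // Passing eof as "endOfInput": while it is false, a partial character
            // at the end of the chunk is left in the buffer rather than treated
            // as a decoding error.
            decoder.decode(bytes, chars, eof);
            chars.flip();
            // do something with chars (one decoded chunk of text)
            chars.clear();
            bytes.compact();   // carry undecoded trailing bytes into the next read
        }
        decoder.flush(chars);
        chars.flip();
        // handle any final decoder output
    } catch (IOException e) {
        e.printStackTrace();
    }

Worth knowing: a BufferedReader over an InputStreamReader (as above) already does much the same chunked reading and decoding internally, so manual chunking is mainly worth it if profiling shows the reader itself is the bottleneck.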

Upvotes: 2
