When is it better to load all data from file at initialization, as opposed to performing file lookup on-demand? (Java)

Question

One is sometimes faced with the task of parsing data stored in files on the local system. A significant dilemma is whether to load and parse all of the file data at the beginning of the program run, or to access the file throughout the run and read data on demand (assuming the file is sorted, so that a lookup can use binary search in logarithmic time instead of a full scan).

When it comes to small data sets, the first approach seems favorable, but with larger ones the threat of clogging up the heap increases.

What are some general guidelines one can use in such scenarios?

Jim Mischel · Accepted Answer

That depends entirely on what your program needs to do. The general advice is to keep only as much data in memory as is necessary. For example, consider a simple program that reads each record from a file of transactions, and then reports the total number of transactions and the total dollar amount:

count = 0
dollars = 0
while not end of file
    read record
    parse record
    increment count
    add transaction amount to dollars
end
output count and dollars

Here, you clearly need to have only one transaction record in memory at a time. So you read a record, process it, and discard it. It makes no sense to load all of the records into a list or other data structure, and then iterate over the list to get the count and total dollar amount.
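As a rough Java sketch of the loop above: the record layout (one transaction per line, `id,amount`) and the names `TransactionTotals`, `Summary`, and `summarize` are assumptions made for illustration, not anything given in the question. Only the two running totals survive each iteration; the parsed record is discarded as soon as it has been counted.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.math.BigDecimal;

// Hypothetical layout: one transaction per line, "id,amount".
// Malformed lines are not handled in this sketch.
class TransactionTotals {
    /** Holds the two running aggregates; no individual records are retained. */
    static final class Summary {
        final long count;
        final BigDecimal dollars;

        Summary(long count, BigDecimal dollars) {
            this.count = count;
            this.dollars = dollars;
        }
    }

    static Summary summarize(Reader source) {
        long count = 0;
        BigDecimal dollars = BigDecimal.ZERO;
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {       // read record
                if (line.trim().isEmpty()) continue;       // skip blank lines
                String[] fields = line.split(",");         // parse record
                count++;                                   // increment count
                dollars = dollars.add(new BigDecimal(fields[1].trim()));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new Summary(count, dollars);
    }

    public static void main(String[] args) {
        // In-memory input keeps the sketch self-contained.
        Summary s = summarize(new StringReader("t1,10.50\nt2,4.25\nt3,100.00\n"));
        System.out.println(s.count + " transactions, total " + s.dollars);
        // prints "3 transactions, total 114.75"
    }
}
```

For a real file, the same method could be handed the reader returned by `Files.newBufferedReader(path)`; memory use stays flat regardless of file size.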

In some cases you do need multiple records, perhaps all of them, in memory. In those cases, all you do is re-structure the program a little bit. You keep the reading loop, but have it add records to a list. Then afterwards you can process the list:

list = []
while not end of file
    read record
    parse record
    add record to list
end
process list
output results
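The list-building variant might look like this in Java. Again the `id,amount` layout and the names `TransactionLoader`, `Transaction`, `loadAll`, and `median` are hypothetical. The median is used as the "process list" step only because an exact median is one result that is awkward to compute one record at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical layout: one transaction per line, "id,amount".
class TransactionLoader {
    static final class Transaction {
        final String id;
        final BigDecimal amount;

        Transaction(String id, BigDecimal amount) {
            this.id = id;
            this.amount = amount;
        }
    }

    // Same reading loop as the streaming case, except each parsed record is kept.
    static List<Transaction> loadAll(Reader source) {
        List<Transaction> list = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) continue;
                String[] fields = line.split(",");
                list.add(new Transaction(fields[0].trim(), new BigDecimal(fields[1].trim())));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return list;
    }

    // Sorts a copy by amount and returns the middle element
    // (the upper of the two middle elements when the count is even).
    static Transaction median(List<Transaction> records) {
        if (records.isEmpty()) {
            throw new IllegalArgumentException("no records");
        }
        List<Transaction> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparing((Transaction t) -> t.amount));
        return sorted.get(sorted.size() / 2);
    }

    public static void main(String[] args) {
        List<Transaction> all = loadAll(new StringReader("t1,10.50\nt2,4.25\nt3,100.00\n"));
        Transaction m = median(all);
        System.out.println("median: " + m.id + " " + m.amount);
        // prints "median: t1 10.50"
    }
}
```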

For the first example, though, it makes no sense to load the entire file into a list and then scan the list sequentially to obtain the count and dollar amount. Doing so wastes memory, makes the program more complex, will typically be slower, and will fail once the data set no longer fits in the heap. The "memory vs performance" tradeoff doesn't always apply. Often, as with the count-and-total program, using more memory makes the program slower.

I generally find it a good practice to structure my solutions so that I keep as little data in memory as is practical. If the solution is simpler with sorted data, for example, I'll make sure that the input is sorted before I run the program.
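One way sorted input can keep memory use low is sketched below, as an illustration rather than anything taken from the answer: per-customer totals computed in a single pass. It assumes a hypothetical `customer,amount` layout that is already sorted by customer; `SortedGroupTotals` and `totalsPerCustomer` are made-up names. Because all lines for a customer are adjacent, only the current customer's running total is held at any time. With unsorted input, a map with one entry per customer would be needed instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.math.BigDecimal;
import java.util.function.BiConsumer;

// Assumes lines of the form "customer,amount", pre-sorted by customer.
class SortedGroupTotals {
    static void totalsPerCustomer(Reader source, BiConsumer<String, BigDecimal> sink) {
        try (BufferedReader in = new BufferedReader(source)) {
            String current = null;              // customer whose group is being summed
            BigDecimal total = BigDecimal.ZERO;
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) continue;
                String[] fields = line.split(",");
                String customer = fields[0].trim();
                if (!customer.equals(current)) {
                    // Customer changed: the previous group is complete, so emit it.
                    if (current != null) sink.accept(current, total);
                    current = customer;
                    total = BigDecimal.ZERO;
                }
                total = total.add(new BigDecimal(fields[1].trim()));
            }
            if (current != null) sink.accept(current, total);  // emit the last group
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String sorted = "alice,10.00\nalice,5.50\nbob,7.25\ncarol,1.00\ncarol,2.00\n";
        totalsPerCustomer(new StringReader(sorted), (c, t) -> System.out.println(c + " " + t));
        // prints "alice 15.50", "bob 7.25", "carol 3.00" on separate lines
    }
}
```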

That's the general advice. Without specific examples from you, it's hard to say what approach would be preferred.
