Reputation: 55
One is sometimes faced with the task of parsing data stored in files on the local system. A significant dilemma is whether to load and parse all of the file data at the beginning of the program run, or access the file throughout the run and read data on-demand (assuming the file is sorted, so search is performed in constant time).
When it comes to small data sets, the first approach seems favorable, but with larger ones the threat of clogging up the heap increases.
What are some general guidelines one can use in such scenarios?
Upvotes: 0
Views: 703
Reputation: 133995
That depends entirely on what your program needs to do. The general advice is to keep only as much data in memory as is necessary. For example, consider a simple program that reads each record from a file of transactions, and then reports the total number of transactions and the total dollar amount:
count = 0
dollars = 0
while not end of file
read record
parse record
increment count
add transaction amount to dollars
end
output count and dollars
Here, you clearly need to have only one transaction record in memory at a time. So you read a record, process it, and discard it. It makes no sense to load all of the records into a list or other data structure, and then iterate over the list to get the count and total dollar amount.
In some cases you do need multiple records, perhaps all of them, in memory. In those cases, all you do is re-structure the program a little bit. You keep the reading loop, but have it add records to a list. Then afterwards you can process the list:
list = []
while not end of file
read record
parse record
add record to list
end
process list
output results
It makes no sense to load the entire file into a list, and then scan the list sequentially to obtain count and dollar amount. Not only is that a waste of memory, it makes the program more complex, uses memory to no gain, will be slower, and will fail with large data sets. The "memory vs performance" tradeoff doesn't always apply. Often, as in this case, using more memory makes the program slower.
I generally find it a good practice to structure my solutions so that I keep as little data in memory as is practical. If the solution is simpler with sorted data, for example, I'll make sure that the input is sorted before I run the program.
That's the general advice. Without specific examples from you, it's hard to say what approach would be preferred.
Upvotes: 0
Reputation: 7071
That's the standard tradeoff in programming - memory vs performance, Space–time tradeoff etc. There is no "right" answer to that question. It depends on the memory you have, speed you need, size of files, how often you query them etc.
In your specific case and since it seems like a one time job (if you are able to read it in the beginning) then it probably won't matter that much ;)
Upvotes: 1