Reputation: 2248
I have a large data-set from which I derive a hierarchical set of summaries, at different levels of coarseness. I want to cache these summaries in a file on disk, with each summary retrievable from within the file via its offset. The initial summary is derived by taking smallish chunks (c. 256 bytes) from the initial data-set and extracting the maximum value from each chunk. Subsequent summaries are then derived by taking the max of each pair of values in the preceding summary. The following (rudimentary) illustration will hopefully clarify:
251 18 5 91 11 17 54 16 9 31 201 148 173 214 66 43 ;;Initial data-set (chunked)
251 54 201 214 ;;Summary 0
251 214 ;;Summary 1
251 ;;Summary 2
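To make the derivation concrete, here is a minimal in-memory Python sketch (a chunk size of 4 values just to match the illustration; the real chunks would be c. 256 bytes, and this plainly doesn't scale to the sizes below):

def summary0(data, chunk_size):
    """Summary 0: the max of each fixed-size chunk of the raw data."""
    return [max(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]

def next_summary(prev):
    """Each later summary: the max of each pair of values in the one before."""
    return [max(prev[i:i + 2]) for i in range(0, len(prev), 2)]

data = [251, 18, 5, 91, 11, 17, 54, 16, 9, 31, 201, 148, 173, 214, 66, 43]
levels = [summary0(data, 4)]
while len(levels[-1]) > 1:
    levels.append(next_summary(levels[-1]))
# levels == [[251, 54, 201, 214], [251, 214], [251]]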
What I'm trying to implement is a means of deriving (and then caching) these summaries that scales to large data-sets, for example on the order of 4GB. Speed is not particularly an issue, but space is: for data-sets of that size even the summaries might be too large to process in memory. I have been experimenting with a number of approaches:
The naive approach would be to simply write out each layer in full, then read it back in to compute the next layer (a rough streaming sketch of this appears below these approaches). This is clearly the simplest way of doing it, but it doesn't seem the most elegant or efficient. Memory-mapping might offer some improvement, but it might also mean that I'd need to preallocate the file beforehand.
Calculate each layer in chunks - calculate a chunk of the first layer, then of the second, then of the third, and so on, finally writing the chunks to the file at their appropriate offsets and restarting the process. The problem is that since each chunk is half the size of the chunk it is calculated from, the chunk-size would reach zero before all the layers had been calculated.
Use a separate file for each summary.
Use some kind of tree-based approach (the above illustration - if turned on its head - resembles a heap). Perhaps each node in the tree might represent, say, a chunk of 1024 bytes within each layer. A parent node would have two children representing consecutive chunks within the preceding layer, and its contents would be calculated from these children. Once this has been done the child nodes could simply be flushed to disk. I suspect that this process could be done entirely in memory (though I don't know what its complexity might be).
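To make the first approach concrete, here is a rough Python sketch of what I have in mind: a single cache file with each summary written back to back at a known offset, computed one bounded batch at a time. It assumes one byte per value, and CHUNK, BATCH, reduce_stream and build_cache are illustrative names only:

import os

CHUNK = 256          # raw-data values folded into one Summary-0 value
BATCH = 1 << 20      # values held in memory at a time (a multiple of CHUNK)

def reduce_stream(src, dst, n_values, width):
    """Read n_values one-byte values from src in batches and write the max of
    every `width` consecutive values to dst; return how many were written."""
    written = 0
    remaining = n_values
    while remaining > 0:
        buf = src.read(min(BATCH, remaining))
        if not buf:
            break
        remaining -= len(buf)
        for i in range(0, len(buf), width):
            dst.write(bytes([max(buf[i:i + width])]))
            written += 1
    return written

def build_cache(data_path, cache_path):
    """Write Summary 0, 1, 2, ... back to back in one file; return a list of
    (offset, n_values) pairs giving where each summary lives in that file."""
    levels = []
    with open(data_path, 'rb') as raw, open(cache_path, 'w+b') as cache:
        n = reduce_stream(raw, cache, os.path.getsize(data_path), CHUNK)
        levels.append((0, n))
        while n > 1:
            cache.flush()                  # make the previous level readable
            offset = cache.tell()
            with open(cache_path, 'rb') as prev:
                prev.seek(levels[-1][0])
                n = reduce_stream(prev, cache, levels[-1][1], 2)
            levels.append((offset, n))
    return levels

The returned (offset, n_values) pairs are what I would persist alongside the file, so that each summary remains retrievable via its offset.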
Thoughts/observations most welcome.
Christopher
Upvotes: 2
Views: 68
Reputation: 2248
OK, so after a bit more research I eventually went with a B-Tree, with the top few levels cached in main memory. Works for now.
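For anyone curious, the gist of the caching side is something like the sketch below - not my actual B-Tree code, just the "top few levels in memory" idea applied to the single-file layout with per-level offsets described in the question (SummaryCache, value and the level table are illustrative names only):

class SummaryCache:
    """Look up a value from any summary level; the topmost (smallest) levels
    are held in memory, the rest are read from the cache file by offset."""
    def __init__(self, cache_path, levels, cached_levels=3):
        self.file = open(cache_path, 'rb')
        self.levels = levels                      # [(offset, n_values), ...]
        self.mem = {}
        for i in range(max(0, len(levels) - cached_levels), len(levels)):
            offset, n = levels[i]
            self.file.seek(offset)
            self.mem[i] = self.file.read(n)       # the coarse levels are tiny

    def value(self, level, index):
        if level in self.mem:
            return self.mem[level][index]
        offset, _ = self.levels[level]
        self.file.seek(offset + index)
        return self.file.read(1)[0]

Since each level is half the size of the one below it, keeping the upper levels resident costs next to nothing.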
Chris
Upvotes: 1