Roman Kutlak

Reputation: 2784

Memory awareness and large data

I am currently working on a project that uses a lot of text (hundreds of MB to a few GB of text - the DBpedia datasets). To save space I map the strings to numbers and work with the strings only when I need to print something. To speed up the algorithms that work with the data I designed a Cache class that serves as a key-value cache. The problem is, of course, that when the program runs for a long time the cache becomes quite big.
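For reference, the string-to-number mapping is essentially string interning; a minimal sketch (the names are illustrative, not my actual code) looks roughly like this:

    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Interning table: each distinct string gets a small integer id,
    // and the id can be mapped back to the string for printing.
    class StringPool {
    public:
        std::uint32_t intern(const std::string& s) {
            auto it = ids_.find(s);
            if (it != ids_.end()) return it->second;
            std::uint32_t id = static_cast<std::uint32_t>(strings_.size());
            strings_.push_back(s);
            ids_.emplace(s, id);
            return id;
        }
        const std::string& lookup(std::uint32_t id) const { return strings_[id]; }
    private:
        std::vector<std::string> strings_;                    // id -> string
        std::unordered_map<std::string, std::uint32_t> ids_;  // string -> id
    };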

The way I manage it at the moment is to limit the cache to a particular number of entries. This works, but it is not great. A more flexible approach would be to have a memory limit shared across all caches and, when the limit is reached, disable caching or even empty some of the caches depending on their importance and size.

I am considering implementing a sizeB() method that would return the size of a cache in bytes, so that each instance could report how much memory it is using. But this, of course, does not solve the problem of when to stop caching... I would probably have to track all the memory usage manually. Perhaps some singleton CacheFactory where all caches are registered, and which empties them once the limit is reached?
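A rough sketch of that idea (all the names here are hypothetical) would be:

    #include <cstddef>
    #include <vector>

    // Every cache reports its approximate footprint and registers with
    // a global manager that can sum the totals and empty caches once a
    // limit is exceeded.
    class CacheBase {
    public:
        virtual ~CacheBase() = default;
        virtual std::size_t sizeB() const = 0;  // approximate size in bytes
        virtual void clear() = 0;
    };

    class CacheFactory {
    public:
        static CacheFactory& instance() {
            static CacheFactory f;  // the singleton mentioned above
            return f;
        }
        void registerCache(CacheBase* c) { caches_.push_back(c); }
        std::size_t totalB() const {
            std::size_t total = 0;
            for (const CacheBase* c : caches_) total += c->sizeB();
            return total;
        }
        void enforce(std::size_t limitB) {
            // Naive policy: clear caches until we are under the limit.
            for (CacheBase* c : caches_) {
                if (totalB() <= limitB) break;
                c->clear();
            }
        }
    private:
        std::vector<CacheBase*> caches_;
    };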

I was wondering whether there are some 'standard' techniques for doing something like that. Are there any idioms/patterns I should search for?

Also, would it be better to track the memory usage myself (it seems more portable but also more laborious), or to use some technique like reading /proc/<pid> on Linux, etc.?

Upvotes: 1

Views: 174

Answers (1)

Anton Pegushin

Reputation: 470

Yes, there are standard techniques for caching and memory rebalancing. The simplest approach follows what you are already thinking of doing: create a cache 'factory' or 'manager'. It allocates cache objects on demand, each with a size limit (think of it as a CPU cache line, which has a preset size of 64 bytes). Knowing only the number of cache objects allocated, the manager can roughly estimate the amount of used memory and compare it to a total_max_limit, which it would know based on the machine it runs on, the type of the OS, and so on.

When the total_max_limit is hit and some cache objects need to be freed, the most commonly used approach is LRU: choose the least recently used cache object to destroy. To implement this you store pointers to the cache objects inside the manager in a deque. When a cache object is accessed, it tells the manager (through the pointer stored in the cache object structure) to 'mark-as-accessed', i.e. to move the pointer to this cache object to the front of the deque. This means that the last pointer in the deque (the tail) always references the least recently used cache object, and factory.rebalance() can simply pop_back and free the returned object.
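A minimal sketch of such a manager (the names are mine, not a standard API; I use std::list rather than std::deque so that 'mark-as-accessed' is an O(1) splice, but the eviction logic is the same):

    #include <cstddef>
    #include <list>

    struct CacheObject {
        std::size_t sizeB = 0;                    // approximate footprint in bytes
        std::list<CacheObject*>::iterator pos;    // this object's slot in the LRU list
        // ... the actual key/value storage would live here ...
    };

    class CacheManager {
    public:
        explicit CacheManager(std::size_t totalMaxLimit) : maxB_(totalMaxLimit) {}

        void add(CacheObject* obj) {
            lru_.push_front(obj);                 // newest entries live at the front
            obj->pos = lru_.begin();
            usedB_ += obj->sizeB;
            rebalance();
        }
        // Called by a cache object whenever it is accessed: move it to the
        // front so the tail is always the least recently used object.
        void markAccessed(CacheObject* obj) {
            lru_.splice(lru_.begin(), lru_, obj->pos);
        }
        void rebalance() {
            // Evict from the tail until we fit; keep at least the most
            // recently used object so add() never frees what it just added.
            while (usedB_ > maxB_ && lru_.size() > 1) {
                CacheObject* victim = lru_.back();  // least recently used
                lru_.pop_back();
                usedB_ -= victim->sizeB;
                delete victim;
            }
        }
    private:
        std::size_t maxB_ = 0;
        std::size_t usedB_ = 0;
        std::list<CacheObject*> lru_;             // front = newest, back = LRU
    };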

There are other algorithms, but LRU is the most commonly used one. Priority caching can be implemented on top of it as well: create several 'cache managers' and distribute their total_max_limits so that the higher-priority one gets more memory and the lower-priority ones get progressively less. As a result, lower-priority entries are evicted faster while more of the higher-priority entries stay in memory. This approach is likely to perform better than evaluating some weight-based formula on every access to decide how far from the head of the deque a particular cache object should be moved.
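Reusing the CacheManager sketch above, the priority scheme might be wired up like this (the 512 MB budget and the 60/30/10 split are made-up numbers, purely for illustration):

    #include <cstddef>

    int main() {
        const std::size_t totalBudget = 512u * 1024 * 1024;  // overall memory budget

        // One CacheManager (from the sketch above) per priority tier; the
        // budget is skewed toward the high-priority tier, so low-priority
        // entries get squeezed out sooner simply because their limit is smaller.
        CacheManager high(totalBudget / 100 * 60);    // evicted last
        CacheManager normal(totalBudget / 100 * 30);
        CacheManager low(totalBudget / 100 * 10);     // evicted first under pressure
    }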

Upvotes: 1
