Roman Kutlak

Reputation: 2784

Memory awareness and large data

I am currently working on a project that uses a lot of text (hundreds of MB to a few GB of text - the DBpedia datasets). To save space I map the strings to numbers and work with the strings only when I need to print something. To speed up the algorithms that work with the data I designed a Cache class that serves as a key-value cache. The problem is, of course, that when the program runs for a long time the cache becomes quite big.
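For reference, the string-to-number mapping is essentially string interning; a minimal sketch (the names are illustrative, not my actual code) looks roughly like this:

    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Interning table: each distinct string gets a small integer id,
    // and the id can be mapped back to the string for printing.
    class StringPool {
    public:
        std::uint32_t intern(const std::string& s) {
            auto it = ids_.find(s);
            if (it != ids_.end()) return it->second;
            std::uint32_t id = static_cast<std::uint32_t>(strings_.size());
            strings_.push_back(s);
            ids_.emplace(s, id);
            return id;
        }
        const std::string& lookup(std::uint32_t id) const { return strings_[id]; }
    private:
        std::vector<std::string> strings_;                    // id -> string
        std::unordered_map<std::string, std::uint32_t> ids_;  // string -> id
    };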

The way I manage it at the moment is to limit the cache to a particular number of entries. This works, but it is not great. A more flexible approach would be to have a memory limit shared across all caches and, when the limit is reached, disable caching or even empty some of the caches depending on their importance and size.

I am considering implementing a sizeB() method that would return the size of a cache in bytes, so that each instance could report how much memory it is using. But this, of course, does not solve the problem of when to stop caching... I would probably have to track all the memory usage manually. Perhaps some singleton CacheFactory where all caches are registered, and which empties them once the limit is reached?
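A rough sketch of that idea (all the names here are hypothetical) would be:

    #include <cstddef>
    #include <vector>

    // Every cache reports its approximate footprint and registers with
    // a global manager that can sum the totals and empty caches once a
    // limit is exceeded.
    class CacheBase {
    public:
        virtual ~CacheBase() = default;
        virtual std::size_t sizeB() const = 0;  // approximate size in bytes
        virtual void clear() = 0;
    };

    class CacheFactory {
    public:
        static CacheFactory& instance() {
            static CacheFactory f;  // the singleton mentioned above
            return f;
        }
        void registerCache(CacheBase* c) { caches_.push_back(c); }
        std::size_t totalB() const {
            std::size_t total = 0;
            for (const CacheBase* c : caches_) total += c->sizeB();
            return total;
        }
        void enforce(std::size_t limitB) {
            // Naive policy: clear caches until we are under the limit.
            for (CacheBase* c : caches_) {
                if (totalB() <= limitB) break;
                c->clear();
            }
        }
    private:
        std::vector<CacheBase*> caches_;
    };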

I was wondering whether there are some 'standard' techniques for doing something like that. Are there any idioms/patterns I should search for?

Also, would it be better to track the memory usage myself (it seems more portable but also more laborious), or to use some technique like reading /proc/<pid> on Linux, etc.?

Upvotes: 1

Views: 174

Answers (1)

Anton Pegushin

Reputation: 470

Yes, there are standard techniques for caching and memory rebalancing. The simplest approach follows what you are already thinking of doing: create a cache 'factory' or 'manager'. It allocates cache objects on demand, each with a size limit (think of it as a CPU cache line, which has a preset size of 64 bytes). Knowing only the number of cache objects allocated, the manager can roughly estimate the amount of used memory and compare it to a total_max_limit, which it would know based on the machine it runs on, the type of the OS, and so on.

When the total_max_limit is hit and some cache objects need to be freed, the most commonly used approach is LRU: choose the least recently used cache object to destroy. To implement this you store pointers to the cache objects inside the manager in a deque. When a cache object is accessed, it tells the manager (through the pointer stored in the cache object structure) to 'mark-as-accessed', i.e. to move the pointer to this cache object to the front of the deque. This means that the last pointer in the deque (the tail) always references the least recently used cache object, and factory.rebalance() can simply pop_back and free the returned object.
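A minimal sketch of such a manager (the names are mine, not a standard API; I use std::list rather than std::deque so that 'mark-as-accessed' is an O(1) splice, but the eviction logic is the same):

    #include <cstddef>
    #include <list>

    struct CacheObject {
        std::size_t sizeB = 0;                    // approximate footprint in bytes
        std::list<CacheObject*>::iterator pos;    // this object's slot in the LRU list
        // ... the actual key/value storage would live here ...
    };

    class CacheManager {
    public:
        explicit CacheManager(std::size_t totalMaxLimit) : maxB_(totalMaxLimit) {}

        void add(CacheObject* obj) {
            lru_.push_front(obj);                 // newest entries live at the front
            obj->pos = lru_.begin();
            usedB_ += obj->sizeB;
            rebalance();
        }
        // Called by a cache object whenever it is accessed: move it to the
        // front so the tail is always the least recently used object.
        void markAccessed(CacheObject* obj) {
            lru_.splice(lru_.begin(), lru_, obj->pos);
        }
        void rebalance() {
            // Evict from the tail until we fit; keep at least the most
            // recently used object so add() never frees what it just added.
            while (usedB_ > maxB_ && lru_.size() > 1) {
                CacheObject* victim = lru_.back();  // least recently used
                lru_.pop_back();
                usedB_ -= victim->sizeB;
                delete victim;
            }
        }
    private:
        std::size_t maxB_ = 0;
        std::size_t usedB_ = 0;
        std::list<CacheObject*> lru_;             // front = newest, back = LRU
    };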

There are other algorithms, but LRU is the most commonly used one. Priority caching can be implemented on top of it as well: create several 'cache managers' and distribute their total_max_limits so that the higher-priority one gets more memory and the lower-priority ones get progressively less. As a result, lower-priority entries are evicted faster while more of the higher-priority entries stay in memory. This approach is likely to perform better than evaluating some weight-based formula on every access to decide how far from the head of the deque a particular cache object should be moved.
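Reusing the CacheManager sketch above, the priority scheme might be wired up like this (the 512 MB budget and the 60/30/10 split are made-up numbers, purely for illustration):

    #include <cstddef>

    int main() {
        const std::size_t totalBudget = 512u * 1024 * 1024;  // overall memory budget

        // One CacheManager (from the sketch above) per priority tier; the
        // budget is skewed toward the high-priority tier, so low-priority
        // entries get squeezed out sooner simply because their limit is smaller.
        CacheManager high(totalBudget / 100 * 60);    // evicted last
        CacheManager normal(totalBudget / 100 * 30);
        CacheManager low(totalBudget / 100 * 10);     // evicted first under pressure
    }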

Upvotes: 1
