Reputation: 836
Let's say I have a URL class in Python. The URL class represents a URL and offers a method to download its contents. It also caches the content to speed up subsequent calls.
from dataclasses import dataclass
from typing import Optional
import requests

@dataclass(frozen=True)
class URL:
    url: str
    _cache: Optional[bytes] = None

    def get_content(self) -> bytes:
        if self._cache is None:
            # requests.get() returns a Response; .content holds the raw bytes.
            # object.__setattr__ bypasses the frozen-dataclass protection.
            object.__setattr__(self, "_cache", requests.get(self.url).content)
        return self._cache
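For illustration, usage looks like this (the URL is just a placeholder):

url = URL("https://example.com/data")
content = url.get_content()  # first call downloads and fills _cache
content = url.get_content()  # subsequent calls are served from _cache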
This code has worked fine so far. Now I am asked to download huge numbers of URLs (for sane reasons), and to guard against misuse I want to support use cases where every instance of URL stays alive until all URLs are downloaded. Keeping huge numbers of URL instances alive and cached will exhaust memory, so I am wondering how I could design a cache that forgets only under memory pressure.
I considered several alternatives, but none of them fits.
tl;dr: How could I implement a cache in Python that holds weak references and drops them only rarely, and only under memory pressure?
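For reference, the weak-reference part alone is easy to sketch, but it evicts too eagerly: entries disappear as soon as the last strong reference is gone, not under memory pressure. A minimal sketch (note that plain bytes objects cannot be weakly referenced in CPython, hence the subclass):

import weakref
import requests

class _Content(bytes):
    # subclass only so instances support weak references; bytes itself does not
    pass

_cache = weakref.WeakValueDictionary()  # maps url string -> _Content

def get_content(url: str) -> bytes:
    content = _cache.get(url)
    if content is None:
        content = _Content(requests.get(url).content)
        _cache[url] = content  # entry vanishes once no strong reference remains
    return content

This is exactly the too-eager behaviour I want to avoid: Java's SoftReference keeps the object alive until the JVM is under memory pressure, and I am not aware of a built-in CPython equivalent.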
Updates:
I have no clear criteria in mind; I expect that others have developed good solutions I simply do not know of. In Java, I suspect SoftReferences would be an acceptable solution.
MisterMyagi found a good wording:
Say the URL cache would evict an item if some unrelated, numeric computation needs memory?
I want to keep elements as long as possible, but free them as soon as any other code in the same Python process needs the memory.
Maybe a solution would let only the garbage collector drop instances of URL; then I could try to configure the garbage collector accordingly. But perhaps someone has already come up with a cleverer idea, so that I can avoid reinventing the wheel.
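One direction I can imagine, sketched below, is to check available system memory on every insertion and evict when it runs low. This is untested; psutil is a third-party library I am assuming here, and both the threshold and the evict-everything policy are arbitrary choices:

from typing import Optional
import psutil  # third-party, assumed available

class PressureCache:
    # dict-like cache that drops its entries when free memory gets scarce
    def __init__(self, min_free_fraction: float = 0.1) -> None:
        self._data: dict[str, bytes] = {}
        self._min_free = min_free_fraction

    def _under_pressure(self) -> bool:
        mem = psutil.virtual_memory()
        return mem.available / mem.total < self._min_free

    def set(self, key: str, value: bytes) -> None:
        if self._under_pressure():
            self._data.clear()  # crude: an LRU policy would evict more selectively
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

Polling memory like this is racy, though, which is why I hope for an established solution.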
Upvotes: 1
Views: 248
Reputation: 836
I found a nice solution to the concrete problem I was facing. Unfortunately, it is not an answer to the question as asked, which is concerned only with a cache, so I will not accept my own answer. However, I hope it will be inspiring for others.
In my application, there is a storage backend which calls url.get_content(). Using the storage backend looks roughly like this:
storage = ConcreteStorageBackend()
list_of_urls = [URL(url_str) for url_str in list_of_url_strings]
for url in list_of_urls:
    storage.add_url(url)
storage.sync()  # triggers an internal loop that calls get_content() on all URLs
It is easy to see that when list_of_urls is huge, caching in get_content() can cause memory problems. The solution here is to replace (or manipulate) the URL objects such that the new URL objects retrieve the data from the storage backend, e.g. list_of_urls = storage.sync(), where storage.sync() returns new URL instances. Alternatively, replacing URL.get_content() with a new function would allow the user to stay completely ignorant of the performance considerations here. In both cases, caching means avoiding a second download and obtaining the content from the storage backend instead. For this to pay off, getting data from the storage backend has to be faster than downloading the content again, but I assume that is quite often the case.
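To make the replacement idea concrete, here is a minimal sketch of the kind of object storage.sync() could hand back. StoredURL and the path attribute are hypothetical names; my real backend differs:

from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class StoredURL:
    # drop-in replacement for URL: same get_content() signature, no in-memory cache
    url: str
    path: Path  # where the backend persisted the downloaded bytes

    def get_content(self) -> bytes:
        # repeated reads are served from the OS page cache while RAM is free
        return self.path.read_bytes()

storage.sync() would then download each URL once, write the bytes to path, and return one StoredURL per original URL object.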
Caching by means of a storage backend has another cool benefit in many cases: the OS kernel itself takes care of the caching. I assume that most storage backends will eventually write to disk, e.g. to a file or to a database, and the Linux kernel (at least) caches disk I/O in free RAM. Thus, as long as your RAM is sufficiently large, most url.get_content() calls run at the speed of RAM. However, this cache does not block other, more important memory allocations, e.g. for some unrelated numeric computation, as mentioned in one comment.
Again, I am aware that this does not exactly match the question above, as it relies on a storage backend and does not fall back to downloading the data from the internet again. I hope this answer is helpful to others anyway.
Upvotes: 1