Dave

Reputation: 11879

Practical (Django) Caching Strategy & Implementation? Cache long, Invalidate cache upon data change

I have a Django app that receives near-realtime data (tweets and votes), although updates arrive only every minute or two on average. However, we want the site and API results to reflect new data as soon as it comes in.

We might see a whole ton of load on this site, so my initial thought is of course caching!

Is it practical to have some sort of Memcached cache that gets invalidated manually by another process or event? In other words, I would cache views for a long time, and then have new tweets and votes invalidate the entire view.

I'm not concerned with invalidating only some of the objects, and I considered subclassing the MemcachedCache backend to add functionality for this strategy. But Django's sessions also use Memcached as a write-through cache, and I don't want to invalidate those.

Upvotes: 7

Views: 3802

Answers (2)

Dave

Reputation: 11879

Thanks to @rdegges's suggestions, I was able to figure out a great way to do this.

I follow this paradigm:

  • Cache rendered template fragments and API calls for five minutes (or longer)
  • Invalidate the cache each time new data is added.
    • Simply invalidating the cache is better than recaching on save, because new cached data is generated automatically and organically when no cached data is found.
  • Manually invalidate the cache after I have done a full update (say from a tweet search), not on each object save.
    • This has the benefit of invalidating the cache fewer times, but the downside is that it is not as automatic.
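As a rough, framework-free illustration of the list above (not the Django implementation itself), the sketch below uses a toy in-memory cache to show two ideas: a miss regenerates the fragment on demand, and a batch import invalidates once at the end rather than once per row. The names `FragmentCache`, `get_or_render`, and `import_batch` are hypothetical and exist only for this example.

```python
import time


class FragmentCache:
    """Toy in-memory cache with timeouts, standing in for Memcached."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Expired entries behave like misses.
            del self._store[key]
            return None
        return value

    def set(self, key, value, timeout=300):
        self._store[key] = (value, time.monotonic() + timeout)

    def clear(self):
        self._store.clear()


def get_or_render(cache, key, render, timeout=300):
    """Return the cached fragment, rendering and caching it on a miss."""
    value = cache.get(key)
    if value is None:
        value = render()
        cache.set(key, value, timeout)
    return value


def import_batch(cache, database, new_rows):
    """Store a whole batch, then invalidate once instead of once per row."""
    database.extend(new_rows)
    cache.clear()


cache = FragmentCache()
tweets = ["first"]
renders = []  # one entry per actual render, to show when work happens


def render_timeline():
    renders.append(len(tweets))
    return ", ".join(tweets)


page_1 = get_or_render(cache, "timeline", render_timeline)  # miss: renders
page_2 = get_or_render(cache, "timeline", render_timeline)  # hit: no render
import_batch(cache, tweets, ["second", "third"])            # one invalidation
page_3 = get_or_render(cache, "timeline", render_timeline)  # miss: re-renders
```

Nothing is recached at import time; the first request after the invalidation pays the rendering cost, which is the "organic" regeneration described above.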

Here's all the code you need to do it this way:

from django.conf import settings
from django.core.cache import get_cache
from django.core.cache.backends.memcached import MemcachedCache
from django.utils.encoding import smart_str
from time import time

class NamespacedMemcachedCache(MemcachedCache):

    def __init__(self, *args, **kwargs):
        super(NamespacedMemcachedCache, self).__init__(*args, **kwargs)
        self.cache = get_cache(getattr(settings, 'REGULAR_CACHE', 'regular'))
        self.reset()

    def reset(self):
        namespace = str(time()).replace('.', '')
        self.cache.set('namespaced_cache_namespace', namespace, 0)
        # note that (very important) we are setting
        # this in the non namespaced cache, not our cache.
        # otherwise stuff would get crazy.
        return namespace

    def make_key(self, key, version=None):
        """Constructs the key used by all other methods. By default it
        uses the key_func to generate a key (which, by default,
        prepends the `key_prefix` and `version`). A different key
        function can be provided at the time of cache construction;
        alternatively, you can subclass the cache backend to provide
        custom key making behavior.
        """
        if version is None:
            version = self.version

        namespace = self.cache.get('namespaced_cache_namespace')
        if not namespace:
            namespace = self.reset()
        return ':'.join([self.key_prefix, str(version), namespace, smart_str(key)])

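This works by adding a version, or namespace, to the key of each cached entry, and storing that namespace in the regular (non-namespaced) cache. The namespace is just the epoch time at the moment reset() is called, so calling reset() changes every generated key and makes all previously cached entries unreachable.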
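The mechanism can be seen in isolation with a minimal sketch that uses plain dicts in place of the two Memcached-backed caches. This is an illustration under assumptions, not the class above: `NamespacedCache` is a hypothetical name, and a counter replaces the timestamp so the behaviour is deterministic.

```python
import itertools


class NamespacedCache:
    """Dict-backed illustration of namespace-based invalidation."""

    NAMESPACE_KEY = "namespaced_cache_namespace"

    def __init__(self, entries, regular):
        self.entries = entries  # stands in for the namespaced cache
        self.regular = regular  # stands in for the non-namespaced cache
        self._tokens = itertools.count(1)  # assumption: counter, not time()

    def reset(self):
        # Store a fresh namespace token in the *regular* cache.
        namespace = str(next(self._tokens))
        self.regular[self.NAMESPACE_KEY] = namespace
        return namespace

    def make_key(self, key):
        # Look up the current namespace, creating one if it is missing.
        namespace = self.regular.get(self.NAMESPACE_KEY) or self.reset()
        return f"{namespace}:{key}"

    def set(self, key, value):
        self.entries[self.make_key(key)] = value

    def get(self, key, default=None):
        return self.entries.get(self.make_key(key), default)


regular = {"session:abc": "logged-in"}
cache = NamespacedCache(entries={}, regular=regular)

cache.set("timeline", "rendered timeline")
before = cache.get("timeline")  # found under the current namespace
cache.reset()                   # new namespace: old keys are never built again
after = cache.get("timeline")   # miss, although the old entry still exists
```

Old entries are not deleted; they simply stop being looked up. With Memcached they would presumably be evicted over time as memory fills, while unrelated data such as the session entry in `regular` is untouched.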

You must specify your alternate non-namespaced cache with settings.REGULAR_CACHE (the code falls back to the alias 'regular'), so the namespace can be stored in a non-namespaced cache (so it doesn't get recursive!).

Whenever you add a bunch of data and want to clear your cache (assuming you have set this one as the default cache), just do:

from django.core.cache import cache
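cache.reset()  # start a new namespace; clear() would flush the whole Memcached server, including anything else stored on it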

You can access any configured cache by its alias with:

from django.core.cache import get_cache
some_cache = get_cache('some_cache_key')

Finally, I recommend that you don't put your sessions in this cache. You can point sessions at a different cache by setting settings.SESSION_CACHE_ALIAS.
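A settings sketch for this arrangement might look like the following. The module path `myproject.cache` and the server address are assumptions, as is the choice of the cache-only session engine; the alias names match the `'regular'` default used in the backend code.

```python
# settings.py (sketch; paths and addresses are hypothetical)
CACHES = {
    # Namespaced cache for views, template fragments, and API results.
    "default": {
        "BACKEND": "myproject.cache.NamespacedMemcachedCache",
        "LOCATION": "127.0.0.1:11211",
    },
    # Plain cache that holds the namespace token and the sessions.
    "regular": {
        "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
        "LOCATION": "127.0.0.1:11211",
    },
}

# Alias read by NamespacedMemcachedCache.__init__ via getattr(settings, ...).
REGULAR_CACHE = "regular"

# Keep sessions out of the namespaced cache so a reset does not log users out.
SESSION_ENGINE = "django.contrib.sessions.backends.cache"
SESSION_CACHE_ALIAS = "regular"
```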

Upvotes: 4

rdegges

Reputation: 33824

Cache invalidation is probably the best way to handle the stuff you're trying to do. Based on your question's wording, I'm going to assume the following about your app:

  • You have some sort of API in place that is receiving new information updates and NOT doing polling. EG: Every minute or two you get an API request, and you store some information in your database.
  • You are already using Memcached to cache stuff for reading. Probably via a cronjob or similar process that periodically scans your database and updates your cache.

Assuming the above two things are true, cache invalidation is definitely the way to go. Here's the best way to do it in Django:

  1. A new API request comes into your server that contains new data to be stored. You save it in the database, and use a post_save signal on your model class (e.g. Tweet, Poll, etc.) to update your memcached data.
  2. A user visits your site and requests to read their most recent tweets, polls, or whatever.
  3. You pull the tweet, poll, etc., data out of memcached, and display it to them.

This is essentially what Django signals are meant for. They'll run automatically after your object is saved / updated, which is a great time to update your cache stores with the freshest information.

Doing it this way means that you'll never need to run a background job that periodically scans your database and updates your cache: your cache will always be up to date with the latest data.
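A minimal sketch of such a receiver is shown below. To keep it runnable without a Django project, `DictCache` stands in for `django.core.cache.cache` and `Tweet` is a plain class rather than a model; the key name, the `text` attribute, and the list size are all hypothetical. The receiver's signature follows the arguments Django passes to post_save handlers.

```python
class DictCache:
    """Stand-in exposing the get/set subset of Django's cache API."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value, timeout=None):
        self._data[key] = value  # timeout is ignored in this toy version


cache = DictCache()  # in a real project: from django.core.cache import cache

LATEST_TWEETS_KEY = "latest_tweets"  # hypothetical cache key
MAX_TWEETS = 3                       # hypothetical size of the cached list


def update_latest_tweets(sender, instance, created, **kwargs):
    """post_save-style receiver: push a newly created object into the cache."""
    if not created:
        return  # this sketch ignores updates to existing rows
    latest = cache.get(LATEST_TWEETS_KEY, [])
    cache.set(LATEST_TWEETS_KEY, ([instance.text] + latest)[:MAX_TWEETS])


# In Django, the receiver would be registered with something like:
#   from django.db.models.signals import post_save
#   post_save.connect(update_latest_tweets, sender=Tweet)


class Tweet:
    """Plain stand-in for a model instance."""

    def __init__(self, text):
        self.text = text


# Simulate four saves; Django would call the receiver after each save().
for text in ["a", "b", "c", "d"]:
    update_latest_tweets(sender=Tweet, instance=Tweet(text), created=True)

latest_tweets = cache.get(LATEST_TWEETS_KEY)
```

Note that the get-then-set step is not atomic, so two saves arriving at the same moment could overwrite each other's update; deleting the key in the receiver and letting the next read rebuild it avoids that at the cost of one extra query.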

Upvotes: 7
