Rich

Reputation: 1779

Django data sharding

I have successfully got my application running over several databases using a model-based routing scheme, i.e. model A lives on DB A and model B lives on DB B. I now need to shard my data. I am looking at the docs and having trouble working out how to do it, as the same model needs to exist on multiple database servers. I want to have a flag saying that the DB for NEW members is now database X, that members X-Y live on database N, etc.

How do I do that? Is it done with **hints? That parameter seems inadequately documented to me.

Upvotes: 5

Views: 2233

Answers (2)

MiniQuark

Reputation: 48456

The hints parameter is designed to help your database router decide where it should read or write its data. It may evolve in future versions of Django, but for now there is just one kind of hint that the Django framework may give: the instance it is working on.

I wrote this very simple database router to see what Django does:

# routers.py
import logging
logger = logging.getLogger("my_project")

class DebugRouter(object):
    """A debugging router"""

    def db_for_read(self, model, **hints):
        logger.debug("db_for_read %s" % repr((model, hints)))
        return None

    def db_for_write(self, model, **hints):
        logger.debug("db_for_write %s" % repr((model, hints)))
        return None

    def allow_relation(self, obj1, obj2, **hints):
        logger.debug("allow_relation %s" % repr((obj1, obj2, hints)))
        return None

    def allow_syncdb(self, db, model):
        logger.debug("allow_syncdb %s" % repr((db, model)))
        return None

You declare this in settings.py:

DATABASE_ROUTERS = ["my_project.routers.DebugRouter"]

Make sure logging is properly configured to emit debug messages (for example to stderr):

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        # ... some other handlers ...
        'stderr': {
            'level': 'DEBUG',
            'class': 'logging.StreamHandler'
        }
    },
    'loggers': {
        # ... some other loggers ...
        'my_project': {
            'handlers': ['stderr'],
            'level': 'DEBUG',
            'propagate': True,
        },
    }
}
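
To check this logger configuration outside Django, a minimal sketch using the standard library's logging.config.dictConfig (Django passes the LOGGING setting to it by default) could look like the following. The elided handlers and loggers are left out, and the logged message is only an illustration.

```python
import logging
import logging.config

# Only the 'stderr' handler and 'my_project' logger from the settings above;
# the elided entries are intentionally omitted.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "stderr": {
            "level": "DEBUG",
            "class": "logging.StreamHandler",  # writes to sys.stderr by default
        },
    },
    "loggers": {
        "my_project": {
            "handlers": ["stderr"],
            "level": "DEBUG",
            "propagate": True,
        },
    },
}

logging.config.dictConfig(LOGGING)
logger = logging.getLogger("my_project")

# Illustrative message in the same shape the DebugRouter would log.
logger.debug("db_for_read %s", repr(("SomeModel", {})))
```

If nothing shows up on stderr with this configuration, the problem is more likely in the logger name than in the router.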

Then you can open a Django shell and test a few requests to see what data your router is being given:

$ ./manage.py shell
[...]
>>> from my_project.my_app.models import User
>>> User.objects.get(pk = 1234)
db_for_read (<class 'my_project.my_app.models.User'>, {})
<User: User object>
>>> user = User.objects.create(name = "Arthur", title = "King")
db_for_write (<class 'my_project.my_app.models.User'>, {})
>>> user.name = "Kong"
>>> user.save()
db_for_write (<class 'my_project.my_app.models.User'>, {'instance':
              <User: User object>})
>>>

As you can see, the hints dict is always empty when no instance is available in memory yet. So you cannot use routers if you need query parameters (the object's id, for example) to determine which database to query. It might become possible in the future if Django provides the query or queryset objects in the hints dict.
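
To make the limitation concrete, here is a minimal sketch of a router that can pick a shard only when an instance is passed in the hints. The shard aliases and the modulo rule are hypothetical, chosen purely for illustration. Routers are plain Python classes, so the sketch runs without Django.

```python
# Hypothetical shard aliases; in a real project they would have to be
# defined in settings.DATABASES.
SHARD_ALIASES = ["shard_0", "shard_1", "shard_2"]


def shard_for_pk(pk):
    """Illustrative rule only: choose a shard by pk modulo the shard count."""
    return SHARD_ALIASES[pk % len(SHARD_ALIASES)]


class InstanceShardRouter(object):
    """Routes by the 'instance' hint; returns None when no usable hint exists."""

    def _route(self, hints):
        instance = hints.get("instance")
        pk = getattr(instance, "pk", None)
        if pk is None:
            # No instance (e.g. Model.objects.get(pk=...)) or an unsaved
            # object: the router has nothing to go on, so it expresses no
            # opinion and lets the next router or the default database decide.
            return None
        return shard_for_pk(pk)

    def db_for_read(self, model, **hints):
        return self._route(hints)

    def db_for_write(self, model, **hints):
        return self._route(hints)
```

With such a router, saving an existing object can be routed, but a plain lookup by primary key still arrives with empty hints and falls through to the default database.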

So to answer your question, I would say that for now you must create a custom Manager, as suggested by Aaron Merriam. But overriding just the create method is not enough, since you also need to be able to fetch an object from the appropriate database. Something like this might work (not tested yet):

from django.db import models

class CustomManager(models.Manager):
    def find_database_alias(self, pk):
        return  # ... implement the logic to determine the shard from the pk

    def new_object_database_alias(self):
        return  # ... database alias for a new object

    def get(self, *args, **kwargs):
        # Lookups must include the primary key so the shard can be determined.
        pk = kwargs.get("pk")
        if pk is None:
            raise Exception("Sharded table: you must provide the primary key")
        db_alias = self.find_database_alias(pk)
        qs = self.get_query_set().using(db_alias)
        return qs.get(*args, **kwargs)

    def create(self, *args, **kwargs):
        # New objects always go to the shard currently flagged for inserts.
        db_alias = self.new_object_database_alias()
        qs = super(CustomManager, self).using(db_alias)
        return qs.create(*args, **kwargs)

class ModelA(models.Model):
    objects = CustomManager()
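
As one possible way to fill in the two placeholder methods, here is a sketch of a range table like the one the question describes (members X-Y on database N, new members on a flagged database). The ranges, the aliases, and NEW_OBJECT_ALIAS are hypothetical, and the sketch assumes primary keys are allocated so that each shard only ever holds keys inside its own range.

```python
import bisect

# Hypothetical table: (first pk stored on the shard, database alias).
# Entries must be sorted by first pk, and the aliases would have to exist
# in settings.DATABASES.
SHARD_RANGES = [
    (1, "members_a"),
    (100000, "members_b"),
    (250000, "members_c"),
]

# The "flag" from the question: the shard on which new members are created.
NEW_OBJECT_ALIAS = "members_c"

_RANGE_STARTS = [start for start, _alias in SHARD_RANGES]


def find_database_alias(pk):
    """Return the alias of the shard whose pk range contains pk."""
    # bisect_right gives the position after the last range start <= pk.
    index = bisect.bisect_right(_RANGE_STARTS, pk) - 1
    if index < 0:
        raise ValueError("no shard holds pk %r" % (pk,))
    return SHARD_RANGES[index][1]


def new_object_database_alias():
    """Return the alias currently flagged for new objects."""
    return NEW_OBJECT_ALIAS
```

The manager methods could simply delegate to these functions. Moving the flag to a new shard then means appending a range and changing NEW_OBJECT_ALIAS.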

Cheers

Upvotes: 6

Piper Merriam

Reputation: 2954

The using method should allow you to designate which database you want to use.

Overriding the create method in a custom manager might accomplish what you're looking to do.

from django.db import models

class CustomManager(models.Manager):
    def get_query_set(self):
        return super(CustomManager, self).get_query_set()

    def create(self, *args, **kwargs):
        return super(CustomManager, self).using('OTHER_DB').create(*args, **kwargs)

class ModelA(models.Model):
    objects = CustomManager()

I have not tested this, so I don't know whether you can tack a 'create' onto the end of a 'using'.

Upvotes: 3
