kofm
kofm

Reputation: 145

Django ORM query optimisation with multiple joins

In my app, I can describe an Entity using different Protocols, with each Protocol being a collection of various Traits, and each Trait allows two or more Classes. So, a Description is a collection of Expressions. E.g., I want to describe an entity "John" with the Protocol "X" that comprises the following two Traits and Classes:

Protocol ABC

Trait 1: Height

Available Classes: a. Short b. Medium c. Tall

Trait 2: Weight

Available Classes: a. Light b. Medium c. Heavy

John's Description: Expression 1: c. Tall, Expression 2: b. Medium

My model specification (barebone essentials for simplicity):

class Protocol(models.Model):
    """
    A Protocol is a collection of Traits
    """
    name = models.CharField()

class Trait(models.Model):
    """
    Stores the Traits. Each Trait can have multiple Classes
    """

    name = models.CharField()
    protocol = models.ForeignKey(
        Protocol,
        help_text="The reference protocol of the trait",
    )

class Class(models.Model):
    """
    Stores the different Classes related to a Trait.
    """

    name = models.CharField()
    trait = models.ForeignKey(Trait)

class Description(models.Model):
    """
    Stores the Descriptions. A description is a collection of Expressions.
    """

    name = models.CharField()
    protocol = models.ForeignKey(
        Protocol,
        help_text="reference to the protocol used to make the description;\
            this will define which Traits will be available",
    )
    entity = models.ForeignKey(
        Entity,
        help_text="the Entity to which the description refers to",
    )

class Expression(models.Model):
    """
    Stores the expressions of entities related to a specific
    Description. It refers to one particular Class (which is
    then associated with a specific Trait)
    """

    class = models.ForeignKey(Class)
    description = models.ForeignKey(Description)

Following the previous example, let's say I want to find all the Entities that are medium or tall (Trait 1) and heavy (Trait 2). The query I'm now using is the following:

# This is the filter returned by the HTML form, which list
# all the available Classes for each Trait of the selected Protocol
filters = [
  {'trait': 1, 'class': [2, 3]},
  {'trait': 2, 'class': [6,]},
]

queryset = Description.objects.all()

for filter in filters:
  queryset = queryset.filter(expression_set__class__in=filter["class"])

The problem is that the query is slow (I have ATM ~1000 Descriptions, described with a Protocol of 40 Traits, each Trait having 2 to 5 Classes). It takes about two seconds to return the results even when filtering by only 5-6 Expressions. I tried using prefetch_related("expression_set") or prefetch_related("expression_set__class") but with no significant improvement.

The question is: can you suggest a way to improve the performance, or this is simply the reality of searching through so many tables?

Thank you very much for your time.


EDIT: The following is the query generated by the Manager when, e.g., eight filters (see previous code snippet) are applied.

SELECT "describe_description"."id",
       "describe_description"."name",
       "describe_description"."protocol_id",
  FROM "describe_description"
 INNER JOIN "describe_expression"
    ON ("describe_description"."id" = "describe_expression"."description_id")
 INNER JOIN "describe_expression" T4
    ON ("describe_description"."id" = T4."description_id")
 INNER JOIN "describe_expression" T6
    ON ("describe_description"."id" = T6."description_id")
 INNER JOIN "describe_expression" T8
    ON ("describe_description"."id" = T8."description_id")
 INNER JOIN "describe_expression" T10
    ON ("describe_description"."id" = T10."description_id")
 INNER JOIN "describe_expression" T12
    ON ("describe_description"."id" = T12."description_id")
 INNER JOIN "describe_expression" T14
    ON ("describe_description"."id" = T14."description_id")
 INNER JOIN "describe_expression" T16
    ON ("describe_description"."id" = T16."description_id")
 INNER JOIN "describe_expression" T18
    ON ("describe_description"."id" = T18."description_id")
 WHERE ("describe_expression"."class_id" IN (732) AND T4."class_id" IN (740) AND T6."class_id" IN (760) AND T8."class_id" IN (783) AND T10."class_id" IN (794) AND T12."class_id" IN (851) AND T14."class_id" IN (857) AND T16."class_id" IN (860) AND T18."class_id" IN (874))

Upvotes: 4

Views: 1414

Answers (4)

preator
preator

Reputation: 1039

First you should avoid multiple joins by aggregating desired filters upfront:

filters = [
  {'trait': 1, 'class': [2, 3]},
  {'trait': 2, 'class': [6,]},
]

queryset = Description.objects.all()
class_filter = []
for filter_entry in filters:
    class_filter.append(filter_entry["class"])
queryset = queryset.filter(expression_set__class__in=class_filter)

Second problem is scanning for text values. Use db_index=True on your Class.name field.

EDIT: There is a difference in chaining the filters on the same table and using Q objects. It does not act like AND on the same object. It seems counterintuitive as in the sql you can see AND but that is AND on multiple joins, where each join effectively duplicates the descriptions (that is why it get's slow). Best explained in the Django docs or this article.

Quick excerpt of the docs:

To select all blogs containing at least one entry from 2008 having “Lennon” in its headline (the same entry satisfying both conditions), we would write:

Blog.objects.filter(entry__headline__contains='Lennon', entry__pub_date__year=2008)

Otherwise, to perform a more permissive query selecting any blogs with merely some entry with “Lennon” in its headline and some entry from 2008, we would write:

Blog.objects.filter(entry__headline__contains='Lennon').filter(entry__pub_date__year=2008)

EDIT 2: Schematic example of above from this answer:

enter image description here

Blog.objects.filter(entry__headline_contains='Lennon', 
entry__pub_date__year=2008)

filters only Blog 1

Blog.objects.filter(entry__headline_contains='Lennon').filter(
entry__pub_date__year=2008)  

filters blog 1 and 2

Upvotes: 2

thebjorn
thebjorn

Reputation: 27311

Different databases have different performance quirks, so without knowing what you're running against it's a bit of a shot in the dark, but have you tried Q-objects?

from django.db.models import QuerySet

q = Q()
for filter in filters:
    q |= Q(expression_set__class_in=filter['class'])
queryset.filter(q)

or simply pre-calculating the filter value (since you're already doing an in-query):

filtr = []
for f in filters:
    filtr += filter['class']
queryset.filter(expression_set__class_in=filtr)

or

queryset = Description.objects.filter(
    expression__class__in=Class.objects.filter(pk__in=filtr)  
)

(filtr as constructed above)

I'm assuming you've examined the query-analyzer output to make sure you're not missing any indexes etc...

Caching is probably not a good idea (since this is close to ad-hoc queries almost everything will be a cache miss), but 1000 * 40 * 5 probably fits in memory, so it might be a solution to read it all into memory if this is a time-critical piece that runs frequently.

Upvotes: 0

Gonçalo Peres
Gonçalo Peres

Reputation: 13582

To understand more about the queries, one can use Django Debug Toolbar. That's helpful to use because it's hard to know how we can improve if we're not able to measure the current state (as it appears to be the case).

Django has a page specific for database access optimization. In it, one can read, for instance, that QuerySets are lazy.

Since OP explored Django ORM and didn't get really good results from it, to improve performance OP may try to use raw SQL queries. In other words, write one's own SQL to retrieve the data. According to the documentation

Django gives you two ways of performing raw SQL queries: you can use Manager.raw() to perform raw queries and return model instances, or you can avoid the model layer entirely and execute custom SQL directly.

Another way to speed up the query can be to add indexes... The absence of such can slow down one's query.

Additionally, OP should consider using some cache, like MemCached. According to Alex Xu,

A cache is a temporary storage area that stores the result of expensive responses or frequently accessed data in memory so that subsequent requests are served more quickly. (...) The cache tier is a temporary data store layer, much faster than the database. The benefits of having a separate cache tier include better system performance, ability to reduce database workloads, and the ability to scale the cache tier independently.

Upvotes: 1

DialFrost
DialFrost

Reputation: 1770

It's slightly better to use multiple functions instead I think. It runs at the same speed as using classes, if not even faster. Check this question out. After you start using functions, you can try using @cached_property(func, name=None):

It’s common to have to call a class instance’s method more than once. If that function is expensive, then doing so can be wasteful.

Using the cached_property decorator saves the value returned by a property; the next time the function is called on that instance, it will return the saved value rather than re-computing it. Note that this only works on methods that take self as their only argument and that it changes the method to a property.

Consider a typical case, where a view might need to call a model’s method to perform some computation, before placing the model instance into the context, where the template might invoke the method once more:

# the model
class Person(models.Model):

    def friends(self):
        # expensive computation
        ...
        return friends

# in the view:
if person.friends():
    ...

And in the template you would have:

{% for friend in person.friends %}

Here, friends() will be called twice. Since the instance person in the view and the template are the same, decorating the friends() method with @cached_property can avoid that:

from django.utils.functional import cached_property

class Person(models.Model):

    @cached_property
    def friends(self):
        ...

Relevant questions and sources:

Upvotes: 1

Related Questions