Johnny Metz
Johnny Metz

Reputation: 5965

Fetch all Django objects that have a M2M relationship

I have the following two Django models:

class Parent(models.Model):
    name = models.CharField(max_length=50)
    children = models.ManyToManyField("Child", through="ParentChild")

    def __str__(self):
        return self.name

class Child(models.Model):
    name = models.CharField(max_length=50)

    def __str__(self):
        return self.name

Let's say I have four Parent objects in my database, only two of which have children:

for parent in Parent.objects.all():
    print(parent, parent.children.count())
# Parent1 3
# Parent2 0
# Parent3 0
# Parent4 1

My goal is to write an efficient database query to fetch all parents that have at least one child. In reality, I have millions of objects so I need this to be as efficient as possible. So far, I've come up with the following solutions:

  1. Using prefetch_related
for parent in Parent.objects.prefetch_related("children"):
    if parent.children.exists():
        print(parent)
# Parent1
# Parent4
  1. Using filter:
for parent in Parent.objects.filter(children__isnull=False).distinct():
    print(parent)
# Parent1
# Parent4
  1. Using exclude:
for parent in Parent.objects.exclude(children__isnull=True):
    print(parent)
# Parent1
# Parent4
  1. Using annotate and exclude:
for parent in Parent.objects.annotate(children_count=Count("children")).exclude(children_count=0):
    print(parent)
# Parent1
# Parent4

Which of these solutions is the fastest? Is there another approach that's even faster / more readable? I'm seeing a django Exists function but it doesn't appear to be applicable for this use case.

Upvotes: 2

Views: 342

Answers (2)

Jiří Baum
Jiří Baum

Reputation: 6930

This is a rather open-ended question...

  • First, as with all optimisation:

    • measure the baseline performance (on real or realistic data), so you know how much you've improved; and
    • determine what performance is satisfactory (from a business point of view), or what the benefits you expect from additional performance.
  • A couple of approaches:

    • Measure all the variants, on the same data as the baseline. It's the only way to be sure.

    • Django provides facilities to count (or log) all queries made; have you tried using these? In most cases, the fewer queries the faster, especially if the database is (or will be) on its own server.

    • Once you have the queries, check them on the database, using EXPLAIN or SHOWPLAN, making sure that they're using relevant indexes as appropriate. You'll need real or realistic data for this. (Optimising database queries is really a separate question; you don't even say what database you're using in this question...)

  • Finally, when you've achieved your objectives, stop. Don't keep optimising beyond the point of diminishing returns, or past the point of no business impact.

Upvotes: 0

willeM_ Van Onsem
willeM_ Van Onsem

Reputation: 476584

The .prefetch_related(…) will (likely) not help, since the .exists() will not use the prefetching, but make an exists query, and thus resulting in an N+1 problem.

You can simply filter on the fact that there is a non-NULL child, and use .distinct() to retrieve each parent once:

Parent.objects.filter(children__isnull=False).distinct()

You can also try with an Exists subquery:

from django.db.models import Exists, OuterRef

# since Django-3.0

Parent.objects.filter(Exists(
    ParentChild.objects.filter(parent_id=OuterRef('pk'))
))

before , you can use .annotate(…), but that likely makes it less efficient:

from django.db.models import Exists, OuterRef

# before Django-3.0

Parent.objects.annotate(has_children=Exists(
    ParentChild.objects.filter(parent_id=OuterRef('pk'))
)).filter(has_children=True)

The exact performance however depends on the databases, and therefore it is probably better to benchmark the queries. It also depends sometimes on the specific database system: a MySQL database can have different benchmarks than a PostgreSQL database.

Upvotes: 2

Related Questions