Reputation: 5965
I have the following two Django models:
class Parent(models.Model):
name = models.CharField(max_length=50)
children = models.ManyToManyField("Child", through="ParentChild")
def __str__(self):
return self.name
class Child(models.Model):
name = models.CharField(max_length=50)
def __str__(self):
return self.name
Let's say I have four Parent
objects in my database, only two of which have children:
for parent in Parent.objects.all():
print(parent, parent.children.count())
# Parent1 3
# Parent2 0
# Parent3 0
# Parent4 1
My goal is to write an efficient database query to fetch all parents that have at least one child. In reality, I have millions of objects so I need this to be as efficient as possible. So far, I've come up with the following solutions:
prefetch_related
for parent in Parent.objects.prefetch_related("children"):
if parent.children.exists():
print(parent)
# Parent1
# Parent4
filter
:for parent in Parent.objects.filter(children__isnull=False).distinct():
print(parent)
# Parent1
# Parent4
exclude
:for parent in Parent.objects.exclude(children__isnull=True):
print(parent)
# Parent1
# Parent4
annotate
and exclude
:for parent in Parent.objects.annotate(children_count=Count("children")).exclude(children_count=0):
print(parent)
# Parent1
# Parent4
Which of these solutions is the fastest? Is there another approach that's even faster / more readable? I'm seeing a django Exists function but it doesn't appear to be applicable for this use case.
Upvotes: 2
Views: 342
Reputation: 6930
This is a rather open-ended question...
First, as with all optimisation:
A couple of approaches:
Measure all the variants, on the same data as the baseline. It's the only way to be sure.
Django provides facilities to count (or log) all queries made; have you tried using these? In most cases, the fewer queries the faster, especially if the database is (or will be) on its own server.
Once you have the queries, check them on the database, using EXPLAIN or SHOWPLAN, making sure that they're using relevant indexes as appropriate. You'll need real or realistic data for this. (Optimising database queries is really a separate question; you don't even say what database you're using in this question...)
Finally, when you've achieved your objectives, stop. Don't keep optimising beyond the point of diminishing returns, or past the point of no business impact.
Upvotes: 0
Reputation: 476584
The .prefetch_related(…)
will (likely) not help, since the .exists()
will not use the prefetching, but make an exists query, and thus resulting in an N+1 problem.
You can simply filter on the fact that there is a non-NULL
child, and use .distinct()
to retrieve each parent once:
Parent.objects.filter(children__isnull=False).distinct()
You can also try with an Exists
subquery:
from django.db.models import Exists, OuterRef
# since Django-3.0
Parent.objects.filter(Exists(
ParentChild.objects.filter(parent_id=OuterRef('pk'))
))
before django-3.0, you can use .annotate(…)
, but that likely makes it less efficient:
from django.db.models import Exists, OuterRef
# before Django-3.0
Parent.objects.annotate(has_children=Exists(
ParentChild.objects.filter(parent_id=OuterRef('pk'))
)).filter(has_children=True)
The exact performance however depends on the databases, and therefore it is probably better to benchmark the queries. It also depends sometimes on the specific database system: a MySQL database can have different benchmarks than a PostgreSQL database.
Upvotes: 2