Zac
Zac

Reputation: 481

Django ORM and SQL inner joins

I am trying to get all Horse objects which fall within a specific from_date and to_date range on a related listing object. eg.

Horse.objects.filter(listings__to_date__lt=to_date.datetime,
listings__from_date__gt=from_date.datetime)

Now as I understand this database query creates an inner join which then enables me to find all my horse objects based on the related listing dates.

My question is how this exactly works, it probably comes down to a major lack of understanding in how inner joins actually work. Would this query need to first 'check' each and ever horse object first to ascertain whether or not it has a related listing object? I'd imagine this could prove to be quite inefficient because you might have 5million horse objects with no related listing object yet you still would have to check each and every one first?

Alternatively I could start with my Listings and do something like this first:

Listing.objects.filter(to_date__lt=to_date.datetime, 
from_date__gt=from_date.datetime)

And then:

for listing in listing_objs:
    if listing.horse:
        horses.append(horse)

But this seems like a rather odd way of achieving my results too.

If anyone could help me understand how queries work in Django and which is the most efficient way to go about doing such a query it would be a great help!

This is my current model setup:

class Listing(models.Model):

    to_date = models.DateTimeField(null=True, blank=True)
    from_date = models.DateTimeField(null=True, blank=True)
    promoted_to_date = models.DateTimeField(null=True, blank=True)
    promoted_from_date = models.DateTimeField(null=True, blank=True)

    # Relationships
    horse = models.ForeignKey('Horse', related_name='listings', null=True, blank=True)

class Horse(models.Model):
    created_date = models.DateTimeField(null=True, blank=True, auto_now=True)
    type = models.CharField(max_length=200, null=True, blank=True)
    name = models.CharField(max_length=200, null=True, blank=True)
    age = models.IntegerField(null=True, blank=True)
    colour = models.CharField(max_length=200, null=True, blank=True)
    height = models.IntegerField(null=True, blank=True)

Upvotes: 3

Views: 2465

Answers (1)

Josh Smeaton
Josh Smeaton

Reputation: 48720

The way you write your query really depends on what information you want back most of the time. If you are interested in the horses, then query from Horse. If you're interested in listings then you should query from Listing. That's generally the correct thing to do, especially when you're working with simple foreign keys.

Your first query is probably the better one with regards to Django. I've used slightly simpler models to illustrate the differences. I've created an active field rather than using datetimes.

In [18]: qs = Horse.objects.filter(listings__active=True)

In [19]: print(qs.query)
SELECT 
"scratch_horse"."id", 
"scratch_horse"."name" 
FROM "scratch_horse" 
INNER JOIN "scratch_listing" 
ON ( "scratch_horse"."id" = "scratch_listing"."horse_id" ) 
WHERE "scratch_listing"."active" = True

The inner join in the query above will ensure that you only get horses that have a listing. (Most) databases are very good at using joins and indexes to filter out unwanted rows.

If Listing was very small, and Horse was rather large, then I would hope the database would only look at the Listing table, and then use an index to fetch the correct parts of Horse without doing a full table scan (inspecting every horse). You will need to run the query and check what your database is doing though. EXPLAIN (or whatever database you use) is extremely useful. If you're guessing what the database is doing, you're probably wrong.

Note that if you need to access the listings of each horse then you'll be executing another query each time you access horse.listings. prefetch_related can help you if you need to access listings, by executing a single query and storing it in cache.

Now, your second query:

In [20]: qs = Listing.objects.filter(active=True).select_related('horse')

In [21]: print(qs.query)
SELECT 
"scratch_listing"."id", 
"scratch_listing"."active", 
"scratch_listing"."horse_id", 
"scratch_horse"."id", 
"scratch_horse"."name" 
FROM "scratch_listing" 
LEFT OUTER JOIN "scratch_horse" 
ON ( "scratch_listing"."horse_id" = "scratch_horse"."id" ) 
WHERE "scratch_listing"."active" = True

This does a LEFT join, which means that the right hand side can contain NULL. The right hand side is Horse in this instance. This would perform very poorly if you had a lot of listings without a Horse, because it would bring back every single active listing, whether or not a horse was associated with it. You could fix that with .filter(active=True, horse__isnull=False) though.

See that I've used select_related, which joins the tables so that you're able to access listing.horse without incurring another query.

Now I should probably ask why all your fields are nullable. That's usually a terrible design choice, especially for ForeignKeys. Will you ever have a listing that's not associated with a horse? If not, get rid of the null. Will you ever have a horse that won't have a name? If not, get rid of the null.

So the answer is, do what seems natural most of the time. If you know a particular table is going to be large, then you must inspect the query planner (EXPLAIN), look into adding/using indexes on filter/join conditions, or querying from the other side of the relation.

Upvotes: 1

Related Questions