Verbal_Kint
Verbal_Kint

Reputation: 1416

Query efficiency

This seemingly simple query (1 join) takes many hours to run even though the table containes less than 1.5 million rows...

I have Product items which have a one-to-many relationship with RetailerProduct items and I would like to find all Product items whose related RetailerProducts do not contain any instances of retailer_id=1.

There are about 1.5 million rows in Product, and about 1.1 million rows in RetailerProduct with retailer_id=1 (2.9 million in total in RetailerProduct)

Models:

class Product(models.Model):
    ...
    upc = models.CharField(max_length=96, unique=True)
    ...

class RetailerProduct(models.Model):
    ...
    product = models.ForeignKey('project.Product',
                                related_name='retailer_offerings',
                                on_delete=models.CASCADE,
                                null=True)
    ...

    class Meta:
        unique_together = (("retailer", "retailer_product_id", "retailer_sku"),)

Query:

Product.objects.exclude(
   retailer_offerings__retailer_id=1).values_list('upc', flat=True)

Generated SQL:

SELECT "project_product"."upc" FROM "project_product" 
 WHERE NOT ("project_product"."id" IN 
  (SELECT U1."product_id" AS Col1 FROM "project_retailerproduct" U1 
    WHERE (U1."retailer_id" = 1 AND U1."product_id" IS NOT NULL))
  )

Running that query takes hours. An EXPLAIN in the psql shell renders:

 QUERY PLAN                                             
---------------------------------------------------------------------------------------------------
 Seq Scan on project_product  (cost=0.00..287784596160.17 rows=725892 width=13)
   Filter: (NOT (SubPlan 1))
   SubPlan 1
     ->  Materialize  (cost=0.00..393961.19 rows=998211 width=4)
           ->  Seq Scan on project_retailerproduct u1  (cost=0.00..385070.14 rows=998211 width=4)
                 Filter: ((product_id IS NOT NULL) AND (retailer_id = 1))
(6 rows)

I wanted to post the EXPLAIN ANALYZE but it's still running.

Why is the cost so high for Seq Scan on project_product? Any optimization suggestions?

Upvotes: 1

Views: 112

Answers (1)

e4c5
e4c5

Reputation: 53734

1.1 million rows in RetailerProduct with retailer_id=1 (2.9 million in total in RetailerProduct)

You are selecting 1.1 million rows out of 2.9 million rows. Even if you had an index on retailer_id it wouldn't be much use here. You are looking at almost half the table here. This will need a full table scan.

Then let us recall that WHERE NOT IN type queries are generally slow. In your case you are comparing the product_id column against 1.1 million rows. Having done that you are actually fetching the rows, which probably amounts to several hundred thousand rows. You might want to consider a LIMIT but even then the query probably wouldn't be a whole lot faster.

So this is not a query that can easily be optimized. You might want to use a completely different query. Here is an example raw query

SELECT "project_product"."upc" FROM "project_product" LEFT JOIN
 (SELECT product_id FROM "project_retailerproduct" 
  WHERE retailer_id = 1)
AS retailer 
ON project_product.id = retailer.product_id WHERE
WHERE retailer.product_id IS NULL

Upvotes: 3

Related Questions