Dejell

Reputation: 14317

Improve running time of SQL query

I have the following table structure:

AdPerformance
   id
   ad_id
   impressions

AdActions
   app_starts

Ad
  id
  name
  parent_id

AdTargeting
  id
  targeting_
  ad_id

Targeting
  id
  name
  value

AdProduct
  id
  ad_id
  name

I need to aggregate the data by targeting, restricted to a product name, so I wrote the following query:

 SELECT ad_performance.ad_id, targeting.value AS targeting_value, 
     sum(impressions) AS impressions, 
     sum(app_starts) AS app_starts
 FROM ad_performance
     LEFT JOIN ad on ad.id = ad_performance.ad_id
     LEFT JOIN ad_actions ON ad_performance.id = ad_actions.ad_performance_id
     RIGHT JOIN (
        SELECT ad_id, value from targeting, ad_targeting 
        WHERE targeting.id = ad_targeting.id and targeting.name = 'gender' 
     ) targeting ON targeting.ad_id = ad.parent_id
WHERE ad_performance.ad_id IN 
       (SELECT ad_id FROM ad_product WHERE product = 'iphone')
GROUP BY ad_performance.ad_id, targeting_value

However, under EXPLAIN ANALYZE the above query takes about 5 s for ~1M records.

Is there a way to improve it?

I do have indexes on the foreign keys.

UPDATED

See the output of EXPLAIN ANALYZE:

                                                                                                                                                                                                          QUERY PLAN                                                                                                     
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=5787.28..5789.87 rows=259 width=254) (actual time=3283.763..3286.015 rows=5998 loops=1)
   Group Key: adobject_performance.ad_id, targeting.value
   Buffers: shared hit=3400223
   ->  Nested Loop Left Join  (cost=241.63..5603.63 rows=8162 width=254) (actual time=10.438..2774.664 rows=839720 loops=1)
         Buffers: shared hit=3400223
         ->  Nested Loop  (cost=241.21..1590.52 rows=8162 width=250) (actual time=10.412..703.818 rows=839720 loops=1)
               Join Filter: (adobject.id = adobject_performance.ad_id)
               Buffers: shared hit=36755
               ->  Hash Join  (cost=240.78..323.35 rows=9 width=226) (actual time=10.380..20.332 rows=5998 loops=1)
                     Hash Cond: (ad_product.ad_id = ad.id)
                     Buffers: shared hit=190
                     ->  HashAggregate  (cost=128.98..188.96 rows=5998 width=4) (actual time=3.788..6.821 rows=5998 loops=1)
                           Group Key: ad_product.ad_id
                           Buffers: shared hit=39
                           ->  Seq Scan on ad_product  (cost=0.00..113.99 rows=5998 width=4) (actual time=0.011..1.726 rows=5998 loops=1)
                                 Filter: ((product)::text = 'ft2_iPhone'::text)
                                 Rows Removed by Filter: 1
                                 Buffers: shared hit=39
                     ->  Hash  (cost=111.69..111.69 rows=9 width=222) (actual time=6.578..6.578 rows=5998 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 241kB
                           Buffers: shared hit=151
                           ->  Hash Join  (cost=30.26..111.69 rows=9 width=222) (actual time=0.154..4.660 rows=5998 loops=1)
                                 Hash Cond: (adobject.parent_id = adobject_targeting.ad_id)
                                 Buffers: shared hit=151
                                 ->  Seq Scan on adobject  (cost=0.00..77.97 rows=897 width=8) (actual time=0.009..1.449 rows=6001 loops=1)
                                       Buffers: shared hit=69
                                 ->  Hash  (cost=30.24..30.24 rows=2 width=222) (actual time=0.132..0.132 rows=2 loops=1)
                                       Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                       Buffers: shared hit=82
                                       ->  Nested Loop  (cost=0.15..30.24 rows=2 width=222) (actual time=0.101..0.129 rows=2 loops=1)
                                             Buffers: shared hit=82
                                             ->  Seq Scan on targeting  (cost=0.00..13.88 rows=2 width=222) (actual time=0.015..0.042 rows=79 loops=1)
                                                   Filter: (name = 'age group'::targeting_name)
                                                   Rows Removed by Filter: 82
                                                   Buffers: shared hit=1
                                             ->  Index Scan using advertising_targeting_pkey on adobject_targeting  (cost=0.15..8.17 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=79)
                                                   Index Cond: (id = targeting.id)
                                                   Buffers: shared hit=81
               ->  Index Scan using "fki_advertising_peformance_advertising_entity_id -> advertising" on adobject_performance  (cost=0.42..89.78 rows=4081 width=32) (actual time=0.007..0.046 rows=140 loops=5998)
                     Index Cond: (ad_id = ad_product.ad_id)
                     Buffers: shared hit=36565
         ->  Index Scan using facebook_advertising_actions_pkey on facebook_adobject_actions  (cost=0.42..0.48 rows=1 width=12) (actual time=0.001..0.002 rows=1 loops=839720)
               Index Cond: (ad_performance.id = ad_performance_id)
               Buffers: shared hit=3363468
 Planning time: 1.525 ms
 Execution time: 3287.324 ms
(46 rows)

Upvotes: 0

Views: 92

Answers (3)

Laurenz Albe

Reputation: 246023

The execution plan does not seem to match the query any more (maybe you can update the query).

However, the problem now is here:

->  Hash Join  (cost=30.26..111.69 rows=9 width=222)
               (actual time=0.154..4.660 rows=5998 loops=1)
      Hash Cond: (adobject.parent_id = adobject_targeting.ad_id)
      Buffers: shared hit=151
      ->  Seq Scan on adobject  (cost=0.00..77.97 rows=897 width=8)
                                (actual time=0.009..1.449 rows=6001 loops=1)
            Buffers: shared hit=69
      ->  Hash  (cost=30.24..30.24 rows=2 width=222)
                (actual time=0.132..0.132 rows=2 loops=1)
            Buckets: 1024  Batches: 1  Memory Usage: 1kB
            Buffers: shared hit=82
            ->  Nested Loop  (cost=0.15..30.24 rows=2 width=222)
                             (actual time=0.101..0.129 rows=2 loops=1)
                  Buffers: shared hit=82
                  ->  Seq Scan on targeting  (cost=0.00..13.88 rows=2 width=222)
                                             (actual time=0.015..0.042 rows=79 loops=1)
                        Filter: (name = 'age group'::targeting_name)
                        Rows Removed by Filter: 82
                        Buffers: shared hit=1
                  ->  Index Scan using advertising_targeting_pkey on adobject_targeting
                                             (cost=0.15..8.17 rows=1 width=8)
                                             (actual time=0.001..0.001 rows=0 loops=79)
                        Index Cond: (id = targeting.id)
                        Buffers: shared hit=81

This is a join between adobject and the result of

targeting JOIN adobject_targeting
   USING (id)
WHERE targeting.name = 'age group'

The latter subquery is correctly estimated to 2 rows, but PostgreSQL fails to notice that almost all rows found in adobject will match one of those two rows, so that the result of the join will be 6000 rather than the 9 it estimates.

This causes the optimizer to wrongly choose a nested loop join later on, where more than half of the query time is spent.

Unfortunately, since PostgreSQL doesn't have cross-table statistics, there is no way for PostgreSQL to know better.

One coarse measure is to SET enable_nestloop=off, but that will deteriorate the performance of the other (correctly chosen) nested loop join, so I don't know if it will be a net win. If that helps, you could consider changing the parameter only for the duration of the query (with a transaction and SET LOCAL).
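A minimal sketch of that transaction-scoped variant (SET LOCAL reverts automatically at COMMIT or ROLLBACK, so only this one query is affected):

```sql
BEGIN;
-- Discourage nested loop joins for this transaction only;
-- the setting reverts automatically when the transaction ends.
SET LOCAL enable_nestloop = off;

SELECT ...;  -- the aggregation query from the question, unchanged

COMMIT;
```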

Maybe there is a way to rewrite the query so that a better plan can be found, but that is hard to say without knowing the exact query.

Upvotes: 1

RMH

Reputation: 222

I don't know if this query will solve your problem, but try it:

 SELECT ad_performance.ad_id, targeting.value AS targeting_value, 
     sum(impressions) AS impressions, 
     sum(app_starts) AS app_starts
 FROM ad_performance
     LEFT JOIN ad on ad.id = ad_performance.ad_id
     LEFT JOIN ad_actions ON ad_performance.id = ad_actions.ad_performance_id
     RIGHT JOIN ad_targeting on ad_targeting.ad_id = ad.parent_id
     INNER JOIN targeting on  targeting.id = ad_targeting.id and targeting.name = 'gender'   
     INNER JOIN ad_product on ad_product.ad_id = ad_performance.ad_id
WHERE ad_product.product = 'iphone'
GROUP BY ad_performance.ad_id, targeting_value

Perhaps you could also create indexes on all the columns that appear in the ON or WHERE conditions.
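For example (the index names are illustrative; check with EXPLAIN which indexes the planner actually uses and drop any that go unused):

```sql
-- Columns used in join conditions
CREATE INDEX idx_ad_performance_ad_id ON ad_performance (ad_id);
CREATE INDEX idx_ad_actions_perf_id   ON ad_actions (ad_performance_id);
CREATE INDEX idx_ad_targeting_ad_id   ON ad_targeting (ad_id);
-- Columns used in filters
CREATE INDEX idx_targeting_name       ON targeting (name);
CREATE INDEX idx_ad_product_product   ON ad_product (product, ad_id);
```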

Upvotes: 1

Georgi Raychev

Reputation: 1334

Blindly shooting here, since the EXPLAIN output was not available at the time of writing, but Postgres should treat this query better if you move the targeting subquery into a CTE:

WITH targeting AS 
(
        SELECT ad_id, value from targeting, ad_targeting 
        WHERE targeting.id = ad_targeting.id and targeting.name = 'gender' 
)
SELECT ad_performance.ad_id, targeting.value AS targeting_value, 
     sum(impressions) AS impressions, 
     sum(app_starts) AS app_starts
FROM ad_performance
     LEFT JOIN ad on ad.id = ad_performance.ad_id
     LEFT JOIN ad_actions ON ad_performance.id = ad_actions.ad_performance_id
     RIGHT JOIN  targeting ON targeting.ad_id = ad.parent_id
WHERE ad_performance.ad_id IN 
       (SELECT ad_id FROM ad_product WHERE product = 'iphone')
GROUP BY ad_performance.ad_id, targeting_value

Taken from the Documentation:

A useful property of WITH queries is that they are evaluated only once per execution of the parent query, even if they are referred to more than once by the parent query or sibling WITH queries. Thus, expensive calculations that are needed in multiple places can be placed within a WITH query to avoid redundant work. Another possible application is to prevent unwanted multiple evaluations of functions with side-effects.
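Note that on PostgreSQL 12 and later the planner may inline a CTE into the parent query, which would remove the optimization fence this answer relies on. To keep the once-only evaluation described in the quoted passage, the CTE can be marked explicitly (sketch only; the rest of the query stays as above):

```sql
WITH targeting AS MATERIALIZED (
    SELECT ad_targeting.ad_id, targeting.value
    FROM targeting
    JOIN ad_targeting ON targeting.id = ad_targeting.id
    WHERE targeting.name = 'gender'
)
SELECT ...  -- rest of the query unchanged
```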

Upvotes: 2
