Reputation: 14317
I have the following table structure:
AdPerformance
id
ad_id
impressions
Targeting
value
AdActions
app_starts
Ad
id
name
parent_id
AdTargeting
id
targeting_
ad_id
Targeting
id
name
value
AdProduct
id
ad_id
name
I need to aggregate the data by targeting with restriction to product name , so I wrote the following query:
SELECT ad_performance.ad_id, targeting.value AS targeting_value,
sum(impressions) AS impressions,
sum(app_starts) AS app_starts
FROM ad_performance
LEFT JOIN ad on ad.id = ad_performance.ad_id
LEFT JOIN ad_actions ON ad_performance.id = ad_actions.ad_performance_id
RIGHT JOIN (
SELECT ad_id, value from targeting, ad_targeting
WHERE targeting.id = ad_targeting.id and targeting.name = 'gender'
) targeting ON targeting.ad_id = ad.parent_id
WHERE ad_performance.ad_id IN
(SELECT ad_id FROM ad_product WHERE product = 'iphone')
GROUP BY ad_performance.ad_id, targeting_value
However the above query in ANALYZE
command takes about 5s for ~1M records.
Is there a way to improve it?
I do have indexes on foreign keys
UPDATED
See output of ANALYZE
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=5787.28..5789.87 rows=259 width=254) (actual time=3283.763..3286.015 rows=5998 loops=1)
Group Key: adobject_performance.ad_id, targeting.value
Buffers: shared hit=3400223
-> Nested Loop Left Join (cost=241.63..5603.63 rows=8162 width=254) (actual time=10.438..2774.664 rows=839720 loops=1)
Buffers: shared hit=3400223
-> Nested Loop (cost=241.21..1590.52 rows=8162 width=250) (actual time=10.412..703.818 rows=839720 loops=1)
Join Filter: (adobject.id = adobject_performance.ad_id)
Buffers: shared hit=36755
-> Hash Join (cost=240.78..323.35 rows=9 width=226) (actual time=10.380..20.332 rows=5998 loops=1)
Hash Cond: (ad_product.ad_id = ad.id)
Buffers: shared hit=190
-> HashAggregate (cost=128.98..188.96 rows=5998 width=4) (actual time=3.788..6.821 rows=5998 loops=1)
Group Key: ad_product.ad_id
Buffers: shared hit=39
-> Seq Scan on ad_product (cost=0.00..113.99 rows=5998 width=4) (actual time=0.011..1.726 rows=5998 loops=1)
Filter: ((product)::text = 'ft2_iPhone'::text)
Rows Removed by Filter: 1
Buffers: shared hit=39
-> Hash (cost=111.69..111.69 rows=9 width=222) (actual time=6.578..6.578 rows=5998 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 241kB
Buffers: shared hit=151
-> Hash Join (cost=30.26..111.69 rows=9 width=222) (actual time=0.154..4.660 rows=5998 loops=1)
Hash Cond: (adobject.parent_id = adobject_targeting.ad_id)
Buffers: shared hit=151
-> Seq Scan on adobject (cost=0.00..77.97 rows=897 width=8) (actual time=0.009..1.449 rows=6001 loops=1)
Buffers: shared hit=69
-> Hash (cost=30.24..30.24 rows=2 width=222) (actual time=0.132..0.132 rows=2 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
Buffers: shared hit=82
-> Nested Loop (cost=0.15..30.24 rows=2 width=222) (actual time=0.101..0.129 rows=2 loops=1)
Buffers: shared hit=82
-> Seq Scan on targeting (cost=0.00..13.88 rows=2 width=222) (actual time=0.015..0.042 rows=79 loops=1)
Filter: (name = 'age group'::targeting_name)
Rows Removed by Filter: 82
Buffers: shared hit=1
-> Index Scan using advertising_targeting_pkey on adobject_targeting (cost=0.15..8.17 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=79)
Index Cond: (id = targeting.id)
Buffers: shared hit=81
-> Index Scan using "fki_advertising_peformance_advertising_entity_id -> advertising" on adobject_performance (cost=0.42..89.78 rows=4081 width=32) (actual time=0.007..0.046 rows=140 loops=5998)
Index Cond: (ad_id = ad_product.ad_id)
Buffers: shared hit=36565
-> Index Scan using facebook_advertising_actions_pkey on facebook_adobject_actions (cost=0.42..0.48 rows=1 width=12) (actual time=0.001..0.002 rows=1 loops=839720)
Index Cond: (ad_performance.id = ad_performance_id)
Buffers: shared hit=3363468
Planning time: 1.525 ms
Execution time: 3287.324 ms
(46 rows)
Upvotes: 0
Views: 92
Reputation: 246023
The execution plan does not seem to match the query any more (maybe you can update the query).
However, the problem now is here:
-> Hash Join (cost=30.26..111.69 rows=9 width=222)
(actual time=0.154..4.660 rows=5998 loops=1)
Hash Cond: (adobject.parent_id = adobject_targeting.ad_id)
Buffers: shared hit=151
-> Seq Scan on adobject (cost=0.00..77.97 rows=897 width=8)
(actual time=0.009..1.449 rows=6001 loops=1)
Buffers: shared hit=69
-> Hash (cost=30.24..30.24 rows=2 width=222)
(actual time=0.132..0.132 rows=2 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
Buffers: shared hit=82
-> Nested Loop (cost=0.15..30.24 rows=2 width=222)
(actual time=0.101..0.129 rows=2 loops=1)
Buffers: shared hit=82
-> Seq Scan on targeting (cost=0.00..13.88 rows=2 width=222)
(actual time=0.015..0.042 rows=79 loops=1)
Filter: (name = 'age group'::targeting_name)
Rows Removed by Filter: 82
Buffers: shared hit=1
-> Index Scan using advertising_targeting_pkey on adobject_targeting
(cost=0.15..8.17 rows=1 width=8)
(actual time=0.001..0.001 rows=0 loops=79)
Index Cond: (id = targeting.id)
Buffers: shared hit=81
This is a join between adobject
and the result of
targeting JOIN adobject_targeting
USING (id)
WHERE targeting.name = 'age group'
The latter subquery is correctly estimated to 2 rows, but PostgreSQL fails to notice that almost all rows found in adobject
will match one of those two rows, so that the result of the join will be 6000 rather than the 9 it estimates.
This causes the optimizer to wrongly choose a nested loop join later on, where more than half of the query time is spent.
Unfortunately, since PostgreSQL doesn't have cross-table statistics, there is no way for PostgreSQL to know better.
One coarse measure is to SET enable_nestloop=off
, but that will deteriorate the performance of the other (correctly chosen) nested loop join, so I don't know if it will be a net win.
If that helps, you could consider changing the parameter only for the duration of the query (with a transaction and SET LOCAL
).
Maybe there is a way to rewrite the query so that a better plan can be found, but that is hard to say without knowing the exact query.
Upvotes: 1
Reputation: 222
I dont know if this query will solve your problem, but try it:
SELECT ad_performance.ad_id, targeting.value AS targeting_value,
sum(impressions) AS impressions,
sum(app_starts) AS app_starts
FROM ad_performance
LEFT JOIN ad on ad.id = ad_performance.ad_id
LEFT JOIN ad_actions ON ad_performance.id = ad_actions.ad_performance_id
RIGHT JOIN ad_targeting on ad_targeting.ad_id = ad.parent_id
INNER JOIN targeting on targeting.id = ad_targeting.id and targeting.name = 'gender'
INNER JOIN ad_product on ad_product.ad_id = ad_performance.ad_id
WHERE ad_product.product = 'iphone'
GROUP BY ad_performance.ad_id, targeting_value
perhaps you would create index on all columns that you are putting in ON or WHERE conditions
Upvotes: 1
Reputation: 1334
Blindly shooting here, as we have not been provided with the result of the EXPLAIN, but still, Postgres should treat this query better if you take out your targeting
table in a CTE:
WITH targeting AS
(
SELECT ad_id, value from targeting, ad_targeting
WHERE targeting.id = ad_targeting.id and targeting.name = 'gender'
)
SELECT ad_performance.ad_id, targeting.value AS targeting_value,
sum(impressions) AS impressions,
sum(app_starts) AS app_starts
FROM ad_performance
LEFT JOIN ad on ad.id = ad_performance.ad_id
LEFT JOIN ad_actions ON ad_performance.id = ad_actions.ad_performance_id
RIGHT JOIN targeting ON targeting.ad_id = ad.parent_id
WHERE ad_performance.ad_id IN
(SELECT ad_id FROM ad_product WHERE product = 'iphone')
GROUP BY ad_performance.ad_id, targeting_value
Taken from the Documentation:
A useful property of WITH queries is that they are evaluated only once per execution of the parent query, even if they are referred to more than once by the parent query or sibling WITH queries. Thus, expensive calculations that are needed in multiple places can be placed within a WITH query to avoid redundant work. Another possible application is to prevent unwanted multiple evaluations of functions with side-effects.
Upvotes: 2