Reputation: 337
I have a table parameters_products with approx 300k records. Is it possible to optimize this query?
SELECT parameter_id AS id,
COUNT(product_id) AS COUNT
FROM "parameters_products"
WHERE product_id IN
(SELECT product_id
FROM parameters_products
WHERE parameter_id IN ('2'))
GROUP BY parameter_id
Query output:
2;274669
EXPLAIN ANALYZE VERBOSE... output:
HashAggregate (cost=23628.54..23628.56 rows=2 width=8) (actual time=2231.367..2231.368 rows=1 loops=1)
Output: parameters_products.parameter_id, count(parameters_products.product_id)
Group Key: parameters_products.parameter_id
-> Hash Semi Join (cost=9607.86..22256.43 rows=274421 width=8) (actual time=692.586..1893.261 rows=274669 loops=1)
Output: parameters_products.parameter_id, parameters_products.product_id
Hash Cond: (parameters_products.product_id = parameters_products_1.product_id)
-> Seq Scan on public.parameters_products (cost=0.00..4356.28 rows=299728 width=8) (actual time=0.025..353.358 rows=299728 loops=1)
Output: parameters_products.parameter_id, parameters_products.product_id
-> Hash (cost=5105.60..5105.60 rows=274421 width=4) (actual time=692.331..692.331 rows=274669 loops=1)
Output: parameters_products_1.product_id
Buckets: 16384 Batches: 4 Memory Usage: 2425kB
-> Seq Scan on public.parameters_products parameters_products_1 (cost=0.00..5105.60 rows=274421 width=4) (actual time=0.013..344.656 rows=274669 loops=1)
Output: parameters_products_1.product_id
Filter: (parameters_products_1.parameter_id = 2)
Rows Removed by Filter: 25059
Planning time: 0.279 ms
Execution time: 2231.499 ms
PostgreSQL 9.4.1 and VACUUM is enabled.
Just tried this quesry, but it is slow too:
SELECT pp1.parameter_id,
count(pp1.product_id)
FROM parameters_products pp1
LEFT JOIN parameters_products pp2 ON pp1.product_id = pp2.product_id
WHERE pp2.parameter_id IN (2)
GROUP BY pp1.parameter_id
--
HashAggregate (cost=23742.42..23742.44 rows=2 width=8) (actual time=2361.654..2361.654 rows=1 loops=1)
Output: pp1.parameter_id, count(pp1.product_id)
Group Key: pp1.parameter_id
-> Hash Join (cost=9607.86..22370.31 rows=274421 width=8) (actual time=715.409..2012.345 rows=274669 loops=1)
Output: pp1.parameter_id, pp1.product_id
Hash Cond: (pp1.product_id = pp2.product_id)
-> Seq Scan on public.parameters_products pp1 (cost=0.00..4356.28 rows=299728 width=8) (actual time=0.012..360.789 rows=299728 loops=1)
Output: pp1.parameter_id, pp1.product_id
-> Hash (cost=5105.60..5105.60 rows=274421 width=4) (actual time=715.176..715.176 rows=274669 loops=1)
Output: pp2.product_id
Buckets: 16384 Batches: 4 Memory Usage: 2425kB
-> Seq Scan on public.parameters_products pp2 (cost=0.00..5105.60 rows=274421 width=4) (actual time=0.009..353.386 rows=274669 loops=1)
Output: pp2.product_id
Filter: (pp2.parameter_id = 2)
Rows Removed by Filter: 25059
Planning time: 0.135 ms
Execution time: 2361.735 ms
Indexes:
CREATE INDEX parameters_products_parameter_id_idx
ON parameters_products
USING btree
(parameter_id);
CREATE INDEX parameters_products_product_id_idx
ON parameters_products
USING btree
(product_id);
CREATE INDEX parameters_products_product_id_parameter_id_idx
ON parameters_products
USING btree
(product_id, parameter_id);
EXPLAIN ANALYZE VERBOSE
SELECT pp1.parameter_id
FROM parameters_products pp1
LEFT JOIN parameters_products pp2 ON pp1.product_id = pp2.product_id
-
Hash Left Join (cost=9241.88..22699.06 rows=299728 width=4) (actual time=727.683..2080.798 rows=299728 loops=1)
Output: pp1.parameter_id
Hash Cond: (pp1.product_id = pp2.product_id)
-> Seq Scan on public.parameters_products pp1 (cost=0.00..4324.28 rows=299728 width=8) (actual time=0.031..355.656 rows=299728 loops=1)
Output: pp1.parameter_id, pp1.product_id
-> Hash (cost=4324.28..4324.28 rows=299728 width=4) (actual time=727.579..727.579 rows=299728 loops=1)
Output: pp2.product_id
Buckets: 16384 Batches: 4 Memory Usage: 2644kB
-> Seq Scan on public.parameters_products pp2 (cost=0.00..4324.28 rows=299728 width=4) (actual time=0.008..350.797 rows=299728 loops=1)
Output: pp2.product_id
Planning time: 0.472 ms
Execution time: 2392.582 ms
SET enable_seqscan = OFF;
Decreased the execution time, but not significantly.
Upvotes: 2
Views: 6479
Reputation: 766
RhodiumToad in #postgresql on freenode recommended a window function like the following. Note this is different than Gordon Linoff's window function by using bool_or instead of sum(case...):
SELECT parameter_id, count(product_id)
FROM
(SELECT *, bool_or(parameter_id = 2)
OVER
(partition by product_id) AS matching
FROM parameters_products) s
WHERE matching
GROUP BY parameter_id;
RhodiumToad also mentioned that the work_mem parameter may be too small for any query of this scale, whether using a window function, join, or subselect. He recommends increasing the work_mem parameter to avoid sorting routines from spilling to disk.
If either of these help you out, all credit to RhodiumToad.
Upvotes: 0
Reputation: 766
Try:
SELECT pp1.parameter_id AS ID, COUNT(pp1.product_id) AS COUNT
FROM parameters_products pp1
JOIN
parameters_products pp2
ON
pp2.parameter_id = 2
AND
pp1.product_id = pp2.product_id
GROUP BY
pp1.parameter_id
Moving the filter criteria from your WHERE clause to the ON clause reduces the total number of rows involved in the JOIN. Hopefully this demonstrates the same savings you saw in your comment that brought the execution time below 1s.
Upvotes: 2
Reputation: 1269753
The first thing I would try is replacing the IN
with an EXISTS
:
SELECT parameter_id AS id,
COUNT(product_id) AS COUNT
FROM parameters_products pp
WHERE EXISTS (SELECT 1
FROM parameters_products pp2
WHERE pp2.product_id = pp.product_id AND
pp2.parameter_id = 2
)
GROUP BY parameter_id;
And, be sure you have an index on parameters_products(product_id, parameter_id)
.
Another idea is to use window functions:
select parameter_id, count(*)
from (select pp.*,
sum(case when pp.parameter_id = 2 then 1 else 0 end) over (partition by product_id) as cnt2
from parameters_products pp
) pp
where cnt2 > 0
group by parameter_id;
Upvotes: 3