nanolab
nanolab

Reputation: 337

How to optimize PostgreSQL COUNT GROUP BY query?

I have a table parameters_products with approx 300k records. Is it possible to optimize this query?

SELECT parameter_id AS id,
       COUNT(product_id) AS COUNT
FROM "parameters_products"
WHERE product_id IN
    (SELECT product_id
     FROM parameters_products
     WHERE parameter_id IN ('2'))
GROUP BY parameter_id

Query output:

2;274669

EXPLAIN ANALYZE VERBOSE... output:

HashAggregate  (cost=23628.54..23628.56 rows=2 width=8) (actual time=2231.367..2231.368 rows=1 loops=1)
  Output: parameters_products.parameter_id, count(parameters_products.product_id)
  Group Key: parameters_products.parameter_id
  ->  Hash Semi Join  (cost=9607.86..22256.43 rows=274421 width=8) (actual time=692.586..1893.261 rows=274669 loops=1)
        Output: parameters_products.parameter_id, parameters_products.product_id
        Hash Cond: (parameters_products.product_id = parameters_products_1.product_id)
        ->  Seq Scan on public.parameters_products  (cost=0.00..4356.28 rows=299728 width=8) (actual time=0.025..353.358 rows=299728 loops=1)
              Output: parameters_products.parameter_id, parameters_products.product_id
        ->  Hash  (cost=5105.60..5105.60 rows=274421 width=4) (actual time=692.331..692.331 rows=274669 loops=1)
              Output: parameters_products_1.product_id
              Buckets: 16384  Batches: 4  Memory Usage: 2425kB
              ->  Seq Scan on public.parameters_products parameters_products_1  (cost=0.00..5105.60 rows=274421 width=4) (actual time=0.013..344.656 rows=274669 loops=1)
                    Output: parameters_products_1.product_id
                    Filter: (parameters_products_1.parameter_id = 2)
                    Rows Removed by Filter: 25059
Planning time: 0.279 ms
Execution time: 2231.499 ms

PostgreSQL 9.4.1 and VACUUM is enabled.

Just tried this quesry, but it is slow too:

SELECT pp1.parameter_id,
       count(pp1.product_id)
FROM parameters_products pp1
LEFT JOIN parameters_products pp2 ON pp1.product_id = pp2.product_id
WHERE pp2.parameter_id IN (2)
GROUP BY pp1.parameter_id

--

HashAggregate  (cost=23742.42..23742.44 rows=2 width=8) (actual time=2361.654..2361.654 rows=1 loops=1)
  Output: pp1.parameter_id, count(pp1.product_id)
  Group Key: pp1.parameter_id
  ->  Hash Join  (cost=9607.86..22370.31 rows=274421 width=8) (actual time=715.409..2012.345 rows=274669 loops=1)
        Output: pp1.parameter_id, pp1.product_id
        Hash Cond: (pp1.product_id = pp2.product_id)
        ->  Seq Scan on public.parameters_products pp1  (cost=0.00..4356.28 rows=299728 width=8) (actual time=0.012..360.789 rows=299728 loops=1)
              Output: pp1.parameter_id, pp1.product_id
        ->  Hash  (cost=5105.60..5105.60 rows=274421 width=4) (actual time=715.176..715.176 rows=274669 loops=1)
              Output: pp2.product_id
              Buckets: 16384  Batches: 4  Memory Usage: 2425kB
              ->  Seq Scan on public.parameters_products pp2  (cost=0.00..5105.60 rows=274421 width=4) (actual time=0.009..353.386 rows=274669 loops=1)
                    Output: pp2.product_id
                    Filter: (pp2.parameter_id = 2)
                    Rows Removed by Filter: 25059
Planning time: 0.135 ms
Execution time: 2361.735 ms

Indexes:

CREATE INDEX parameters_products_parameter_id_idx
  ON parameters_products
  USING btree
  (parameter_id);

CREATE INDEX parameters_products_product_id_idx
  ON parameters_products
  USING btree
  (product_id);

CREATE INDEX parameters_products_product_id_parameter_id_idx
  ON parameters_products
  USING btree
  (product_id, parameter_id);

EXPLAIN ANALYZE VERBOSE
SELECT pp1.parameter_id
FROM parameters_products pp1
LEFT JOIN parameters_products pp2 ON pp1.product_id = pp2.product_id

-

Hash Left Join  (cost=9241.88..22699.06 rows=299728 width=4) (actual time=727.683..2080.798 rows=299728 loops=1)
  Output: pp1.parameter_id
  Hash Cond: (pp1.product_id = pp2.product_id)
  ->  Seq Scan on public.parameters_products pp1  (cost=0.00..4324.28 rows=299728 width=8) (actual time=0.031..355.656 rows=299728 loops=1)
        Output: pp1.parameter_id, pp1.product_id
  ->  Hash  (cost=4324.28..4324.28 rows=299728 width=4) (actual time=727.579..727.579 rows=299728 loops=1)
        Output: pp2.product_id
        Buckets: 16384  Batches: 4  Memory Usage: 2644kB
        ->  Seq Scan on public.parameters_products pp2  (cost=0.00..4324.28 rows=299728 width=4) (actual time=0.008..350.797 rows=299728 loops=1)
              Output: pp2.product_id
Planning time: 0.472 ms
Execution time: 2392.582 ms

SET enable_seqscan = OFF;

Decreased the execution time, but not significantly.

Upvotes: 2

Views: 6479

Answers (3)

AlVaz
AlVaz

Reputation: 766

RhodiumToad in #postgresql on freenode recommended a window function like the following. Note this is different than Gordon Linoff's window function by using bool_or instead of sum(case...):

SELECT parameter_id, count(product_id)
FROM
  (SELECT *, bool_or(parameter_id = 2)
   OVER
   (partition by product_id) AS matching
   FROM parameters_products) s
WHERE matching
GROUP BY parameter_id;

RhodiumToad also mentioned that the work_mem parameter may be too small for any query of this scale, whether using a window function, join, or subselect. He recommends increasing the work_mem parameter to avoid sorting routines from spilling to disk.

If either of these help you out, all credit to RhodiumToad.

Upvotes: 0

AlVaz
AlVaz

Reputation: 766

Try:

SELECT pp1.parameter_id AS ID, COUNT(pp1.product_id) AS COUNT
FROM parameters_products pp1
JOIN
  parameters_products pp2
ON
  pp2.parameter_id = 2
AND
  pp1.product_id = pp2.product_id
GROUP BY
  pp1.parameter_id

Moving the filter criteria from your WHERE clause to the ON clause reduces the total number of rows involved in the JOIN. Hopefully this demonstrates the same savings you saw in your comment that brought the execution time below 1s.

Upvotes: 2

Gordon Linoff
Gordon Linoff

Reputation: 1269753

The first thing I would try is replacing the IN with an EXISTS:

SELECT parameter_id AS id,
       COUNT(product_id) AS COUNT
FROM parameters_products pp
WHERE EXISTS (SELECT 1
              FROM parameters_products pp2
              WHERE pp2.product_id = pp.product_id AND
                    pp2.parameter_id = 2
             ) 
GROUP BY parameter_id;

And, be sure you have an index on parameters_products(product_id, parameter_id).

Another idea is to use window functions:

select parameter_id, count(*)
from (select pp.*,
             sum(case when pp.parameter_id = 2 then 1 else 0 end) over (partition by product_id) as cnt2
      from parameters_products pp
     ) pp
where cnt2 > 0
group by parameter_id;

Upvotes: 3

Related Questions