lbollar

Reputation: 1034

Slow query in postgres using count distinct

My goal is to create a query that returns a count of unique customers who have purchased within a 365-day window. I created the query below in Postgres, and the resulting queries are very slow. My table has 812,024 rows of just order dates and customer IDs. When I remove the DISTINCT, the query returns a result in about 60 seconds; with it, I have yet to see it finish. I created an index on (order_date, id). I am a complete newbie to SQL, this is really the first time I have ever done anything with it, and after trying to find a solution to this problem all day, I could not find anything I could get to work, even though I have seen a lot about the slow performance of DISTINCT.

SELECT
    (d1.Ordered) AS Ordered,
    COUNT(distinct d2.ID) Users
FROM
(
    SELECT order_date AS Ordered
    FROM orders
    GROUP BY order_date
) d1 
INNER JOIN
(
    SELECT order_date AS Ordered, id
    FROM orders
) d2
ON d2.Ordered BETWEEN d1.Ordered - 364 AND d1.Ordered
GROUP BY d1.Ordered
ORDER BY d1.Ordered

"Sort  (cost=3541596.30..3541596.80 rows=200 width=29)"
"  Sort Key: orders_1.order_date"
"  ->  HashAggregate  (cost=3541586.66..3541588.66 rows=200 width=29)"
"        ->  Nested Loop  (cost=16121.73..3040838.52 rows=100149627 width=29)"
"              ->  HashAggregate  (cost=16121.30..16132.40 rows=1110 width=4)"
"                    ->  Seq Scan on orders orders_1  (cost=0.00..14091.24 rows=812024 width=4)"
"              ->  Index Only Scan using x on orders  (cost=0.43..1822.70 rows=90225 width=29)"
"                    Index Cond: ((order_date >= (orders_1.order_date - 364)) AND (order_date <= orders_1.order_date))"

Upvotes: 2

Views: 1061

Answers (3)

Erwin Brandstetter

Reputation: 656221

Assuming actual date types.

SELECT d.day, count(distinct o.id) AS users_past_year
FROM  (
   SELECT generate_series(min(order_date), max(order_date), '1 day')::date AS day
   FROM   orders         -- single query
   ) d
LEFT JOIN (              -- fold duplicates on same day right away
   SELECT id, order_date
   FROM   orders
   GROUP  BY 1,2
   ) o ON o.order_date >  d.day - interval '1 year' -- exclude
      AND o.order_date <= d.day                     -- include
GROUP  BY 1
ORDER  BY 1;

Folding multiple purchases from the same user on the same day first only makes sense if that's a common thing. Else it will be faster to omit that step and simply left-join to the table orders instead.
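For example, the variant without the de-duplication step would simply join to the table directly (same result, one subquery fewer):

```sql
-- Same output; skips the per-day fold and joins orders directly.
SELECT d.day, count(DISTINCT o.id) AS users_past_year
FROM  (
   SELECT generate_series(min(order_date), max(order_date), '1 day')::date AS day
   FROM   orders
   ) d
LEFT JOIN orders o ON o.order_date >  d.day - interval '1 year'
                  AND o.order_date <= d.day
GROUP  BY 1
ORDER  BY 1;
```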

It's rather odd that orders.id would be the ID of the user. Should be named something like user_id.

If you are not comfortable with generate_series() in the SELECT list (which works just fine), you can replace that with a LATERAL JOIN in Postgres 9.3+.

FROM  (SELECT min(order_date) AS a
            , max(order_date) AS z FROM orders) x
    , generate_series(x.a, x.z, '1 day') AS d(day)
LEFT JOIN ...

Note that day is type timestamp in this case. Works the same. You may want to cast.
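A sketch of that cast applied in the SELECT list (the rest of the query is unchanged):

```sql
-- generate_series(timestamp, timestamp, interval) yields timestamp;
-- cast to date in the output if that's the type you want.
SELECT d.day::date AS day, count(DISTINCT o.id) AS users_past_year
FROM  (SELECT min(order_date) AS a
            , max(order_date) AS z FROM orders) x
    , generate_series(x.a, x.z, '1 day') AS d(day)
LEFT JOIN orders o ON o.order_date >  d.day - interval '1 year'
                  AND o.order_date <= d.day
GROUP  BY 1
ORDER  BY 1;
```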

General performance tips

I understand this is a read-only table for a single user. This simplifies things.
You already seem to have an index:

CREATE INDEX orders_mult_idx ON orders (order_date, id);

That's good.

Some things to try:

Basics

Of course, the usual performance advice applies:
https://wiki.postgresql.org/wiki/Slow_Query_Questions
https://wiki.postgresql.org/wiki/Performance_Optimization

Streamline table

Cluster your table using this index once:

CLUSTER orders USING orders_mult_idx;

This should help a bit. It also effectively runs VACUUM FULL on the table, which removes any dead rows and compacts the table if applicable.

Better statistics

ALTER TABLE orders ALTER COLUMN order_date SET STATISTICS 1000;
ANALYZE orders;

Allocate more RAM

Make sure you have ample resources allocated. In particular for shared_buffers and work_mem. You can do this temporarily for your session.
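For instance, per session (the value here is a placeholder; size it to your machine):

```sql
-- Session-local; does not persist past disconnect. Example value only.
SET work_mem = '256MB';   -- more memory for sorts and hash aggregates
-- shared_buffers cannot be changed per session; it requires editing
-- postgresql.conf and restarting the server.
```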

Experiment with planner methods

Try disabling nested loops (enable_nestloop) (in your session only). Maybe hash joins are faster. (I would be surprised, though.)

SET enable_nestloop = off;
-- test ...

RESET enable_nestloop;

Temporary table

Since this seems to be a "temporary table" by nature, you could try making it an actual temporary table held in RAM only. You need enough RAM to allocate enough temp_buffers, and the setting must take effect before the first use of temporary tables in the session.

Be sure to run ANALYZE manually. Temp tables are not covered by autovacuum.
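A sketch of that approach (table and index names here are placeholders):

```sql
SET temp_buffers = '500MB';   -- must run before the session first touches a temp table
CREATE TEMP TABLE orders_tmp AS
SELECT order_date, id FROM orders;
CREATE INDEX orders_tmp_idx ON orders_tmp (order_date, id);
ANALYZE orders_tmp;           -- temp tables are not covered by autovacuum
```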

Upvotes: 0

Clodoaldo Neto

Reputation: 125204

No need for the self-join, use generate_series

select
    g.order_date::date as "Ordered",
    count(distinct o.id) as "Users"
from
    generate_series(
        (select min(order_date) from orders),
        (select max(order_date) from orders),
        '1 day'
    ) g (order_date)
    left join
    -- generate_series returns timestamp here; cast to date so "- 364" works
    orders o on o.order_date between g.order_date::date - 364 and g.order_date
group by 1
order by 1

Upvotes: 2

Craig Ringer

Reputation: 324265

You haven't shown your schema, so some guesswork here. Change column names etc as appropriate.

SELECT 
  count(DISTINCT users.user_id)
FROM users
INNER JOIN orders ON (users.user_id = orders.user_id)
WHERE orders.order_date > current_date - INTERVAL '1' YEAR;

or

SELECT 
  count(users.user_id)
FROM users
INNER JOIN orders ON (users.user_id = orders.user_id)
WHERE orders.order_date > current_date - INTERVAL '1' YEAR
GROUP BY users.user_id;

Upvotes: 1
