Reputation: 5107
I have a Postgres table with anonymous_id
(string) and timestamp
(datetime) columns created by Segment.com when users visit our website.
There are ~5M rows, ~1M distinct anonymous_id
I'd like to query the number of distinct anonymous_ids found per month.
I have this so far, which works, but timesout in PSequel (I can run it several times and restrict the date)
SELECT count(1), "month"
FROM (
SELECT DISTINCT anonymous_id,
date_trunc('month', "timestamp") as "month"
FROM pages
-- WHERE "timestamp" between '2018-01-01' and '2018-02-01'
) as dt
GROUP BY 2
ORDER BY 2
I have an index on both anonymous_id and timestamp
Results of EXPLAIN ANALYSE
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1667977.72..1667978.22 rows=200 width=8) (actual time=115861.803..115861.807 rows=27 loops=1)
Sort Key: (date_trunc('month'::text, pages."timestamp"))
Sort Method: quicksort Memory: 26kB
-> HashAggregate (cost=1667968.07..1667970.07 rows=200 width=8) (actual time=115861.763..115861.766 rows=27 loops=1)
Group Key: (date_trunc('month'::text, pages."timestamp"))
-> Unique (cost=1554502.82..1592324.57 rows=5042900 width=45) (actual time=97492.062..115468.396 rows=1158934 loops=1)
-> Sort (cost=1554502.82..1567110.07 rows=5042900 width=45) (actual time=97492.060..113983.496 rows=5042900 loops=1)
Sort Key: pages.anonymous_id, (date_trunc('month'::text, pages."timestamp"))
Sort Method: external merge Disk: 285936kB
-> Seq Scan on pages (cost=0.00..682820.25 rows=5042900 width=45) (actual time=0.088..25601.944 rows=5042900 loops=1)
Planning time: 10.335 ms
Execution time: 115910.353 ms
(12 rows)
Current Indexes (including combined index as suggested by Thorsten Kettner below)
Indexes:
"pages_pkey" PRIMARY KEY, btree (id)
"idx_anonymous_id" btree (anonymous_id)
"idx_date_trunc_anon_id" btree (date_trunc('month'::text, timezone('UTC'::text, "timestamp")), anonymous_id)
"idx_path" btree (path)
"idx_timestamp" btree ("timestamp")
"idx_url" btree (url)
"idx_user_id" btree (user_id)
"pages_activity_type_idx" btree (activity_type)
Upvotes: 3
Views: 1782
Reputation: 94939
You want an index that starts with your group by
expression. Ideally it also contains the anonymous_id
:
CREATE INDEX idx1 ON pages( date_trunc('month', "timestamp") , anonymous_id);
If you want a WHERE
clause then you need its criteria first:
CREATE INDEX idx2 ON pages( "timestamp" , date_trunc('month', "timestamp") , anonymous_id);
As has been shown by Greg Viers your query is more complicated than needed. I'd go with his one. The index needed is the same for both queries.
Upvotes: 0
Reputation: 3523
Only thing I can think of is get rid of the derived table
, since you don't need it:
SELECT count(distinct anonymous_id), date_trunc('month', "timestamp") AS "month"
FROM pages
GROUP BY date_trunc('month', "timestamp")
ORDER BY date_trunc('month', "timestamp");
Upvotes: 1