Reputation: 1977
I have a table call_logs that contains id, device_id, and timestamp columns, along with some other fields. I am trying to write a query that returns the last call for each device, together with whether it is working. Currently my query is this:
SELECT DISTINCT ON (device_id) c.device_id, c.timestamp, c.working, c.id
FROM call_logs c
ORDER BY c.device_id, c.timestamp desc;
and it returns the information I want. But my production table is now getting rather large: around 6,000,000 records.
I have added an index to this table:
CREATE INDEX cl_device_timestamp
ON public.call_logs USING btree
(device_id, timestamp DESC, id, working)
TABLESPACE pg_default;
But I am getting what I consider to be a very slow time. Here is an EXPLAIN ANALYSE of the query:
EXPLAIN ANALYSE SELECT DISTINCT ON (device_id) c.device_id, c.timestamp, c.working, c.id
FROM call_logs c
ORDER BY c.device_id, c.timestamp desc;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.56..363803.37 rows=120 width=25) (actual time=0.069..2171.201 rows=124 loops=1)
-> Index Only Scan using cl_device_timestamp on call_logs c (cost=0.56..347982.87 rows=6328197 width=25) (actual time=0.067..1594.953 rows=6331024 loops=1)
Heap Fetches: 8051
Planning time: 0.184 ms
Execution time: 2171.281 ms
(5 rows)
I only have 124 unique device_ids, so I would not have expected this to be slow with the index in place. Any ideas what is going wrong, or why it is so slow?
Upvotes: 1
Views: 705
Reputation: 1977
I ended up doing this:
SELECT DISTINCT d.id, c.timestamp, c.id, c.working
FROM devices d
INNER JOIN call_logs c
    ON d.id = c.device_id
   AND c.timestamp = (SELECT max(t.timestamp)
                      FROM call_logs t
                      WHERE t.device_id = d.id)
and it ended up being a lot faster:
Unique (cost=607.92..608.06 rows=11 width=25) (actual time=3.291..3.344 rows=117 loops=1)
-> Sort (cost=607.92..607.95 rows=11 width=25) (actual time=3.289..3.310 rows=117 loops=1)
Sort Key: d.id, c."timestamp", c.id, c.working
Sort Method: quicksort Memory: 34kB
-> Nested Loop (cost=1.05..607.73 rows=11 width=25) (actual time=0.057..3.162 rows=117 loops=1)
-> Seq Scan on devices d (cost=0.00..4.18 rows=118 width=8) (actual time=0.006..0.029 rows=119 loops=1)
-> Index Only Scan using cl_device_timestamp on call_logs c (cost=1.05..5.10 rows=1 width=25) (actual time=0.007..0.007 rows=1 loops=119)
Index Cond: ((device_id = d.id) AND ("timestamp" = (SubPlan 2)))
Heap Fetches: 110
SubPlan 2
-> Result (cost=0.48..0.49 rows=1 width=8) (actual time=0.018..0.018 rows=1 loops=119)
InitPlan 1 (returns $1)
-> Limit (cost=0.43..0.48 rows=1 width=8) (actual time=0.017..0.017 rows=1 loops=119)
-> Index Only Scan Backward using test1 on call_logs t (cost=0.43..2674.01 rows=52483 width=8) (actual time=0.017..0.017 rows=1 loops=119)
Index Cond: ((device_id = d.id) AND ("timestamp" IS NOT NULL))
Heap Fetches: 110
Planning time: 0.645 ms
Execution time: 3.461 ms
(18 rows)
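For reference, the same "latest row per device" result can also be expressed with a LATERAL subquery (a sketch, assuming PostgreSQL 9.3+; it uses the same cl_device_timestamp index but avoids the correlated max() subquery, since the per-device top-1 row is fetched directly):

```sql
SELECT d.id AS device_id, c."timestamp", c.id, c.working
FROM devices d
CROSS JOIN LATERAL (
    -- one index probe per device: latest call first
    SELECT "timestamp", id, working
    FROM call_logs
    WHERE device_id = d.id
    ORDER BY "timestamp" DESC
    LIMIT 1
) AS c;
```

Each device row drives a single backward index scan with LIMIT 1, which is the same access pattern the planner chose for the correlated-subquery version above.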
Upvotes: 1
Reputation: 51649
Your index is on four columns, not one. You can't estimate the size and efficiency of a composite index from the data distribution of just one of its four columns.
Next: the fact that you have only 124 distinct devices does not make the index faster. The opposite is true: fewer distinct values split the tree into fewer parts, so each part is bigger. E.g. a bigserial column over a million bigint rows has a million distinct values, and an exact id is found very fast, while an index scan on a boolean column has only two (three, counting NULL) distinct values and thus takes much longer.
Last argument: two seconds is very slow indeed, but considering that you are scanning 6 million rows while comparing timestamps, I'd say 2 seconds becomes quite acceptable.
You can sacrifice some OLTP speed and create a trigger that, on each data change, saves the latest timestamp (and related values) per device to a separate summary table. Selecting those pre-aggregated values from that short table will then take microseconds for your ~124 devices.
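A minimal sketch of that trigger approach (the table name device_last_call, the function, and the trigger name are all made up; assumes PostgreSQL 9.5+ for INSERT ... ON CONFLICT):

```sql
-- Hypothetical summary table: one row per device with its latest call
CREATE TABLE device_last_call (
    device_id   bigint PRIMARY KEY,
    "timestamp" timestamp NOT NULL,
    call_id     bigint NOT NULL,
    working     boolean
);

CREATE OR REPLACE FUNCTION update_device_last_call() RETURNS trigger AS $$
BEGIN
    -- Upsert, keeping only the newest timestamp per device
    INSERT INTO device_last_call (device_id, "timestamp", call_id, working)
    VALUES (NEW.device_id, NEW."timestamp", NEW.id, NEW.working)
    ON CONFLICT (device_id) DO UPDATE
        SET "timestamp" = EXCLUDED."timestamp",
            call_id     = EXCLUDED.call_id,
            working     = EXCLUDED.working
        WHERE EXCLUDED."timestamp" > device_last_call."timestamp";
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER call_logs_last_call
AFTER INSERT ON call_logs
FOR EACH ROW EXECUTE PROCEDURE update_device_last_call();
```

With this in place, `SELECT * FROM device_last_call` replaces the 6-million-row scan with a read of ~124 rows, at the cost of one extra upsert per insert into call_logs.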
Upvotes: 0