Reputation: 29863
I have a table like this:
id | person_id | created_at
---------------------------
0 | 10 | ...
1 | 10 | ...
2 | 11 | ...
3 | 11 | ...
.. | ... | ...
and I'm currently performing the following query:
SELECT * FROM table WHERE person_id IN (10,11,12,34,58) ORDER BY created_at DESC LIMIT x OFFSET y;
I basically want the records sorted by created_at
, but only the ones corresponding to any of the provided person_id
values.
Right I have two separate indices: created_at
and person_id
, and I've been asking myself the following:
(created_at, person_id)
? I'm trying to visualize how it would work, and I think it will still do a sequential scan (i.e. it has the data sorted by `created_at, and it will go record by record collecting the ones that match)(person_id, created_at)
instead?If my query would be WHERE person_id = 10
instead of IN
, I'm sure the (person_id, created_at)
would do the trick, but I'm not 100% sure in this scenario.
Upvotes: 2
Views: 58
Reputation: 1128
Short answer: An index on (created_at
) is most likely the best choice.
Long answer:
The query is evaluated by first applying the selection filter over the base relation (i.e., the WHERE person_id IN
clause) and then sorting the data on the create_at
column.
A (created_at, person_id) index is unlikely to help.Such an index sorts the whole dataset on the create_at
column and does not allow to efficiently apply the selection on the person_id
column. It is true that the query can do a slightly more efficient sequential scan over the index rather than scanning the base relation and that the qualifying data is an already sorted order. However, as you select all attributes (the select *
clause), the base table still needs to be accessed to retrieve the id
attribute for each result tuple.
A (person_id
index) is more likely to improve performance, especially if you are interested in only a few person_ids. This is because the selection on person_id
can be efficiently applied just by looking up values in the index (no base table or index scan). The benefit of such an index decreases as the selection predicate is less restrictive (as you provide more and more IDs and a larger percentage of the rows passes the where clause filter).
A (person_id, created_at) might help even more as the data for each qualifying person_id
is already sorted. The final sort operation on create_at
can in theory be optimized to exploit the partially sorted result tuples. However, not all database engines support such optimizations and they are not worth if you have only a few results tuples per person_id. The benefit of adding created_at
to the index is larger if you have many entries for a given person_id (lets say more than hundreds to thousands).
Adding more columns to an index is not always a good idea as you increase the size of the index and its maintenance cost. If create-at
is part of an index, an update to this column also triggers an index update on top of the update to the base relation.
Upvotes: 1