Oscar Mederos
Oscar Mederos

Reputation: 29863

Multi-column performance of IN clause + ORDER BY

I have a table like this:

id | person_id | created_at
---------------------------
0  | 10        | ...
1  | 10        | ...
2  | 11        | ...
3  | 11        | ...
.. | ...       | ... 

and I'm currently performing the following query:

SELECT * FROM table WHERE person_id IN (10,11,12,34,58) ORDER BY created_at DESC LIMIT x OFFSET y;

I basically want the records sorted by created_at, but only the ones corresponding to any of the provided person_id values.

Right I have two separate indices: created_at and person_id, and I've been asking myself the following:

If my query would be WHERE person_id = 10 instead of IN, I'm sure the (person_id, created_at) would do the trick, but I'm not 100% sure in this scenario.

Upvotes: 2

Views: 58

Answers (1)

Radu
Radu

Reputation: 1128

Short answer: An index on (created_at) is most likely the best choice.

Long answer:

  • The query is evaluated by first applying the selection filter over the base relation (i.e., the WHERE person_id IN clause) and then sorting the data on the create_at column.

  • A (created_at, person_id) index is unlikely to help.Such an index sorts the whole dataset on the create_at column and does not allow to efficiently apply the selection on the person_id column. It is true that the query can do a slightly more efficient sequential scan over the index rather than scanning the base relation and that the qualifying data is an already sorted order. However, as you select all attributes (the select * clause), the base table still needs to be accessed to retrieve the id attribute for each result tuple.

  • A (person_id index) is more likely to improve performance, especially if you are interested in only a few person_ids. This is because the selection on person_id can be efficiently applied just by looking up values in the index (no base table or index scan). The benefit of such an index decreases as the selection predicate is less restrictive (as you provide more and more IDs and a larger percentage of the rows passes the where clause filter).

  • A (person_id, created_at) might help even more as the data for each qualifying person_id is already sorted. The final sort operation on create_at can in theory be optimized to exploit the partially sorted result tuples. However, not all database engines support such optimizations and they are not worth if you have only a few results tuples per person_id. The benefit of adding created_at to the index is larger if you have many entries for a given person_id (lets say more than hundreds to thousands).

  • Adding more columns to an index is not always a good idea as you increase the size of the index and its maintenance cost. If create-at is part of an index, an update to this column also triggers an index update on top of the update to the base relation.

Upvotes: 1

Related Questions