Oscar Mederos
Oscar Mederos

Reputation: 29803

What indexes do I need to speed up AND/OR SQL queries

Let's assume I have a table named customer like this:

+----+------+----------+-----+
| id | name | lastname | age |
+----+------+----------+-----+
| .. | ...  |   ....   | ... |

and I need to perform the following query:

SELECT * FROM customer WHERE ((name = 'john' OR lastname = 'doe') AND age = 21)

I'm aware of how single and multi-column indexes work, so I created these ones:

(name, age)
(lastname, age)

Is that all the indexes I need?

The above condition can be rephrased as:

... WHERE ((name = 'john' AND age = 21) OR (lastname = 'doe' AND age = 21)

but I'm not sure how smart RDBMS are, and if those indexes are the correct ones

Upvotes: 1

Views: 92

Answers (2)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656241

Your approach is reasonable. Two factors are essential here:

  1. Postgres can combine multiple indexes very efficiently with bitmap index scans.

  2. B-tree index usage is by far most effective when only leading columns of the index are involved.

Test case

If you don't have enough data to measure tests, you can always whip up a quick test case like this:

CREATE TABLE customer (id int, name text, lastname text, age int);

INSERT INTO customer
SELECT g
     , left(md5('foo'::text || g%500) , 3 + ((g%5)^2)::int)
     , left(md5('bar'::text || g%1000), 5 + ((g%5)^2)::int)
     , ((random()^2) * 100)::int
FROM   generate_series(1, 30000) g; -- 30k rows for quick test case

For your query (reformatted):

SELECT *
FROM   customer
WHERE (name = 'john' OR lastname = 'doe')
AND    age = 21;

I would go with

CREATE INDEX customer_age_name_idx ON customer (age, name);
CREATE INDEX customer_age_lastname_idx ON customer (age, lastname);

However, depending on many factors, a single index with all three columns and age as first may be able to deliver similar performance. The rule of thumb is to create as few indexes as possible and as many as necessary.

CREATE INDEX customer_age_lastname_name_idx ON customer (age, lastname, name);

The check on (age, name) is potentially slower in this case, but depending on selectivity of the first column it may not matter much.

Updated SQL Fiddle.

Why age first in the index?

This is not very important and needs deeper understanding to explain. But since you ask ...

The order of columns doesn't matter for the 2-column indexes customer_age_name_idx and customer_age_lastname_idx. Details and a test-case:

I still put age first to stay consistent with the 3rd index I suggested customer_age_lastname_name_idx, where the order of columns does matter in multiple ways:

Most importantly, both your predicates (age, name) and (age, lastname) share the column age. B-tree indexes are (by far) most effective on leading columns, so putting age first benefits both.

And, less importantly, but still relevant: the size of the index is smaller this way due to data type characteristics, alignment, padding and page layout of index pages.

age is a 4-byte integer and must be aligned at multiples of 4 bytes in the data page. text is of variable length and has no alignment restrictions. Putting the integer first or last is more efficient due to the rules of "column tetris". I added another index on (lastname, age, name) (age in the middle!) to the fiddle just to demonstrate it's ~ 10 % bigger. No space lost to additional padding, which results in a smaller index. And size matters.

For the same reasons it would be better to reorder columns in the demo table like this: (id, age, name, lastname). If you want to learn why, start here:

Everything I wrote is for the case at hand. If you have other queries / other requirements, the resulting strategy may change.

UNION query equivalent?

Note that a UNION query may or may not return the same result. It folds duplicate rows, which your original does not. Even if you don't have complete duplicates in your table, you may still see this effect with a subset of columns in the SELECT list. Do not blindly substitute with a UNION query. It's not going to be faster anyway.

Upvotes: 1

Ricardo Peres
Ricardo Peres

Reputation: 14525

Turn the OR into two queries UNIONed:

SELECT * FROM Customer WHERE Age = 21 AND Name = 'John'
UNION
SELECT * FROM Customer WHERE Age = 21 AND LastName = 'Doe'

Then create an index over (Age, Name) and another over (Age, LastName).

Upvotes: 0

Related Questions