Reputation: 707
How can I optimize a query whose WHERE conditions include a check for user_id = X OR user_id IN (some subquery that might return no results)
In my example below, queries 1 and 2 are both extremely fast (< 1 ms), but query 3, which is simply an OR of the conditions in queries 1 and 2, is much slower (50 ms)
Can somebody please explain why query 3 is so slow, and in general what types of query optimization strategies should I be pursuing to avoid this problem? I realize the subquery in my example could easily be eliminated, but in real life sometimes subqueries seem like the least complicated way to get the data I want.
relevant code and data:
posts data https://dl.dropbox.com/u/4597000/StackOverflow/sanitized_posts.csv
users data https://dl.dropbox.com/u/4597000/StackOverflow/sanitized_users.csv
# from the shell:
# > createdb test
CREATE TABLE posts (
id integer PRIMARY KEY NOT NULL,
created_by_id integer NOT NULL,
created_at integer NOT NULL
);
CREATE INDEX index_posts ON posts (created_by_id, created_at);
CREATE INDEX index_posts_2 ON posts (created_at);
CREATE TABLE users (
id integer PRIMARY KEY NOT NULL,
login varchar(50) NOT NULL
);
CREATE INDEX index_users ON users (login);
COPY posts FROM '/path/to/sanitized_posts.csv' DELIMITERS ',' CSV;
COPY users FROM '/path/to/sanitized_users.csv' DELIMITERS ',' CSV;
-- queries:
-- query 1, fast:
EXPLAIN ANALYZE SELECT * FROM posts WHERE created_by_id = 123 LIMIT 100;
-- query 2, fast:
EXPLAIN ANALYZE SELECT * FROM posts WHERE created_by_id IN (SELECT id FROM users WHERE login = 'nobodyhasthislogin') LIMIT 100;
-- query 3, slow:
EXPLAIN ANALYZE SELECT * FROM posts WHERE created_by_id = 123 OR created_by_id IN (SELECT id FROM users WHERE login = 'nobodyhasthislogin') LIMIT 100;
Upvotes: 2
Views: 4640
Reputation: 2746
Most of the time in this particular query is related to an index scan. Here is a query goes at it from a different angle to avoid this, but should return equivalent results.
SELECT posts.* FROM users JOIN posts on posts.created_by_id=users.id WHERE users.id=123 or login='nobodyhasthislogin'
This selects from the users table, doing the filter once, and then joins posts onto that.
I realize that the question is about tips for optimization, not really this specific query. To answer that, my suggestion is to run EXPLAIN ANALYZE
and read up on interpreting the results, - this answer was helpful to me.
Upvotes: 0
Reputation: 52336
How about:
EXPLAIN ANALYZE
SELECT *
FROM posts
WHERE created_by_id IN (
SELECT 123
UNION ALL
SELECT id FROM
users WHERE
login = 'nobodyhasthislogin') LIMIT 100;
Upvotes: 0
Reputation: 2354
Split the query (edited):
SELECT * FROM (
SELECT * FROM posts p WHERE p.created_by_id = 123
union
SELECT * FROM posts p
WHERE
EXISTS ( SELECT TRUE FROM users WHERE id = p.created_by_id AND login = 'nobodyhasthislogin')
) p
LIMIT 100;
Upvotes: 2