user202987
user202987

Reputation: 1246

PostgreSQL - Slow query joining on a VIEW

I'm trying to do a simple join between a table (players) and view (player_main_colors):

SELECT P.*, C.main_color FROM players P
    OUTER LEFT JOIN player_main_colors C USING (player_id)
    WHERE P.user_id=1;

This query is taking ~40 ms.

Here I'm using a nested SELECT on the VIEW instead of the JOIN:

SELECT player_id, main_color FROM player_main_colors
    WHERE player_id IN (
        SELECT player_id FROM players WHERE user_id=1);

This query is also taking ~40 ms.

When I split the query into its 2 pieces, it becomes fast as I would have expected:

SELECT player_id FROM players WHERE user_id=1;

SELECT player_id, main_color FROM player_main_colors
    where player_id in (584, 9337, 11669, 12096, 13651,
        13852, 9575, 23388, 14339, 500, 24963, 25630,
        8974, 13048, 11904, 10537, 20362, 9216, 4747, 25045);

These queries take ~0.5 ms each.

So why are the above queries with the JOIN or sub-SELECT so horribly slow and how can I fix it?

Here are some details about my tables and the view:

CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    ...
)

CREATE TABLE players (
    player_id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users (user_id),
    ...
)

CREATE TABLE player_data (
    player_id INTEGER NOT NULL REFERENCES players (player_id),
    game_id INTEGER NOT NULL,
    color INTEGER NOT NULL,
    PRIMARY KEY (player_id, game_id, color),
    active_time INTEGER DEFAULT 0,
    ...
)

CREATE VIEW player_main_colors AS
    SELECT DISTINCT ON (1) player_id, color as main_color
        FROM player_data
        GROUP BY player_id, color
        ORDER BY 1, MAX(active_time) DESC

It seems like it must be a problem with my VIEW...?

Here's an EXPLAIN ANALYZE for the nested SELECT query above:

Merge Semi Join  (cost=1877.59..2118.00 rows=6851 width=8) (actual time=32.946..38.471 rows=25 loops=1)
   Merge Cond: (player_data.player_id = players.player_id)
   ->  Unique  (cost=1733.19..1801.70 rows=13701 width=12) (actual time=32.651..37.209 rows=13419 loops=1)
         ->  Sort  (cost=1733.19..1767.45 rows=13701 width=12) (actual time=32.646..34.918 rows=16989 loops=1)
               Sort Key: player_data.player_id, (max(player_data.active_time))
               Sort Method: external merge  Disk: 376kB
               ->  HashAggregate  (cost=654.79..791.80 rows=13701 width=12) (actual time=13.636..19.051 rows=17311 loops=1)
                     ->  Seq Scan on player_data  (cost=0.00..513.45 rows=18845 width=12) (actual time=0.005..1.801 rows=18845 loops=1)
   ->  Sort  (cost=144.40..144.53 rows=54 width=8) (actual time=0.226..0.230 rows=54 loops=1)
         Sort Key: players.player_id
         Sort Method: quicksort  Memory: 19kB
         ->  Bitmap Heap Scan on players  (cost=4.67..142.85 rows=54 width=8) (actual time=0.035..0.112 rows=54 loops=1)
               Recheck Cond: (user_id = 1)
               ->  Bitmap Index Scan on test  (cost=0.00..4.66 rows=54 width=0) (actual time=0.023..0.023 rows=54 loops=1)
                     Index Cond: (user_id = 1)
 Total runtime: 39.279 ms

As for indexes, I only have 1 relevant one on top of the default ones for my primary keys:

CREATE INDEX player_user_idx ON players (user_id);

I'm currently using PostgreSQL 9.2.9.

Update:

I've reduced the problem below. See the difference between IN (4747) and IN (SELECT 4747).

Slow:

>> explain analyze SELECT * FROM (
          SELECT player_id, color 
            FROM player_data
            GROUP BY player_id, color
            ORDER BY MAX(active_time) DESC
       ) S
       WHERE player_id IN (SELECT 4747);

 Hash Join  (cost=1749.99..1975.37 rows=6914 width=8) (actual time=30.492..34.291 rows=4 loops=1)
   Hash Cond: (player_data.player_id = (4747))
   ->  Sort  (cost=1749.95..1784.51 rows=13827 width=12) (actual time=30.391..32.655 rows=17464 loops=1)
         Sort Key: (max(player_data.active_time))
         Sort Method: external merge  Disk: 376kB
         ->  HashAggregate  (cost=660.71..798.98 rows=13827 width=12) (actual time=12.714..17.249 rows=17464 loops=1)
               ->  Seq Scan on player_data  (cost=0.00..518.12 rows=19012 width=12) (actual time=0.006..1.898 rows=19012 loops=1)
   ->  Hash  (cost=0.03..0.03 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         ->  HashAggregate  (cost=0.02..0.03 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=1)
               ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
 Total runtime: 35.015 ms
(12 rows)

Time: 35.617 ms

Fast:

>> explain analyze SELECT * FROM (
          SELECT player_id, color 
            FROM player_data
            GROUP BY player_id, color
            ORDER BY MAX(active_time) DESC
       ) S
       WHERE player_id IN (4747);

 Subquery Scan on s  (cost=17.40..17.45 rows=4 width=8) (actual time=0.035..0.035 rows=4 loops=1)
   ->  Sort  (cost=17.40..17.41 rows=4 width=12) (actual time=0.034..0.034 rows=4 loops=1)
         Sort Key: (max(player_data.active_time))
         Sort Method: quicksort  Memory: 17kB
         ->  GroupAggregate  (cost=0.00..17.36 rows=4 width=12) (actual time=0.020..0.027 rows=4 loops=1)
               ->  Index Scan using player_data_pkey on player_data  (cost=0.00..17.28 rows=5 width=12) (actual time=0.014..0.019 rows=5 loops=1)
                     Index Cond: (player_id = 4747)
 Total runtime: 0.080 ms
(8 rows)

Time: 0.610 ms

Upvotes: 7

Views: 13991

Answers (2)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656586

You have both GROUP BY and DISTINCT ON in your VIEW definition. That's like shooting a dead man. Simplify:

CREATE VIEW player_main_colors AS
SELECT DISTINCT ON (player_id)
       player_id, color AS main_color
FROM   player_data
ORDER  BY player_id, active_time DESC NULLS LAST;

NULLS LAST is necessary to be equivalent to your original because active_time can be NULL according to your table definition. Should be faster. But there is more. For best performance create these indexes:

CREATE INDEX players_up_idx ON players (user_id, player_id);
CREATE INDEX players_data_pa_idx ON player_data
    (player_id, active_time DESC NULLS LAST, color);

Use DESC NULLS LAST in the index as well to match the sort order of the query. Or you declare the column player_data.active_time as NOT NULL and simplify all.

It's LEFT OUTER JOIN not OUTER LEFT JOIN - or just omit the noise word OUTER:

SELECT *  -- equivalent here to "p.*, c.main_color"
FROM   players p
LEFT   JOIN player_main_colors c USING (player_id)
WHERE  p.user_id = 1;

I'll assume there are lots of rows in player_data for each player_id. And you are only selecting a few player_id. JOIN LATERAL would be fastest for this case, but you need Postgres 9.3 or later for that. In pg 9.2 you can achieve a similar effect with correlated subqueries:

CREATE VIEW player_main_colors AS
SELECT player_id
    , (SELECT color 
       FROM   player_data
       WHERE  player_id = p.player_id
       ORDER  BY active_time DESC NULLS LAST
       LIMIT  1) AS main_color
FROM   players p
ORDER  BY 1;  -- optional

Subtle difference to your original view: this includes players without any entries in player_data. You could try the same query as above based on the new view. But I would not use a view at all. This is probably fastest:

SELECT *
    , (SELECT color 
       FROM   player_data
       WHERE  player_id = p.player_id
       ORDER  BY active_time DESC NULLS LAST
       LIMIT  1) AS main_color
FROM   players p
WHERE  p.user_id = 1;

Detailed explanation:

Upvotes: 9

Mark Roberts
Mark Roberts

Reputation: 460

So, the reason for this behavior is that the query planner has limitations. In the specific bind param case, the query planner is able to make specific plans based on the query it can see and analyze. However, when things happen via joins and subselects, there's much less visibility into what will happen. It makes the optimizer use a more "generic" plan - one that in this case is significantly slower.

The right answer for you appears to be making two selects. Perhaps a better answer might be to denormalize "main_color" onto your player table and update it at regular intervals.

Upvotes: 0

Related Questions