Reputation: 145
As part of a school assignment, I have been given this query to optimize on PostgreSQL 9.2:
SELECT C.name, COUNT(DISTINCT I.id) AS NumItems, COUNT(B.id)
FROM Categories C INNER JOIN Items I ON (C.id = I.category)
INNER JOIN Bids B ON (I.id = B.item_id)
GROUP BY C.name
I have created these indexes on the respective tables (sketched below): a secondary B+-tree index on items(category), a secondary B+-tree index on bids(item_id), and the primary-key index on categories(id).
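For reference, a minimal sketch of those index definitions; bid_itemid_idx matches the name that appears in the second plan below, while the items index name is an assumption, and categories(id) is already covered by its primary key:
CREATE INDEX items_category_idx ON items (category);  -- secondary B+-tree on the join column (name assumed)
CREATE INDEX bid_itemid_idx ON bids (item_id);        -- secondary B+-tree on the join column
-- categories(id) is indexed implicitly by its PRIMARY KEY constraint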
The weird part is that PostgreSQL does a sequential scan on my Items, Categories, and Bids tables, and when I set enable_seqscan=off, the index-based plan turns out to be even worse than the result below.
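For reference, the planner toggle mentioned above is a per-session setting; a minimal way to reproduce the comparison looks like this:
SET enable_seqscan = off;   -- discourage sequential scans for this session
-- re-run EXPLAIN (ANALYZE, BUFFERS) on the query here
RESET enable_seqscan;       -- restore the default planner behaviour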
When I run EXPLAIN (ANALYZE, BUFFERS) in PostgreSQL, this is the result (please keep the indentation, as it is important):
GroupAggregate  (cost=119575.55..125576.11 rows=20 width=23) (actual time=6912.523..9459.431 rows=20 loops=1)
  Buffers: shared hit=30 read=12306, temp read=6600 written=6598
  ->  Sort  (cost=119575.55..121075.64 rows=600036 width=23) (actual time=6817.015..8031.285 rows=600036 loops=1)
        Sort Key: c.name
        Sort Method: external merge  Disk: 20160kB
        Buffers: shared hit=30 read=12306, temp read=6274 written=6272
        ->  Hash Join  (cost=9416.95..37376.03 rows=600036 width=23) (actual time=407.974..3322.253 rows=600036 loops=1)
              Hash Cond: (b.item_id = i.id)
              Buffers: shared hit=30 read=12306, temp read=994 written=992
              ->  Seq Scan on bids b  (cost=0.00..11001.36 rows=600036 width=8) (actual time=0.009..870.898 rows=600036 loops=1)
                    Buffers: shared hit=2 read=4999
              ->  Hash  (cost=8522.95..8522.95 rows=50000 width=19) (actual time=407.784..407.784 rows=50000 loops=1)
                    Buckets: 4096  Batches: 2  Memory Usage: 989kB
                    Buffers: shared hit=28 read=7307, temp written=111
                    ->  Hash Join  (cost=1.45..8522.95 rows=50000 width=19) (actual time=0.082..313.211 rows=50000 loops=1)
                          Hash Cond: (i.category = c.id)
                          Buffers: shared hit=28 read=7307
                          ->  Seq Scan on items i  (cost=0.00..7834.00 rows=50000 width=8) (actual time=0.004..144.554 rows=50000 loops=1)
                                Buffers: shared hit=27 read=7307
                          ->  Hash  (cost=1.20..1.20 rows=20 width=19) (actual time=0.062..0.062 rows=20 loops=1)
                                Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                Buffers: shared hit=1
                                ->  Seq Scan on categories c  (cost=0.00..1.20 rows=20 width=19) (actual time=0.004..0.028 rows=20 loops=1)
                                      Buffers: shared hit=1
Total runtime: 9473.257 ms
See this plan on explain.depesz.com.
I just want to know why this happens, i.e. why using the indexes makes the query horrendously slow compared to the sequential scans.
Edit: I think I have managed to uncover a few things by going through the PostgreSQL documentation. PostgreSQL chose a sequential scan on tables such as bids and items because it predicted it would have to retrieve every row in those tables (compare the estimated row count in the parentheses before the actual time with the row count in the actual-time part). A sequential scan is better at retrieving all rows, so nothing can be done about that part.
I have created an extra index on categories(name) (shown below), and the result that follows is what I get. It improved somewhat, but now the hash join is replaced with a nested loop. Any clue why?
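For reference, a minimal sketch of the additional index; its name matches the one that appears in the plan below:
CREATE INDEX categories_name_idx ON categories (name);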
GroupAggregate  (cost=0.00..119552.02 rows=20 width=23) (actual time=617.330..7725.314 rows=20 loops=1)
  Buffers: shared hit=178582 read=37473 written=14, temp read=2435 written=436
  ->  Nested Loop  (cost=0.00..115051.55 rows=600036 width=23) (actual time=0.120..6186.496 rows=600036 loops=1)
        Buffers: shared hit=178582 read=37473 written=14, temp read=2109 written=110
        ->  Nested Loop  (cost=0.00..26891.55 rows=50000 width=19) (actual time=0.066..2827.955 rows=50000 loops=1)
              Join Filter: (c.id = i.category)
              Rows Removed by Join Filter: 950000
              Buffers: shared hit=2 read=7334 written=1, temp read=2109 written=110
              ->  Index Scan using categories_name_idx on categories c  (cost=0.00..12.55 rows=20 width=19) (actual time=0.039..0.146 rows=20 loops=1)
                    Buffers: shared hit=1 read=1
              ->  Materialize  (cost=0.00..8280.00 rows=50000 width=8) (actual time=0.014..76.908 rows=50000 loops=20)
                    Buffers: shared hit=1 read=7333 written=1, temp read=2109 written=110
                    ->  Seq Scan on items i  (cost=0.00..7834.00 rows=50000 width=8) (actual time=0.007..170.464 rows=50000 loops=1)
                          Buffers: shared hit=1 read=7333 written=1
        ->  Index Scan using bid_itemid_idx on bids b  (cost=0.00..1.60 rows=16 width=8) (actual time=0.016..0.036 rows=12 loops=50000)
              Index Cond: (item_id = i.id)
              Buffers: shared hit=178580 read=30139 written=13
Total runtime: 7726.392 ms
Have a look at the plan here to see whether it is better.
I have managed to reduce the cost to 114062.92 by creating indexes on categories(id) and items(category); PostgreSQL used both of them to reach that cost.
However, now PostgreSQL is playing games with me by not using the index! Why is it so buggy?
Upvotes: 6
Views: 1498
Reputation: 66721
Note that if the size of bids is systematically and significantly larger than items, then it may actually be cheaper to traverse items twice (especially so if items fits in RAM) than to pick off those distinct item IDs from the join result (even if you sort in-memory). Moreover, depending on how the Postgres pipeline happens to pull data from the duplicate tables, there may be limited penalty even under adverse load or memory conditions (this would be a good question to ask on pgsql-general). Take:
SELECT name, IC.cnt, BC.cnt
FROM Categories C,
     ( SELECT category, count(1) cnt FROM Items I GROUP BY category ) IC,
     ( SELECT category, count(1) cnt FROM Bids B INNER JOIN Items I ON (I.id = B.item_id) GROUP BY category ) BC
WHERE IC.category = C.id AND BC.category = C.id;
How much cheaper? At least 4x given sufficient caching, i.e. 610ms vs. 2500ms (in-memory sort) with 20 categories, 50k items and 600k bids, and still more than 2x faster after a filesystem cache flush on my machine.
Note that the above is not a direct substitute for your original query. For one, it assumes that there is a 1:1 mapping between category IDs and names (which could turn out to be a very reasonable assumption; if not, simply SUM(BC.cnt) and SUM(IC.cnt) as you GROUP BY name), but more importantly the per-category item count includes items that have no bids, unlike your original INNER JOIN. If you only want to count items that have bids, you can add WHERE EXISTS (SELECT 1 FROM Bids B WHERE item_id = I.id) in the IC subquery; this will traverse Bids a second time as well (in my case that added a ~200ms penalty to the existing ~600ms plan, still well under 2400ms). A sketch of that variant follows.
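For illustration, a minimal sketch of that variant under the same assumptions: it is the query above with the quoted EXISTS predicate added to the IC subquery.
SELECT name, IC.cnt, BC.cnt
FROM Categories C,
     ( SELECT category, count(1) cnt
       FROM Items I
       WHERE EXISTS (SELECT 1 FROM Bids B WHERE B.item_id = I.id)  -- count only items with at least one bid
       GROUP BY category ) IC,
     ( SELECT category, count(1) cnt
       FROM Bids B INNER JOIN Items I ON (I.id = B.item_id)
       GROUP BY category ) BC
WHERE IC.category = C.id AND BC.category = C.id;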
Upvotes: 0
Reputation: 2973
From the query plans we can see that:
1. The result and categories both have 20 records.
2. Items with a category are 5% of all Items ("Rows Removed by Join Filter: 950000" versus "rows=50000" in the sequential scan).
3. The bids matched are rows=600036 (could you give us the total number of bids?).
4. Does every category have a bid?
So we want to use the indexes on items(category) and bids(item_id), and we also want the sort to fit in memory.
select
    (select name from Categories where id = foo.category) as name,
    count(foo.id),
    sum(foo.bids_count)
from
    (select
         id,
         category,
         (select count(item_id) from Bids where item_id = i.id) as bids_count
     from Items i
     where category in (select id from Categories)
       and exists (select 1 from Bids where item_id = i.id)
    ) as foo
group by foo.category
order by name
Of course you have to remember that this strictly depends on the data characteristics noted in points 1 and 2. If point 4 is true, you can remove the exists part from the query.
Any advice or ideas?
Upvotes: 0
Reputation: 324971
Thank you for posting EXPLAIN output without being asked, and for using EXPLAIN (BUFFERS, ANALYZE).
A significant part of your query's performance issue is likely to be the outer sort plan node, which is doing an on-disk merge sort with a temporary file:
Sort Method: external merge Disk: 20160kB
You could do this sort in memory by setting:
SET work_mem = '50MB';
before running your query. This setting can also be set per-user, per-database, or globally in postgresql.conf.
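For illustration, hedged examples of the more persistent scopes mentioned above; the role and database names are placeholders:
ALTER ROLE some_user SET work_mem = '50MB';      -- per-user default
ALTER DATABASE some_db SET work_mem = '50MB';    -- per-database default
-- globally: set work_mem in postgresql.conf and reload the server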
I'm not convinced that adding indexes will be of much benefit as the query is currently structured. It needs to read and join all rows from all three tables, and hash joins are likely to be the fastest way to do so.
I suspect there are other ways to express that query that will use entirely different and more efficient execution strategies, but I'm having a brain-fade about what they might be and don't want to spend the time to make up dummy tables to play around. More work_mem should significantly improve the query as it stands.
Upvotes: 1