Reputation: 990
EXPLAIN SELECT a.name, m.name FROM Casting c JOIN Movie m ON c.m_id = m.m_id JOIN Actor a ON a.a_id = c.a_id AND c.a_id < 50;
Output
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=26.20..18354.49 rows=1090 width=27) (actual time=0.240..5.603 rows=1011 loops=1)
-> Nested Loop (cost=25.78..12465.01 rows=1090 width=15) (actual time=0.236..4.046 rows=1011 loops=1)
-> Bitmap Heap Scan on casting c (cost=25.35..3660.19 rows=1151 width=8) (actual time=0.229..1.059 rows=1011 loops=1)
Recheck Cond: (a_id < 50)
Heap Blocks: exact=989
-> Bitmap Index Scan on casting_a_id_index (cost=0.00..25.06 rows=1151 width=0) (actual time=0.114..0.114 rows=1011 loops=1)
Index Cond: (a_id < 50)
-> Index Scan using movie_pkey on movie m (cost=0.42..7.64 rows=1 width=15) (actual time=0.003..0.003 rows=1 loops=1011)
Index Cond: (m_id = c.m_id)
-> Index Scan using actor_pkey on actor a (cost=0.42..5.39 rows=1 width=20) (actual time=0.001..0.001 rows=1 loops=1011)
Index Cond: (a_id = c.a_id)
Planning time: 0.334 ms
Execution time: 5.672 ms
(13 rows)
I am trying to understand how query planner works? I am able to understand the process it choose, but I am not getting why ? Can someone explain query optimizer choices (choice of query processing algorithms, join order) in these queries based on parameters like query selectivity and cost models or anything that effects choice? Also why there is use of Recheck Cond, after index scan ?
Upvotes: 2
Views: 772
Reputation: 246383
There are two reasons why there has to be a Bitmap Heap Scan:
PostgreSQL has to check whether the rows found are visible for the current transaction or not. Remember that PostgreSQL keeps old row versions in the table until VACUUM
removes them. This visibility information is not stored in the index.
If work_mem
is not large enough to contain a bitmap with one bit per table row, PostgreSQL uses one bit per table page, which loses some information. The PostgreSQL needs to check the lossy blocks to see which of the rows in the block really satisfy the condition.
You can see this when you use EXPLAIN (ANALYZE, BUFFERS)
, then PostgreSQL will show if there were lossy matches, see this example on rextester:
-> Bitmap Heap Scan on t (cost=177.14..4719.43 rows=9383 width=0)
(actual time=2.130..144.729 rows=10001 loops=1)
Recheck Cond: (val = 10)
Rows Removed by Index Recheck: 738586
Heap Blocks: exact=646 lossy=3305
Buffers: shared hit=1891 read=2090
-> Bitmap Index Scan on t_val_idx (cost=0.00..174.80 rows=9383 width=0)
(actual time=1.978..1.978 rows=10001 loops=1)
Index Cond: (val = 10)
Buffers: shared read=30
I cannot explain the whole of the PostgreSQL optimizer in this answer, but what it does is to try all possible ways to compute the result, estimate how much each one will cost and choose the cheapest plan.
To estimate how big the result set will be, it uses the object definitions and the table statistics, which contain detailed data about how the column values are distributed.
It then calculates how many disk blocks it will have to read sequentially and by random access (I/O cost), and how many tables and index rows and function calls it will have to process (CPU cost) to come up with a grand total. The weights for each of these components in the total can be configured.
Usually the best plan is one that reduces the number of result rows as quickly as possible by applying the most selective condition first. In your case this seems to be casting.a_id < 50
.
Nested loop joins are often preferred if the number of rows in the outer (upper in EXPLAIN
output) table is small.
Upvotes: 1