ZX2XZX2

Reputation: 43

Efficiency problem querying postgresql table

I have the following PostgreSQL table with about 67 million rows, which stores the EOD prices for all the US stocks starting in 1985:

                       Table "public.eods"
 Column |         Type          | Collation | Nullable | Default 
--------+-----------------------+-----------+----------+---------
 stk    | character varying(16) |           | not null | 
 dt     | date                  |           | not null | 
 o      | integer               |           | not null | 
 hi     | integer               |           | not null | 
 lo     | integer               |           | not null | 
 c      | integer               |           | not null | 
 v      | integer               |           |          | 
Indexes:
    "eods_pkey" PRIMARY KEY, btree (stk, dt)
    "eods_dt_idx" btree (dt)

I would like to query efficiently the table above based on either the stock name or the date. The primary key of the table is stock name and date. I have also defined an index on the date column, hoping to improve performance for queries that retrieve all the records for a specific date.

Unfortunately, I see a big difference in performance for the queries below. While getting all the records for a specific stock takes a decent amount of time to complete (2 seconds), getting all the records for a specific date takes much longer (about 56 seconds). I have tried to analyze these queries using explain analyze, and I have got the results below:

explain analyze select * from eods where stk='MSFT';
                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on eods  (cost=169.53..17899.61 rows=4770 width=36) (actual time=207.218..2142.215 rows=8364 loops=1)
   Recheck Cond: ((stk)::text = 'MSFT'::text)
   Heap Blocks: exact=367
   ->  Bitmap Index Scan on eods_pkey  (cost=0.00..168.34 rows=4770 width=0) (actual time=187.844..187.844 rows=8364 loops=1)
         Index Cond: ((stk)::text = 'MSFT'::text)
 Planning Time: 577.906 ms
 Execution Time: 2143.101 ms
(7 rows)
explain analyze select * from eods where dt='2010-02-22';
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using eods_dt_idx on eods  (cost=0.56..25886.45 rows=7556 width=36) (actual time=40.047..56963.769 rows=8143 loops=1)
   Index Cond: (dt = '2010-02-22'::date)
 Planning Time: 67.876 ms
 Execution Time: 56970.499 ms
(4 rows)

I really cannot understand why the second query runs 28 times slower than the first query. They retrieve a similar number of records, and both seem to be using an index. So could somebody please explain this difference in performance, and what can I do to improve the performance of queries that retrieve all the records for a specific date?

Upvotes: 4

Views: 65

Answers (3)

Pavel Stehule

Reputation: 45770

There can be several issues, so it is hard to say where the problem is. An index scan should usually be faster than a bitmap heap scan; if it is not, one of the following problems may be the cause:

  • an unhealthy (bloated) index - try running REINDEX INDEX indexname
  • bad statistics - try running ANALYZE tablename
  • a suboptimal state of the table - try running VACUUM tablename
  • a too low or too high setting of effective_cache_size
  • I/O issues - some systems have problems with heavy random I/O; try increasing random_page_cost

Investigating the cause is a little bit of alchemy, but it is doable, because there is only a small set of likely issues. A good start is:

  • VACUUM ANALYZE tablename
  • benchmarking your I/O if possible (for example with bonnie++)
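Applied to the table from the question, the maintenance steps above might look like this (eods and eods_dt_idx are the names from the question; whether each command actually helps depends on which of the causes above applies):

```sql
VACUUM ANALYZE eods;          -- clean up dead tuples and refresh planner statistics
REINDEX INDEX eods_dt_idx;    -- rebuild the date index in case it is bloated
SHOW effective_cache_size;    -- inspect the current setting before changing it
SHOW random_page_cost;        -- same for the random I/O cost estimate
```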

Upvotes: 2

Laurenz Albe

Reputation: 246308

To find the difference, you'll probably have to run EXPLAIN (ANALYZE, BUFFERS) on the query so that you see how many blocks are touched and where they come from.

I can think of two reasons:

  1. Bad statistics that make PostgreSQL believe that dt has a high correlation while it does not. If the correlation is low, a bitmap index scan is often more efficient.

    To see if that is the problem, run

    ANALYZE eods;
    

    and see if that changes the execution plans chosen.

  2. Caching effects: perhaps the first query finds all required blocks already cached, while the second doesn't.
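One way to check the planner's view of the physical ordering is the correlation column in the pg_stats view (a quick sketch using the table from the question; values near 1 or -1 mean the column follows the physical row order, values near 0 mean the rows are scattered):

```sql
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'eods'
  AND attname IN ('stk', 'dt');
```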

At any rate, it might be worth experimenting to see if a bitmap index scan would be cheaper for the second query:

SET enable_indexscan = off;

Then repeat the query.
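For example, using the slow query from the question (RESET restores the planner default afterwards so the setting does not leak into later queries in the session):

```sql
SET enable_indexscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM eods WHERE dt = '2010-02-22';

RESET enable_indexscan;
```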

Upvotes: 1

Gordon Linoff

Reputation: 1269613

I would guess that this has to do with the data layout. I am guessing that you are loading the data by stk, so the rows for a given stk are on a handful of pages that pretty much only contain that stk.

So, the execution engine is only reading about 25 pages.

On the other hand, no single page contains two records for the same date. When you read by date, you have to read about 7,556 pages. That is, about 300 times the number of pages.

The comparison must also account for the work of loading and reading the index. That work should be about the same for the two queries, so the observed ratio is less than a factor of 300.
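Plugging the actual numbers from the EXPLAIN output in the question into this reasoning gives a back-of-the-envelope estimate (the per-page figures come from the plans shown above; this is a sketch of the argument, not a measurement):

```python
# Numbers taken from the EXPLAIN ANALYZE output in the question.
rows_stk = 8364        # rows returned for stk = 'MSFT'
heap_pages_stk = 367   # "Heap Blocks: exact=367" from the bitmap heap scan

rows_dt = 8143         # rows returned for dt = '2010-02-22'
# If the data was loaded stock by stock, no heap page holds two rows for the
# same date, so the date query touches roughly one heap page per row:
heap_pages_dt = rows_dt

ratio = heap_pages_dt / heap_pages_stk
print(f"~{heap_pages_stk} pages vs ~{heap_pages_dt} pages "
      f"(about {ratio:.0f}x more random page reads)")
```

With these figures the date query reads on the order of twenty times more heap pages, and each of those reads is a random I/O, which is consistent with the large difference in execution time.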

Upvotes: 4
