Reputation: 1161
I have a postgres table with ~90M rows. It has the following columns all type text except as noted:
action_id (text - 25 character)
exchangeid (text - 25 character)
flow (text - 25 character)
amt (int 8)
source_rowid (text - 25 character)
dest_rowid (text - 25 character)
hierarchy_level (int8)
date (timestamp)
source (text - 5 character)
destination (text - 5 character)
I recently ran a query to count the distinct values in the 'action_id' column (it should return 90):
SELECT count(DISTINCT action_id) FROM myTable;
When that query failed to complete after half an hour, I decided to index the column and rerun the query, which then took 27 minutes. This surprised me, as I expected a query on an indexed column to be much faster given the table has only 90M rows.
My questions are:
Note that everything I have found so far from googling the topic shows me how to create an index, or tells me that I should, with only a vague promise of 'faster'.
Update: Running the query with EXPLAIN (ANALYZE, BUFFERS) gives
Update 2: running VACUUM reduced the query time to 11 min.
Update 3: the index created for the action_id column was simply:
CREATE INDEX expanded_flows_eng_id ON public.expanded_flows USING btree (action_id);
It is 711 MB, text_ops. It never seems to get used.
Upvotes: 0
Views: 672
Reputation: 164809
action_id is text, alphanumeric string
Your query is already pretty slow with an integer, about 35 seconds. Text will slow things down considerably: it takes longer to compare, and it consumes more disk and memory.
We can improve this schema in a couple ways.
If the set of possible action_id's is fixed you could use an enum. These act like text fields, but they're stored as numbers.
create type actions as enum (
  'sit', 'stand', 'stop', 'drop', 'roll'
);
create table expanded_flows (
action_id actions not null
);
insert into expanded_flows (action_id) values ('stop'), ('drop'), ('roll');
The upside is it requires no change to your existing queries. The downside is the set is fixed. Every time you want to add a new action you need to alter the schema.
alter type actions add value 'scream';
Alternatively, make actions their own table with a simple integer ID. Store the text ID in the table.
create table actions (
  id bigint generated always as identity primary key,
  name text not null
);
create table expanded_flows (
  action_id bigint not null references actions(id)
);
-- the identity column generates ids 1, 2, 3 automatically
insert into actions (name) values ('scream'), ('shout'), ('doubt');
insert into expanded_flows (action_id) values (3), (1), (2);
This is probably the safest and most flexible approach, though it does change how expanded_flows is queried.
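With the lookup table in place, the distinct count compares 8-byte integers instead of 25-byte strings, and the text names come back with a join. A sketch using the hypothetical actions/expanded_flows tables above:

```sql
-- Counting distinct actions now works on integers, not text.
SELECT count(DISTINCT action_id) FROM expanded_flows;

-- Recover the text names (with per-action row counts) via a join.
SELECT a.name, count(*) AS n
FROM expanded_flows ef
JOIN actions a ON a.id = ef.action_id
GROUP BY a.name;
```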
Previous answer before I realized action_id was text.
It's possible the statistics Postgres uses to plan queries are out of date. Bring them up to date with vacuum analyze yourtable.
Postgres might not use the index. If it thinks the column has low cardinality (only a few distinct values), it will assume scanning the whole table sequentially will be faster than scanning the keys of the index.
You can give it a hint by turning off sequential scans.
create table expanded_flows (
action_id integer not null
);
insert into expanded_flows
select a.*
from generate_series(1,90) a
cross join generate_series(1,900000);
create index action_idx on expanded_flows(action_id);
-- before analyzing it will seq scan taking 35 seconds.
explain analyze SELECT count( distinct action_id ) FROM expanded_flows ;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1370908.00..1370908.01 rows=1 width=8) (actual time=35026.029..35026.030 rows=1 loops=1)
-> Seq Scan on expanded_flows (cost=0.00..1168408.00 rows=81000000 width=4) (actual time=0.039..7520.377 rows=81000000 loops=1)
Planning Time: 0.060 ms
Execution Time: 35026.059 ms
vacuum analyze expanded_flows;
-- after analyzing with seqscans on, still seq scans taking 35 seconds.
explain analyze SELECT count( distinct action_id ) FROM expanded_flows ;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1370910.60..1370910.61 rows=1 width=8) (actual time=35559.958..35559.959 rows=1 loops=1)
-> Seq Scan on expanded_flows (cost=0.00..1168410.08 rows=81000208 width=4) (actual time=0.057..7235.677 rows=81000000 loops=1)
Planning Time: 0.092 ms
Execution Time: 35559.996 ms
(4 rows)
set enable_seqscan = off;
-- analyzed, seqscans off, it uses the index taking... about 35 seconds.
explain analyze SELECT count( distinct action_id ) FROM expanded_flows ;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2307148.21..2307148.22 rows=1 width=8) (actual time=35033.378..35033.379 rows=1 loops=1)
-> Index Only Scan using action_idx on expanded_flows (cost=0.57..2104647.69 rows=81000208 width=4) (actual time=1.315..8392.896 rows=81000000 loops=1)
Heap Fetches: 0
Planning Time: 0.089 ms
Execution Time: 35033.419 ms
The planner was right: the index is no faster than the sequential scan. It takes about 35 seconds, which still seems pretty slow for reading 90 distinct keys out of a btree, but it's not 30 minutes. Your mileage may vary.
(2018 MacBook Pro, 16 GB, and (critically) an SSD).
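Worth noting: even the index-only scan above reads all 81M index entries. You can emulate what other databases call a "loose index scan" with a recursive CTE that hops from one distinct value to the next, touching only ~90 index entries instead of 81M. A sketch against the test table above (swap in your own table and column names):

```sql
WITH RECURSIVE t AS (
    (SELECT action_id FROM expanded_flows ORDER BY action_id LIMIT 1)
    UNION ALL
    SELECT (SELECT f.action_id
            FROM expanded_flows f
            WHERE f.action_id > t.action_id
            ORDER BY f.action_id
            LIMIT 1)
    FROM t
    WHERE t.action_id IS NOT NULL
)
SELECT count(action_id) FROM t;
```

With the btree index in place, each hop is a single index probe, so this typically finishes in milliseconds rather than minutes.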
Upvotes: 2