Jani

Reputation: 557

Slow index scan with about 10 million rows

I have a table with about 10 million entries, which I'm trying to optimize.

create table houses
(
    id                                serial                          not null
        constraint houses_pkey
            primary key,
    secondary_id                      text                            not null,
    market                            integer                         not null,
    user_id                           uuid                            not null,
    status                            text      default ''::text      not null,
    custom                            boolean   default false,
    constraint houses_unique_constraint
        unique (user_id, market, secondary_id)
);

create index houses_user_index
    on houses (user_id);
create index houses_user_market_index
    on houses (user_id, market);
create index houses_user_status_index
    on houses (user_id, status);

I have a use case where I want to find all distinct non-null user_id and market combinations with the given statuses, together with whether any of the entries in each combination have their custom flag set. I'm using the following query, but it's very slow. Do you have any ideas what I could optimize here? Thank you!

postgres=# EXPLAIN ANALYZE VERBOSE SELECT DISTINCT user_id, market, bool_or(custom) 
FROM houses WHERE user_id IS NOT NULL 
AND status=ANY('{open, sold}') GROUP BY user_id, market;
                                                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1694157.78..1695700.38 rows=154260 width=21) (actual time=9574.290..9704.120 rows=809916 loops=1)
   Output: user_id, market, (bool_or(custom))
   ->  Sort  (cost=1694157.78..1694543.43 rows=154260 width=21) (actual time=9574.289..9625.108 rows=809916 loops=1)
         Output: user_id, market, (bool_or(custom))
         Sort Key: houses.user_id, houses.market, (bool_or(houses.custom))
         Sort Method: external sort  Disk: 24544kB
         ->  GroupAggregate  (cost=0.56..1677700.42 rows=154260 width=21) (actual time=0.396..9290.278 rows=809916 loops=1)
               Output: user_id, market, bool_or(custom)
               Group Key: houses.user_id, houses.market
               ->  Index Scan using houses_user_market_index on public.houses  (cost=0.56..1615726.52 rows=8057507 width=21) (actual time=0.350..8647.480 rows=8114889 loops=1)
                     Output: user_id, market, custom
                     Index Cond: (houses.user_id IS NOT NULL)
                      Filter: (houses.status = ANY ('{open,sold}'::text[]))
                     Rows Removed by Filter: 892609
 Planning time: 0.889 ms
 Execution time: 9729.300 ms
(16 rows)

I have tried adding more indices to cover the custom field as well, but it doesn't seem to make any difference.

Upvotes: 0

Views: 970

Answers (1)

jjanes

Reputation: 44157

No matter what, you are summarizing over 8 million rows. You might be able to improve things but don't expect any magic.

The first thing to do is to drop the DISTINCT, as the GROUP BY already renders that combination of columns distinct (though the planner does not seem to know that). But it looks like that will only save about 0.5 seconds.
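For reference, the same query with the DISTINCT dropped and everything else taken unchanged from the question:

SELECT user_id, market, bool_or(custom)
FROM houses
WHERE user_id IS NOT NULL
  AND status = ANY('{open,sold}')
GROUP BY user_id, market;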

In your existing plan, the index does not provide any usable selectivity. What it does offer is production of the data in an order which suits the GroupAggregate. But it still has to hop all round the table to pull out the additional columns, and I am surprised it finds this an attractive option. Perhaps that is because the table data is highly correlated on user_id, so doing this is mostly visiting the table pages in physical order.

Even if that is the case, it would be better to do an index-only scan, which you can get by creating a covering index on (user_id, market, status, custom). You don't need the INCLUDE feature to have a covering index, so being on v10 is not a problem--you just have to put the columns into the body of the index. It has been recommended to put status earlier in the index, but doing that will wreck the ordering property without providing any meaningful selectivity benefit.
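A sketch of such a covering index (the index name here is arbitrary; the column order follows the paragraph above):

CREATE INDEX houses_user_market_status_custom_index
    ON houses (user_id, market, status, custom);

-- Index-only scans also depend on the visibility map being reasonably
-- up to date, so vacuuming the table after bulk changes helps:
VACUUM (ANALYZE) houses;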

You might get some benefit from parallel execution by lowering parallel_tuple_cost (though in my testing it was actually a harm, not a benefit--maybe due to lousy hardware).
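For example, you could experiment in the current session only (the value 0.01 is just illustrative; the default is 0.1) and re-run the plan to see whether a parallel aggregate gets chosen:

SET parallel_tuple_cost = 0.01;
EXPLAIN ANALYZE VERBOSE
SELECT user_id, market, bool_or(custom)
FROM houses
WHERE user_id IS NOT NULL
  AND status = ANY('{open,sold}')
GROUP BY user_id, market;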

Upvotes: 1
