Denis Arharov

Reputation: 111

PostgreSQL index reduces data size but makes the query slower

I have a PostgreSQL table with 7.9GB of JSON data. My goal is to run aggregations over the whole table on a daily basis; the aggregation results will later be used for analytical reports in Google Data Studio.

One of the queries I'm trying to run looks as follows:

explain analyze
select tender->>'procurementMethodType' as procurement_method,
       tender->>'status' as tender_status,
       sum(cast(tender->'value'->>'amount' as decimal)) as total_expected_value
from tenders
group by 1,2

The query plan and execution time are the following:

[query plan screenshot: query without any indexes]

The problem is that the database has to scan through all the 7.9GB of data, even though the query uses only 3 field values out of approximately 100. So I decided to create the following index:

create index on tenders ((tender->>'procurementMethodType'),
                         (tender->>'status'),
                         (cast(tender->'value'->>'amount' as decimal)))
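
To check the sizes, you can compare pg_table_size for the table (which includes the TOAST data where the jsonb actually lives) with pg_relation_size for the index. The index name below is a guess at the auto-generated one, so substitute your actual name:

-- compare on-disk sizes; the index name here is assumed,
-- look up the real one with \d tenders
select pg_size_pretty(pg_table_size('tenders')) as table_size,
       pg_size_pretty(pg_relation_size('tenders_expr_idx')) as index_size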

The size of the index is 44MB, which is much smaller than the entire table, so I expected the query to be much faster. However, when I run the same query with the index in place, I get the following result:

[query plan screenshot: the same query with the index]

The query with the index is slower! How can this be possible?

EDIT: the table itself contains just two columns, the ID column and the jsonb data column:

create table tenders (
   id uuid primary key,
   tender jsonb
)

Upvotes: 2

Views: 230

Answers (1)

jjanes

Reputation: 44373

The code that plans an index-only scan is somewhat deficient in this case. It thinks it needs the whole "tender" column to be available in the index in order to satisfy the reference to cast(tender->'value'->>'amount' as decimal); it fails to realize that having cast(tender->'value'->>'amount' as decimal) itself in the index obviates the need for "tender".

So it falls back to a regular index scan, in which it has to jump from the index to the table for every row it returns, to fetch "tender" and then compute cast(tender->'value'->>'amount' as decimal). That means jumping all over the table doing random IO, which is much slower than just reading the table sequentially and then doing a sort.
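
You can see which of the two the planner picked from the node type in the plan; adding the BUFFERS option also shows the heap reads caused by the random IO. A quick check, using your query from the question:

-- if the plan says "Index Scan" rather than "Index Only Scan",
-- every returned row also triggered a fetch from the heap
explain (analyze, buffers)
select tender->>'procurementMethodType' as procurement_method,
       tender->>'status' as tender_status,
       sum(cast(tender->'value'->>'amount' as decimal)) as total_expected_value
from tenders
group by 1,2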

You could try an index on ((tender->>'procurementMethodType'), (tender->>'status'), tender). This index would be huge (as large as the table) if it can even be built, but would take away the need for a sort.
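
Spelled out, that index would look like this (a sketch; whether it builds at all depends on your data, since a single large tender document can exceed the btree entry size limit):

-- the whole "tender" document is stored as the trailing index key,
-- so an index-only scan can answer the query without heap fetches;
-- a very large jsonb value will make the index build fail
create index on tenders ((tender->>'procurementMethodType'),
                         (tender->>'status'),
                         tender)

On PostgreSQL 11 or later the same idea can be written as a covering index with include (tender), which keeps the document out of the sort keys, though the entry size limit still applies.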

But your current query finishes in 30 seconds. For a query that is only run once a day, does it really need to be faster than this?

Upvotes: 1
