milestrong

Reputation: 257

PostgreSQL ORDER BY taking extremely long

I have a query that looks something like this:

SELECT DISTINCT
  COALESCE(fa.id, fb.id) AS id,
  COALESCE(fa.d_id, fb.d_id) AS d_id,
  COALESCE(fa.name, fb.name) AS name,
  COALESCE(fa.disabled, fb.disabled) AS disabled,
  COALESCE(fa.deleted, fb.deleted) AS deleted
FROM (
  SELECT * FROM table WHERE name LIKE '%'
  AND d_id IS NULL AND deleted = false
) fa
FULL JOIN (
  SELECT * FROM table WHERE name LIKE '%'
  AND d_id = 1 AND deleted = false
) fb ON fa.name = fb.name
ORDER BY name;

where id is the table's primary key, name holds the actual value, and d_id is the id of a user.

Basically, the table holds a huge list of names (400k+ rows). If a row has no d_id, it was automatically generated by the system; if it has one, it was user-generated.
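
For reference, here's a rough sketch of the schema (simplified; name is citext, as the '%'::citext cast in the plan below shows, and the other column types are my best guess):

-- requires the citext extension: CREATE EXTENSION citext;
CREATE TABLE "table" (
  id       serial PRIMARY KEY,            -- integer surrogate key (type approximate)
  d_id     integer,                       -- NULL = system-generated, otherwise the creating user's id
  name     citext NOT NULL,               -- citext per the plan's '%'::citext cast
  disabled boolean NOT NULL DEFAULT false,
  deleted  boolean NOT NULL DEFAULT false,
  UNIQUE (d_id, name)                     -- corresponds to table__d_id__name__idx
);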

The query is supposed to return the entire list of default system names plus the names that a particular user has added (here, all names created by the user with a d_id of 1). This is why it performs a full join of the table with itself.

My problem is that the query takes far too long to run (about 30000~40000ms in my local psql shell, and ~15000ms on live). I've run EXPLAIN ANALYZE and got this:

Unique  (cost=8240.78..8272.13 rows=2090 width=42) (actual time=27591.662..28742.062 rows=418018 loops=1)
  ->  Sort  (cost=8240.78..8246.01 rows=2090 width=42) (actual time=27591.659..28504.606 rows=418018 loops=1)
        Sort Key: (COALESCE(table.name, table_1.name)), (COALESCE(table.id, table_1.id)), (COALESCE(table.d_id, table_1.d_id)), (COALESCE(table.disabled, table_1.disabled)), (COALESCE(table.deleted, table_1.deleted))
        Sort Method: external merge  Disk: 13680kB
        ->  Hash Full Join  (cost=8.45..8125.53 rows=2090 width=42) (actual time=11.037..1479.053 rows=418018 loops=1)
              Hash Cond: (table.name = table_1.name)
              ->  Seq Scan on table  (cost=0.00..8109.23 rows=2090 width=27) (actual time=0.048..799.822 rows=418018 loops=1)
                    Filter: ((d_id IS NULL) AND (NOT deleted) AND (name ~~ '%'::citext))
              ->  Hash  (cost=8.44..8.44 rows=1 width=27) (actual time=10.970..10.970 rows=0 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 8kB
                    ->  Index Scan using table__d_id__name__idx on table table_1  (cost=0.42..8.44 rows=1 width=27) (actual time=10.970..10.970 rows=0 loops=1)
                          Index Cond: (d_id = 1)
                          Filter: ((NOT deleted) AND (name ~~ '%'::citext))

And while I can't read the plan fully, I can tell that most of the time is spent in the sort (the ORDER BY); the "Sort Method: external merge Disk: 13680kB" line shows the sort spilling to disk.

My indexes are as follows:

Indexes:
    "table_pkey" PRIMARY KEY, btree (id)
    "table__d_id__name__idx" UNIQUE, btree (d_id, name)
    "table__name__idx" gist (name gist_trgm_ops)
    "table__id__idx" btree (id)

I've tried different indexes, refactored the query, and played around with the code, but it still takes just as long. Removing all but the primary key index somehow sped the query up to ~23000ms.

Also, in the app, the user can select a letter to get all results starting with that letter, in which case the query uses WHERE name LIKE 'a%'. Even though that still returns tens of thousands of rows, specifying a starting letter drastically cuts the load time to about 1000-2000ms.

I'm aiming to get the query down to 5000-10000ms. Any help or suggestions would be greatly appreciated!

Upvotes: 2

Views: 1918

Answers (2)

Laurenz Albe

Reputation: 247950

The large sort is the problem.

You could get rid of the sort if you didn't use DISTINCT.
I see that in your case the rows are all different anyway, since there are 418018 rows both before and after the Unique node. Think carefully about whether duplicates can really happen in your case, or whether you can do away with DISTINCT and solve the problem that way.

If you need DISTINCT, you should increase work_mem at least for this query, so that the sort can happen in memory rather than spilling to disk. This would improve the performance considerably.
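
For example, a minimal sketch of raising it for a single transaction only (the 64MB value is just a guess; pick something comfortably above the 13680kB the sort wrote to disk):

BEGIN;
-- SET LOCAL applies only until COMMIT/ROLLBACK, so other queries are unaffected
SET LOCAL work_mem = '64MB';
-- ... run the original query here ...
COMMIT;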

Upvotes: 0

FuzzyTree

Reputation: 32402

I think you can use an or instead of a full join. distinct on (name) keeps only one row per name, and order by name, d_id nulls first selects system names before user names.

select distinct on (name)
    id, d_id, name, disabled, deleted
from table
where deleted = false
and (
    d_id is null
    or d_id = 1
)
order by name, d_id nulls first
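
Note that nulls first is needed because PostgreSQL sorts NULLs last by default in ascending order; without it, the user's row (d_id = 1) would win over the system row (d_id is null) whenever both have the same name.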

Upvotes: 2
