Nurzhan Nogerbek
Nurzhan Nogerbek

Reputation: 5246

How to speed up query with DISTINCT in PostgreSQL?

As you can see I have pretty simple SQL statement:

SELECT DISTINCT("CITY" || ' | '  || "AREA" || ' | ' || "REGION") AS LOCATION
FROM youtube

The youtube table which I use in query has ~ 25 million records. The query takes a very long time to complete (~25 seconds). I'm trying to speed up the request.

I create an index as shown below but my query higher still takes the same time to complete. Whatdid I do wrong? By the way, is it better to use “partitioning” in my case?

CREATE INDEX location_index ON youtube ("CITY", "AREA", "REGION")

EXPLAIN returns:

Unique (cost=5984116.71..6111107.27 rows=96410 width=32)
-> Sort (cost=5984116.71..6047611.99 rows=25398112 width=32)
   Sort Key: ((((("CITY" || ' | '::text) || "AREA") || ' | '::text) || "REGION"))
   -> Seq Scan on youtube (cost=0.00..1037365.24 rows=25398112 width=32) 

@george-joseph QUERY PLAN of your script:

enter image description here

Upvotes: 1

Views: 1491

Answers (3)

Gordon Linoff
Gordon Linoff

Reputation: 1270933

Indexes should be able to help. Try writing the query as:

SELECT DISTINCT ON (city, area, region) "CITY" || ' | '  || "AREA" || ' | ' || "REGION") AS LOCATION
FROM youtube
ORDER BY city, area, region;

This can take advantage of an index on (city, area, region).

Upvotes: 0

Laurenz Albe
Laurenz Albe

Reputation: 247920

Neither an index nor partitioning can help you here.

Since city, area and region are (probably) closely correlated, the number of result rows will be much less than PostgreSQL estimates, because it assumes columns to be independent from each other.

So you should create extended statistics on those columns, a new feature introduced in PostgreSQL v10:

CREATE STATISTICS youtube_stats (ndistinct)
   ON "CITY", "AREA", "REGION" FROM youtube;

ANALYZE youtube;

Now PostgreSQL has a better idea of how many different groups there are.

Then give the query a lot of memory so that it can get a hash with all these groups into memory. Then it can use a hash aggregate rather than sorting the rows:

SET work_mem = '1GB';

You may not need all that much memory; experiment to find a more reasonable limit.

Then try the query from George Joseph's answer:

SELECT x."CITY" || ' | '  || x."AREA" || ' | ' || x."REGION" AS location
FROM (SELECT DISTINCT "CITY", "AREA", "REGION"
      FROM youtube) AS x;

Upvotes: 5

George Joseph
George Joseph

Reputation: 5932

Since you got an index on the columns, how does the query plan look like if you were to do as follows

SELECT x.city || ' | '  || x.area || ' | ' || x.region
FROM (SELECT DISTINCT city, area, region
      FROM youtube) x 

Upvotes: 1

Related Questions