Reputation: 5246
As you can see I have pretty simple SQL statement:
SELECT DISTINCT("CITY" || ' | ' || "AREA" || ' | ' || "REGION") AS LOCATION
FROM youtube
The youtube
table which I use in query has ~ 25 million records. The query takes a very long time to complete (~25 seconds). I'm trying to speed up the request.
I create an index as shown below but my query higher still takes the same time to complete. Whatdid I do wrong? By the way, is it better to use “partitioning” in my case?
CREATE INDEX location_index ON youtube ("CITY", "AREA", "REGION")
EXPLAIN
returns:
Unique (cost=5984116.71..6111107.27 rows=96410 width=32)
-> Sort (cost=5984116.71..6047611.99 rows=25398112 width=32)
Sort Key: ((((("CITY" || ' | '::text) || "AREA") || ' | '::text) || "REGION"))
-> Seq Scan on youtube (cost=0.00..1037365.24 rows=25398112 width=32)
@george-joseph QUERY PLAN
of your script:
Upvotes: 1
Views: 1491
Reputation: 1270933
Indexes should be able to help. Try writing the query as:
SELECT DISTINCT ON (city, area, region) "CITY" || ' | ' || "AREA" || ' | ' || "REGION") AS LOCATION
FROM youtube
ORDER BY city, area, region;
This can take advantage of an index on (city, area, region)
.
Upvotes: 0
Reputation: 247920
Neither an index nor partitioning can help you here.
Since city
, area
and region
are (probably) closely correlated, the number of result rows will be much less than PostgreSQL estimates, because it assumes columns to be independent from each other.
So you should create extended statistics on those columns, a new feature introduced in PostgreSQL v10:
CREATE STATISTICS youtube_stats (ndistinct)
ON "CITY", "AREA", "REGION" FROM youtube;
ANALYZE youtube;
Now PostgreSQL has a better idea of how many different groups there are.
Then give the query a lot of memory so that it can get a hash with all these groups into memory. Then it can use a hash aggregate rather than sorting the rows:
SET work_mem = '1GB';
You may not need all that much memory; experiment to find a more reasonable limit.
Then try the query from George Joseph's answer:
SELECT x."CITY" || ' | ' || x."AREA" || ' | ' || x."REGION" AS location
FROM (SELECT DISTINCT "CITY", "AREA", "REGION"
FROM youtube) AS x;
Upvotes: 5
Reputation: 5932
Since you got an index on the columns, how does the query plan look like if you were to do as follows
SELECT x.city || ' | ' || x.area || ' | ' || x.region
FROM (SELECT DISTINCT city, area, region
FROM youtube) x
Upvotes: 1