Reputation: 2375
I have a table with around 54 million rows in a Postgres 9.6 DB and would like to find all distinct pairs of two columns (there are around 4 million such values). I have an index over the two columns of interest:
create index ab_index on tbl (a, b)
What is the most efficient way to get such pairs? I have tried:
select a,b
from tbl
where a>$previouslargesta
group by a,b
order by a,b
limit 1000
And also:
select distinct(a,b)
from tbl
where a>$previouslargesta
order by a,b
limit 1000
Also this recursive query:
with recursive t AS (
select min(a) AS a from tbl
union all
select (select min(a) from tbl where a > t.a)
FROM t)
select a FROM t
But all are slooooooow.
Is there a faster way to get this information?
Upvotes: 1
Views: 3322
Reputation: 611
I cannot guarantee performance on Postgres, but this is a technique I have used on SQL Server in a similar case, and it proved faster than other approaches:
1. Get distinct A into a temp table a.
2. Get distinct B into a temp table b.
3. Cross join the a and b temps into a Cartesian product, temp table abALL.
4. Rank abALL (optionally).
5. Create a view myview as select top 1 a, b from tbl (your main table).
6. Join temp abALL with myview into a temp table abCLEAN.
7. Rank abCLEAN here if you haven't ranked above.
A rough Postgres translation of these steps is sketched below.
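A minimal sketch of the steps above in Postgres syntax; the temp-table names are illustrative, and steps 5-6 are interpreted as keeping only the (a, b) pairs that actually occur in the main table:
CREATE TEMP TABLE temp_a AS SELECT DISTINCT a FROM tbl;
CREATE TEMP TABLE temp_b AS SELECT DISTINCT b FROM tbl;
-- Cartesian product of all candidate pairs
CREATE TEMP TABLE ab_all AS
SELECT ta.a, tb.b
FROM temp_a ta
CROSS JOIN temp_b tb;
-- keep only candidate pairs that actually exist in the main table
CREATE TEMP TABLE ab_clean AS
SELECT x.a, x.b
FROM ab_all x
WHERE EXISTS (SELECT 1 FROM tbl t WHERE t.a = x.a AND t.b = x.b);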
Upvotes: 0
Reputation: 658172
Your table has 54 million rows and ...
there are around 4 million such values
7.4% of all rows is a high percentage. An index can mostly only help by providing pre-sorted data, ideally in an index-only scan. There are more sophisticated techniques for smaller result sets (see below), and there are much faster ways for paging, which returns far fewer rows at a time (see below). But for the general case, a plain DISTINCT may be among the fastest:
SELECT DISTINCT a, b -- *no* parentheses
FROM tbl;
-- ORDER BY a, b -- ORDER BY wasn't mentioned as a requirement ...
Don't confuse it with DISTINCT ON, which would require parentheses.
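For illustration only (not what you need here): DISTINCT ON returns one whole row per distinct combination, and the parentheses are mandatory:
SELECT DISTINCT ON (a, b) *  -- one arbitrary complete row per (a, b) pair
FROM tbl
ORDER BY a, b;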
The B-tree index ab_index you have on (a, b) is already the best index for this. It has to be scanned in its entirety, though. The challenge is to have enough work_mem to process everything in RAM. With standard settings the index occupies at least 1831 MB on disk, typically more with some bloat. If you can afford it, run the query with a work_mem setting of 2 GB (or more) in your session:
SET work_mem = '2 GB';
SELECT DISTINCT a, b ...
RESET work_mem;
A read-only table helps. Otherwise you need VACUUM settings aggressive enough to allow an index-only scan. And some more RAM (with appropriate settings) would help to keep the index cached.
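As one possible sketch, per-table autovacuum settings can be tightened so the visibility map stays current enough for index-only scans; the values are illustrative, not a recommendation:
ALTER TABLE tbl SET (
  autovacuum_vacuum_scale_factor = 0.01,   -- vacuum after ~1 % of rows changed
  autovacuum_analyze_scale_factor = 0.01   -- analyze after ~1 % of rows changed
);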
Also upgrade to the latest version of Postgres (11.3 as of writing). There have been many improvements for big data.
If you want to add paging as indicated by your sample query, urgently consider row value comparison:
SELECT DISTINCT a, b
FROM tbl
WHERE (a, b) > ($previous_a, $previous_b) -- !!!
ORDER BY a, b
LIMIT 1000;
This may or may not be faster for the general big query as well. For a small subset, it becomes much more attractive:
WITH RECURSIVE cte AS (
( -- parentheses required due to LIMIT 1
SELECT a, b
FROM tbl
WHERE (a, b) > ($previous_a, $previous_b) -- !!!
ORDER BY a, b
LIMIT 1
)
UNION ALL
SELECT x.a, x.b
FROM cte c
CROSS JOIN LATERAL (
SELECT t.a, t.b
FROM tbl t
WHERE (t.a, t.b) > (c.a, c.b) -- lateral reference
ORDER BY t.a, t.b
LIMIT 1
) x
)
TABLE cte
LIMIT 1000;
This can make perfect use of your index and should be as fast as it gets.
For repeated use on a table with little or no write load, consider a MATERIALIZED VIEW based on one of the above queries, for much faster read performance.
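A minimal sketch with an illustrative name; the unique index is what later allows a non-blocking refresh:
CREATE MATERIALIZED VIEW mv_ab AS
SELECT DISTINCT a, b
FROM tbl;
-- a unique index is required for REFRESH ... CONCURRENTLY
CREATE UNIQUE INDEX ON mv_ab (a, b);
-- after writes to tbl, rebuild without blocking readers:
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_ab;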
Upvotes: 1