bm1729

Reputation: 2375

Efficiently selecting distinct (a, b) from big table

I have a table with around 54 million rows in a Postgres 9.6 DB and would like to find all distinct pairs of two columns (there are around 4 million such values). I have an index over the two columns of interest:

create index ab_index on tbl (a, b)

What is the most efficient way to get such pairs? I have tried:

select a,b
from tbl
where a>$previouslargesta
group by a,b
order by a,b
limit 1000

And also:

select distinct(a,b)
from tbl
where a>$previouslargesta
order by a,b
limit 1000

Also this recursive query:

with recursive t AS (
  select min(a) AS a from tbl
  union all
  select (select min(a) from tbl where a > t.a)  -- "tickets" was a typo for "tbl"
  FROM t
  where t.a is not null)  -- stop once the maximum is reached
select a FROM t
where a is not null

But all are slooooooow.

Is there a faster way to get this information?

Upvotes: 1

Views: 3322

Answers (2)

Stefanos Zilellis

Reputation: 611

I can't guarantee performance on Postgres, but this is a technique I have used on SQL Server in a similar case, and it proved faster than the alternatives:

1. Get the distinct a values into a temp table a.

2. Get the distinct b values into a temp table b.

3. Cross join temp tables a and b into a Cartesian product, temp table abALL.

4. Rank abALL (optional).

5. Create a view: create view myview as select top 1 a, b from tbl (your main table).

6. Join temp table abALL with myview into a temp table abCLEAN.

7. Rank abCLEAN here, if you haven't ranked above.
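The steps above might translate to Postgres roughly like this. All table names (ta, tb, ab_all, ab_clean) are made up for illustration, TOP 1 becomes LIMIT 1, and this is an untested sketch: the cross join can explode if a and b each have many distinct values.

```sql
-- Step 1 & 2: distinct values of each column into temp tables
create temp table ta as select distinct a from tbl;
create temp table tb as select distinct b from tbl;

-- Step 3: Cartesian product of the two value sets
create temp table ab_all as
select ta.a, tb.b from ta cross join tb;

-- Step 6: keep only combinations that actually occur in the main table
create temp table ab_clean as
select x.a, x.b
from   ab_all x
where  exists (select 1 from tbl t where t.a = x.a and t.b = x.b);
```

Note that the semi-join in the last step does the real filtering; on Postgres a plain SELECT DISTINCT a, b would reach the same result without materializing the Cartesian product.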

Upvotes: 0

Erwin Brandstetter

Reputation: 658172

Your table has 54 million rows and ...

there are around 4 million such values

7.4 % of all rows is a high percentage; an index can mostly only help by providing pre-sorted data, ideally in an index-only scan. There are more sophisticated techniques for smaller result sets (see below), and there are much faster ways for paging, which returns far fewer rows at a time (see below), but for the general case a plain DISTINCT may be among the fastest:

SELECT DISTINCT a, b  -- *no* parentheses
FROM   tbl;
-- ORDER  BY a, b      -- ORDER BY was not mentioned as a requirement ...

Don't confuse it with DISTINCT ON, which would require parentheses.

The B-tree index ab_index you have on (a, b) is already the best index for this. It has to be scanned in its entirety, though. The challenge is to have enough work_mem to process it all in RAM. With standard settings the index occupies at least 1831 MB on disk, typically more due to some bloat. If you can afford it, run the query with a work_mem setting of 2 GB (or more) in your session:

SET work_mem = '2 GB';
SELECT DISTINCT a, b ...
RESET work_mem;

A read-only table helps. Otherwise you need sufficiently aggressive VACUUM settings to allow an index-only scan. And more RAM (with appropriate settings) would help to keep the index cached.
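If the table does see writes, per-table autovacuum settings can keep the visibility map current enough for index-only scans. A sketch with illustrative thresholds (the actual values depend on your write volume, and tbl is the table from the question):

```sql
-- Vacuum more aggressively than the defaults; numbers are illustrative only
ALTER TABLE tbl SET (
  autovacuum_vacuum_scale_factor  = 0.01,   -- vacuum after ~1 % dead rows
  autovacuum_analyze_scale_factor = 0.02
);

-- A manual run also refreshes the visibility map immediately
VACUUM (ANALYZE) tbl;
```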

Also upgrade to the latest version of Postgres (11.3 as of this writing). There have been many improvements for big data.

Paging

If you want to add paging as indicated by your sample queries, urgently consider a row value comparison:

SELECT DISTINCT a, b
FROM   tbl
WHERE  (a, b) > ($previous_a, $previous_b)   -- !!!
ORDER  BY a, b
LIMIT  1000;

Recursive CTE

This may or may not be faster for the general big query as well. For a small subset, it becomes much more attractive:

WITH RECURSIVE cte AS (
   (  -- parentheses required due to LIMIT 1
   SELECT a, b
   FROM   tbl
   WHERE  (a, b) > ($previous_a, $previous_b)   -- !!!
   ORDER  BY a, b
   LIMIT  1
   )
   UNION ALL
   SELECT x.a, x.b
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT t.a, t.b
      FROM   tbl t
      WHERE  (t.a, t.b) > (c.a, c.b) -- lateral reference
      ORDER  BY t.a, t.b
      LIMIT  1
      ) x
   )
TABLE cte
LIMIT 1000;

This can make perfect use of your index and should be as fast as it gets.


For repeated use, and little or no write load on the table, consider a MATERIALIZED VIEW based on one of the above queries, for much faster read performance.
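Such a materialized view might look like the following sketch (the names tbl_ab and tbl_ab_uni are assumptions; the unique index speeds up reads and is also required for REFRESH ... CONCURRENTLY):

```sql
-- Precompute the distinct pairs once
CREATE MATERIALIZED VIEW tbl_ab AS
SELECT DISTINCT a, b FROM tbl;

-- Unique index: fast lookups, and a prerequisite for concurrent refresh
CREATE UNIQUE INDEX tbl_ab_uni ON tbl_ab (a, b);

-- After (batches of) writes to tbl, without blocking readers:
REFRESH MATERIALIZED VIEW CONCURRENTLY tbl_ab;
```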

Upvotes: 1
