Version 2: How can I write a postgres sql query that returns M distinct values of a certain column with M + unknown overall records returned?

Question

This is a slightly harder version of what I posted earlier. I don't want to edit the original to ruin a perfectly good answer to the question I had at the time.

Suppose a table named tau.

tau:
A | B | C
----------    
2 - 1 - red
3 - 1 - rod
4 - 1 - rope
6 - 5 - red
7 - 5 - rap
8 - 5 - rod
9 - 10 -rod
11- 10 -road
12- 13 -rud

Column A is the primary key. It will be unique. Column B is a foreign key. In my database no integer key is ever the same across tables. Column C is not a key.

This table will have a lot of rows. There are many other columns, say column C for instance, that are indexed for search.

I want exactly M distinct values from column B. Lets say M = 2

Importantly, I am using column B in the expression (it's indexed too!) to determine which B's to return.

Normally, I would go

select distinct B From tau Where C like 'r_d' AND B < 13 Order By B Desc Limit 2

and I get

B
-----
10
5

This is the current state of affairs, now. But I want to upgrade to a new scenario:

When the expression on B C is satisfied and adds the record to the results pile (C like 'r_d' AND B < 13), I want to return column A as well as column B, while maintaining the restriction of 2 at most distinct column B values.

Another important condition is the solution must work for expressions in the where clause that return true for possibly different values of C.

Behold, the results I want.

A | B
------
9 - 10
8 - 5
6 - 5

The trouble is, the actual limit of records returned can be greater than M, or 2 in the example. I don't really care how many records come through, as long as there are only M distinct values of B. How would I write a query to accomplish this in Postgresql?

Vladimir Baranov · Accepted Answer

DENSE_RANK should do what you need. Here is SQL Fiddle. Put your M into WHERE R.rnk <= 2 instead of 2.

Rather than starting with M distinct values from B and trying to figure out how to add missing values from column A to the final result, I'm thinking about this problem from the other side. We have some search conditions that limit the whole table to some intermediary result set:

SELECT *
FROM tau
WHERE C LIKE 'r_d' AND B < 13

This intermediary result set has all columns from the table, nothing is aggregated yet. We just need to filter this set further and leave only M distinct values of the B column. DENSE_RANK assigns a number (without gaps) to each group of B values, so we can use it in the final filter.

SELECT
    A, B
FROM
    (
        SELECT
            A
            ,B
            ,C
            ,DENSE_RANK() OVER(ORDER BY B DESC) AS rnk
        FROM tau
        WHERE C LIKE 'r_d' AND B < 13
    ) AS R
WHERE R.rnk <= 2
ORDER BY B DESC, A DESC;

result set

Second variant using LATERAL JOIN

SQL Fiddle with both variants. In the second variant we are finding M distinct values of B at first. Then for each found value we run the main query with extra filter for this B value. If there are millions of different values of B, then filtering to specific value should be efficient, provided there is an index on B. If there are millions of rows, but number of total distinct values of B is small, it should be slow.

WITH
CTE
AS
(
    select distinct B 
    From tau 
    Where C like 'r_d' AND B < 13 
    Order By B Desc 
    Limit 2
)
SELECT
  T.A, T.B, T.C
FROM
  CTE
  INNER JOIN LATERAL
  (
      SELECT tau.A, tau.B, tau.C
      FROM tau
      WHERE tau.B = CTE.B AND tau.C like 'r_d' AND tau.B < 13 
  ) AS T ON true
ORDER BY T.B DESC, T.A DESC;

Version 2: How can I write a postgres sql query that returns M distinct values of a certain column with M + unknown overall records returned?

Answers (1)

Related Questions