Distinct values by multiple columns after already applied GROUP BY

Question

Basically, I have the following query (actually more complex, but I think this simplification is ok):

SELECT a, b, x
FROM table

output:

 a | b | x
-----------
 1 | 2 | 34
 1 | 3 | 35
 1 | 3 | 36
 1 | 4 | 37
 2 | 3 | 38
 2 | 3 | 39
 2 | 4 | 40
 3 | 4 | 41
 3 | 5 | 42

To count the number of occurrence of each "pair of a and b", I'm using GROUP BY:

SELECT a, b, COUNT(x) AS count
FROM table
GROUP BY a, b
ORDER BY count

output:

 a | b | count
--------------
 1 | 2 | 1
 1 | 4 | 1
 2 | 4 | 1
 3 | 4 | 1
 3 | 5 | 1
 1 | 3 | 2
 2 | 3 | 2

What bothers me is the multiple occurence of a and b. I would like to keep the "count" as it is, but remove every following row, if a or b was already in a previous row. It would be a nice to have, if it would also remove a row, if the value of "a" appeared in a previous row as "b" and vice versa.

Preferred expected output:

 a | b | count
--------------
 1 | 2 | 1
 1 | 4 | 1    <- should not be in output since we had a=1
 2 | 4 | 1    <- should not be in output since we had b=
 3 | 4 | 1    
 3 | 5 | 1    <- should not be in output since we had a=3
 1 | 3 | 2    <- should not be in output since we had a=1 / a=3
 2 | 3 | 2    <- should not be in output since we had b=2 / a=3

Therefore, this:

 a | b | count
--------------
 1 | 2 | 1
 3 | 4 | 1

Alternative expected output, if the above would be too complex:

 a | b | count
--------------
 1 | 2 | 1
 1 | 4 | 1    <- should not be in output since we had a=1
 2 | 4 | 1    
 3 | 4 | 1    <- should not be in output since we had b=4
 3 | 5 | 1    
 1 | 3 | 2    <- should not be in output since we had a=1
 2 | 3 | 2    <- should not be in output since we had a=2

Therefore, this:

 a | b | count
--------------
 1 | 2 | 1
 2 | 4 | 1    
 3 | 5 | 1

John Bollinger · Accepted Answer

This is rather a mess of a problem, but here's something to consider:

SELECT a, b, count
FROM (
    SELECT a, b, count,
          rank() over (partition by b order by count, a) as b_rank
    FROM (
        SELECT a, b, count,
          rank() over (partition by a order by count, b) as a_rank
        FROM (
            SELECT a, b, COUNT(*) AS count
            FROM t
            GROUP BY a, b
            ORDER BY count
          ) pc
      ) pc2
    WHERE a_rank < 3
  ) pc3
WHERE b_rank = 1

Each a value will appear at most twice in the results, but b values will be unique. Some b values appearing in low-count pairs may not be reflected in the results. There is a trade-off between possible duplication of a and the number of b values that may be missed altogether: allowing more duplicates of a (by changing to, e.g, WHERE a_rank < 4) reduces the number of b values that may be missed.

Distinct values by multiple columns after already applied GROUP BY

Answers (2)

Related Questions