soerface
soerface

Reputation: 6813

Distinct values by multiple columns after already applied GROUP BY

Basically, I have the following query (actually more complex, but I think this simplification is ok):

SELECT a, b, x
FROM table

output:

 a | b | x
-----------
 1 | 2 | 34
 1 | 3 | 35
 1 | 3 | 36
 1 | 4 | 37
 2 | 3 | 38
 2 | 3 | 39
 2 | 4 | 40
 3 | 4 | 41
 3 | 5 | 42

To count the number of occurrence of each "pair of a and b", I'm using GROUP BY:

SELECT a, b, COUNT(x) AS count
FROM table
GROUP BY a, b
ORDER BY count

output:

 a | b | count
--------------
 1 | 2 | 1
 1 | 4 | 1
 2 | 4 | 1
 3 | 4 | 1
 3 | 5 | 1
 1 | 3 | 2
 2 | 3 | 2

What bothers me is the multiple occurence of a and b. I would like to keep the "count" as it is, but remove every following row, if a or b was already in a previous row. It would be a nice to have, if it would also remove a row, if the value of "a" appeared in a previous row as "b" and vice versa.

Preferred expected output:

 a | b | count
--------------
 1 | 2 | 1
 1 | 4 | 1    <- should not be in output since we had a=1
 2 | 4 | 1    <- should not be in output since we had b=
 3 | 4 | 1    
 3 | 5 | 1    <- should not be in output since we had a=3
 1 | 3 | 2    <- should not be in output since we had a=1 / a=3
 2 | 3 | 2    <- should not be in output since we had b=2 / a=3

Therefore, this:

 a | b | count
--------------
 1 | 2 | 1
 3 | 4 | 1    

Alternative expected output, if the above would be too complex:

 a | b | count
--------------
 1 | 2 | 1
 1 | 4 | 1    <- should not be in output since we had a=1
 2 | 4 | 1    
 3 | 4 | 1    <- should not be in output since we had b=4
 3 | 5 | 1    
 1 | 3 | 2    <- should not be in output since we had a=1
 2 | 3 | 2    <- should not be in output since we had a=2

Therefore, this:

 a | b | count
--------------
 1 | 2 | 1
 2 | 4 | 1    
 3 | 5 | 1    

Upvotes: 1

Views: 104

Answers (2)

nik
nik

Reputation: 134

This query will give you the desired output.

DECLARE @id INT = 1,
        @a INT,
        @b INT,
        @count INT

DECLARE @tbl TABLE
(
    id INT IDENTITY(1,1),
    a INT,
    b INT,
    count INT
)

INSERT INTO @tbl
SELECT a, b, COUNT(1) AS COUNT FROM dbo.myTable
GROUP BY a, b
ORDER BY COUNT,a,b

SELECT @count = COUNT(1) FROM @tbl

WHILE @id <= @count
BEGIN
    SELECT TOP 1 @a = a,@b = b FROM @tbl WHERE id = @id

    IF EXISTS(SELECT 1 FROM @tbl WHERE id < @id AND (a = @a OR b = @b))
        DELETE @tbl WHERE id = @id

    SET @id += 1
END

SELECT a,b,count FROM @tbl

Check it on SQLFiddle

Upvotes: 0

John Bollinger
John Bollinger

Reputation: 180058

This is rather a mess of a problem, but here's something to consider:

SELECT a, b, count
FROM (
    SELECT a, b, count,
          rank() over (partition by b order by count, a) as b_rank
    FROM (
        SELECT a, b, count,
          rank() over (partition by a order by count, b) as a_rank
        FROM (
            SELECT a, b, COUNT(*) AS count
            FROM t
            GROUP BY a, b
            ORDER BY count
          ) pc
      ) pc2
    WHERE a_rank < 3
  ) pc3
WHERE b_rank = 1

Each a value will appear at most twice in the results, but b values will be unique. Some b values appearing in low-count pairs may not be reflected in the results. There is a trade-off between possible duplication of a and the number of b values that may be missed altogether: allowing more duplicates of a (by changing to, e.g, WHERE a_rank < 4) reduces the number of b values that may be missed.

Upvotes: 2

Related Questions