mad-a
mad-a

Reputation: 173

SQL (Impala) selecting a count of distinct values in one column for each id

Given a table like the following:

+----+-----------+------------------+
| id | diagnosis | diagnosis_age    |
+----+-----------+------------------+
| 1  | 315.32000 | 2.18430371791803 |
| 1  | 315.32000 | 3.18430371791803 |
| 1  | 800.00000 | 2.18430371791803 |
| 2  | 315.32000 | 5.64822705794013 |
| 3  | 800.00000 | 5.68655778752176 |
| 3  | 120.12000 | 5.70572315231258 |
| 4  | 315.32000 | 5.72488851710339 |
| 4  | 315.32000 | 5.74405388189421 |
| 5  | 120.12000 | 5.7604813374292  |
| 6  | 315.32000 | 5.77993740687426 |
+----+-----------+------------------+

Each id can occur many times, each id can have many diagnosis, each id can have many diagnosis_age, and each row is unique.

I'm trying to find which id has the most distinct diagnoses.

Something like:

SELECT id, COUNT(*) AS diagnosis_count 
    FROM diagnoses
        GROUP BY id
    ORDER BY diagnosis_count DESC
LIMIT 5;

Returns:

+------+-----------------+
| id   | diagnosis_count |
+------+-----------------+
| 8    | 8557            |
| 99   | 7828            |
| 127  | 7592            |
| 39   | 6812            |
| 2173 | 6492            |
+------+-----------------+

But this doesn't really solve my problem. This works if you define each row as being a diagnosis (which I suppose you could, since each row is unique). If instead, you wanted to get a count of the distinct diagnosis codes for each id (id 1 would have 2 distinct diagnosis codes despite having 3 rows), or alternatively, get a count of the distinct diagnosis_ages for each id (id 4 would have 2 distinct diagnosis_age), how would you go about doing that?

Thanks in advance.

Upvotes: 1

Views: 1160

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1269743

Use count(distinct):

SELECT id, COUNT(distinct diagnosis) AS diagnosis_count 
FROM diagnoses
GROUP BY id
ORDER BY diagnosis_count DESC;

Upvotes: 1

Related Questions