marie20
marie20

Reputation: 783

Generate combinations and identify which occurs the most

I have a table with 3 columns. I need to generate combinations (always in 3's) of the values (in "tests"column) of the table that were performed based on a particular "diagnosis". However, it is possible that a particular diagnosis has 2 or lesser tests, in which case the logic would still output that combination, albeit with 2 values. With reference to the table below, for each cust_id there is a "diagnosis" column based on which the "tests" were performed. Now for every diagnosis value group, I need to generate unique combinations of the corresponding values in the "tests" column. Note, the combinations should always be with 3 values ( where values >= 3), but for diagnosis which have lesser than 3 values (1 or 2), the corresponding combination should still be output (with the available 1 or 2 values and substitute Null in place of the unavailable value).

PATIENT:

pat_id | diagnosis | tests

1001 | Thyroid | CAT
1001 | Thyroid | MRI
1001 | Thyroid | Blood

1001 | Tonsil  | CAT
1001 | Tonsil  | MRI
1001 | Tonsil  | Blood
1001 | Tonsil  | RAPID

1002 | Pneumonia | MRI
1002 | Pneumonia | Eliza

1003 | Bronchitis | X-Ray

So, for pat_id = '1001' and diagnosis = 'Thyroid' we see that "tests" have 3 distinct values. Hence, only 1 unique combination is possible i.e. {CAT, MRI, Blood}. Similarly, for pat_id = '1001' and diagnosis = 'Tonsil', we see there are 4 distinct values in "test" column. Hence, there will be 4 combinations i.e. {CAT, MRI, Blood}, {CAT, MRI, RAPID}, {MRI, Blood, RAPID} & {CAT, blood, RAPID}. For pat_id = '1002' only two unique values exists. Hence the combination will be only 1 i.e. {MRI, Eliza}. Similarly for, pat_id = '1003' has only 1 value i.e. X-Ray, hence the output should be {X-Ray} for '1003'.

Like this, I need to generate similar combinations for all diagnosis values in a group and finally, identify the unique combination that occurs the max number of times in that table. The output should be the combination that occurs the most in the table.

So far, the sql below returns all the combinations that have 3 or more values. But it is failing to output the ones which have lesser than 3 values. Meaning, 1002 & 1003 are not being output since they have lesser than 3 values but needs to be output. The solution needs to handle such cases as well.

select p1.pat_id, p1.diagnosis, p1.tests, p2.tests, p3.tests
from patient p1 join
     patient p2
     on p1.pat_id = p2.pat_id and p1.diagnosis = p2.diagnosis and
        p1.tests < p2.tests join
     patient p3
     on p2.pat_id = p3.pat_id and p2.diagnosis = p3.diagnosis and
        p2.tests < p3.tests ;

Also please indicate how I can identify which combination occurs the most. Thanks.

Upvotes: 1

Views: 39

Answers (2)

Paul Spiegel
Paul Spiegel

Reputation: 31772

You can use your query with left joins to allow the second and the third tests to be NULL. But then you will need to remove rows with NULLs for groups with two or more tests. You can use correlated (dependent) COUNT(*) subquery to achieve that:

select
  p1.pat_id,
  p1.diagnosis,
  p1.tests as test1,
  p2.tests as test2,
  p3.tests as test3
from patient p1
left join patient p2
  on  p2.diagnosis = p1.diagnosis
  and p2.pat_id = p1.pat_id
  and p2.tests > p1.tests
left join patient p3
  on  p3.diagnosis = p1.diagnosis
  and p3.pat_id = p1.pat_id
  and p3.tests > p2.tests
where
  case (
    select count(*)
    from patient p
    where p.diagnosis = p1.diagnosis
      and p.pat_id = p1.pat_id
  )
    when 1 then true
    when 2 then p2.tests is not null
    else        p3.tests is not null
  end
order by p1.pat_id, p1.diagnosis

Result:

| pat_id | diagnosis  | test1 | test2 | test3 |
| ------ | ---------- | ----- | ----- | ----- |
| 1001   | Thyroid    | Blood | CAT   | MRI   |
| 1001   | Tonsil     | CAT   | MRI   | RAPID |
| 1001   | Tonsil     | Blood | MRI   | RAPID |
| 1001   | Tonsil     | Blood | CAT   | MRI   |
| 1001   | Tonsil     | Blood | CAT   | RAPID |
| 1002   | Pneumonia  | Eliza | MRI   |       |
| 1003   | Bronchitis | X-Ray |       |       |

View on DB Fiddle

To order distinct combinations by the number of their occurences, you just need to modify it to a GROUP BY ... ORDER BY COUNT(*) query:

select 
  p1.tests as test1,
  p2.tests as test2,
  p3.tests as test3,
  count(*) as cnt
from patient p1
left join patient p2
  on  p2.diagnosis = p1.diagnosis
  and p2.pat_id = p1.pat_id
  and p2.tests > p1.tests
left join patient p3
  on  p3.diagnosis = p1.diagnosis
  and p3.pat_id = p1.pat_id
  and p3.tests > p2.tests
where
  case (
    select count(*)
    from patient p
    where p.diagnosis = p1.diagnosis
      and p.pat_id = p1.pat_id
  )
    when 1 then true
    when 2 then p2.tests is not null
    else        p3.tests is not null
  end
group by p1.tests, p2.tests, p3.tests
order by cnt desc

Result:

| test1 | test2 | test3 | cnt |
| ----- | ----- | ----- | --- |
| Blood | CAT   | MRI   | 2   |
| CAT   | MRI   | RAPID | 1   |
| Blood | MRI   | RAPID | 1   |
| Eliza | MRI   |       | 1   |
| X-Ray |       |       | 1   |
| Blood | CAT   | RAPID | 1   |

View on DB Fiddle

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269543

I think you want left join and group by:

select p1.tests, p2.tests, p3.tests, count(*)
from patient p1 left join
     patient p2
     on p1.pat_id = p2.pat_id and p1.diagnosis = p2.diagnosis and
        p1.tests < p2.tests left join
     patient p3
     on p2.pat_id = p3.pat_id and p2.diagnosis = p3.diagnosis and
        p2.tests < p3.tests
group by p1.tests, p2.tests, p3.tests
order by count(*) desc;

I am not sure if diagnosis should also be part of the result set. As you describe the results, it does not seem to be, but to me it would make sense.

Upvotes: 0

Related Questions