Dinho
Dinho

Reputation: 724

How to count duplicates based on group by as well as see if values are present in the same field values?

I have a table that contains ~1mil rows that looks like so

GROUP  |  ID  

A        AED12
A        6D04K
A        AED12 
B        VFR54
B        VFR54
B        9KMN2
B        AED12 
C        9KMN2
C        9KMN2
C        VFR54

I would like to have an output that counts the number of duplicates that is grouped by 'GROUP' WHILE also seeing which 'IDs' are present across multiple groups (reference 'AED12' that is in both GROUP A and B).

I am using this query for first question of duplicates across each group

select GROUP, ID, count(ID)
from TABLE1
group by GROUP, ID
having count(ID) > 1
;

output of query above

GROUP  |  ID   | COUNT

A        AED12     2
B        VFR54     2
C        9KMN2     2

I would like to sum the number of duplicates per group (how many dups are in Group A, B, etc.) which would like so (made up #s for example purposes)

GROUP | Sum of Dups IDs
A         2134
B         23321
C         11235

While also identifying which IDs are present in more than one group and count how many are present across the groups. Output would be something like a case when field that says is ID present in more than 1 group, if so 'true', if not - 'false'. Still thinking about how to count/group what the output would look like since there are million plus rows..

Any help would be much appreciated - new to Snowflake.

Upvotes: 2

Views: 1613

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1269503

You can count the number of different groups an id appears in using window functions and then aggregate. Here is an example counting singletons and pairs:

select group,
       count(*) as num_elements,
       count_if(num_groups = 1) as num_onesies,
       count_if(num_groups = 2) as num_twosies
from (select t.*,
             count(*) over (partition by id) as num_groups
      from table1 t
     ) t
group by group;

EDIT:

If you have duplicates, it is probably best to remove them before doing the above processing:

select group,
       count(*) as num_elements,
       count_if(num_groups = 1) as num_onesies,
       count_if(num_groups = 2) as num_twosies
from (select t.*,
             count(*) over (partition by id) as num_groups
      from (select distinct group, id
            from table1 t
           ) t
     ) t
group by group

Upvotes: 2

Related Questions