Reputation: 724
I have a table that contains ~1mil rows that looks like so
GROUP | ID
A AED12
A 6D04K
A AED12
B VFR54
B VFR54
B 9KMN2
B AED12
C 9KMN2
C 9KMN2
C VFR54
I would like to have an output that counts the number of duplicates that is grouped by 'GROUP' WHILE also seeing which 'IDs' are present across multiple groups (reference 'AED12' that is in both GROUP A and B).
I am using this query for first question of duplicates across each group
select GROUP, ID, count(ID)
from TABLE1
group by GROUP, ID
having count(ID) > 1
;
output of query above
GROUP | ID | COUNT
A AED12 2
B VFR54 2
C 9KMN2 2
I would like to sum the number of duplicates per group (how many dups are in Group A, B, etc.) which would like so (made up #s for example purposes)
GROUP | Sum of Dups IDs
A 2134
B 23321
C 11235
While also identifying which IDs are present in more than one group and count how many are present across the groups. Output would be something like a case when field that says is ID present in more than 1 group, if so 'true', if not - 'false'. Still thinking about how to count/group what the output would look like since there are million plus rows..
Any help would be much appreciated - new to Snowflake.
Upvotes: 2
Views: 1613
Reputation: 1269503
You can count the number of different groups an id appears in using window functions and then aggregate. Here is an example counting singletons and pairs:
select group,
count(*) as num_elements,
count_if(num_groups = 1) as num_onesies,
count_if(num_groups = 2) as num_twosies
from (select t.*,
count(*) over (partition by id) as num_groups
from table1 t
) t
group by group;
EDIT:
If you have duplicates, it is probably best to remove them before doing the above processing:
select group,
count(*) as num_elements,
count_if(num_groups = 1) as num_onesies,
count_if(num_groups = 2) as num_twosies
from (select t.*,
count(*) over (partition by id) as num_groups
from (select distinct group, id
from table1 t
) t
) t
group by group
Upvotes: 2