How to count duplicate values across levels

Question

Here is a sample of my dataset:

data have;  
input v1 v2 $;
cards; 
2015  123
2015  5a2
2015  544
2015  546
2015  678
2015  989
2016  123
2016  123
2016  222
2016  434
2016  444
2016  586
2016  675
2016  970
2016  98
2017  123
2017  123
2017  345
2017  544
2017  544
2017  675
;;;;
run;

My goal is to find the rate at which duplicate values occur in 2017 and/or 2015 and 2016. For example, if '123' occurs n times in 2017, I want to count each recurrence in 2017, but I want to count the number of times it occurs in either 2015 or 2016 once. So in the case of the sample data, the rate for '123' would be .5. Because it occurs twice in 2017, and at least once in the other two years, and we have 6 observations for 2017.

I am new to sql, and have been trying to build off of this code, but obviously this doesn't help me when counting across the levels of v1.

 proc sql;
    select count(distinct v1) as variablewant 
            from have
                group by v1
;
quit;

Thank you.

Edit:

My desired output would be as follows:

2017: .8333

This is because 5 of the 6 values in 2017 occur either in 2017 or one of the two preceding years.

Richard · Accepted Answer

You can perform a reflexive (i.e. self) join to obtain the result. In the case of a left join, when the right table does not meet the on condition the any right columns referenced will be null. This fact is used in the case statement to count the number of v2's that occurred in prior years.

proc sql;
  create table want as
  select 
    year, sum (case when BUTTER.v2 is not null then 1 else 0 end) / count (*) as ratio
  from
    have as PEANUT
  left join
    (select distinct v2 from have where year < 2017) as BUTTER
  on
    PEANUT.v2 = BUTTER.v2
  where 
    year = 2017
  group by
    year
  ;

How to count duplicate values across levels

Answers (1)

Related Questions