Reputation: 13
Here is a sample of my dataset:
data have;
input v1 v2 $;
cards;
2015 123
2015 5a2
2015 544
2015 546
2015 678
2015 989
2016 123
2016 123
2016 222
2016 434
2016 444
2016 586
2016 675
2016 970
2016 98
2017 123
2017 123
2017 345
2017 544
2017 544
2017 675
;;;;
run;
My goal is to find the rate at which duplicate values occur in 2017 and/or 2015 and 2016. For example, if '123' occurs n times in 2017, I want to count each recurrence in 2017, but I want to count the number of times it occurs in either 2015 or 2016 once. So in the case of the sample data, the rate for '123' would be .5. Because it occurs twice in 2017, and at least once in the other two years, and we have 6 observations for 2017.
I am new to sql, and have been trying to build off of this code, but obviously this doesn't help me when counting across the levels of v1.
proc sql;
select count(distinct v1) as variablewant
from have
group by v1
;
quit;
Thank you.
Edit:
My desired output would be as follows:
2017: .8333
This is because 5 of the 6 values in 2017 occur either in 2017 or one of the two preceding years.
Upvotes: 0
Views: 71
Reputation: 27498
You can perform a reflexive (i.e. self) join to obtain the result. In the case of a left join
, when the right table does not meet the on
condition the any right columns referenced will be null. This fact is used in the case
statement to count the number of v2
's that occurred in prior years.
proc sql;
create table want as
select
year, sum (case when BUTTER.v2 is not null then 1 else 0 end) / count (*) as ratio
from
have as PEANUT
left join
(select distinct v2 from have where year < 2017) as BUTTER
on
PEANUT.v2 = BUTTER.v2
where
year = 2017
group by
year
;
Upvotes: 0