Reputation: 13
I have a following data set
Id Site Date
1 A 01-01-2021
1 A 02-02-2021
1 A 08-06-2021
1 A 09-06-2021
2 B 03-03-2021
2 B 08-05-2021
2 B 10-07-2021
2 B 15-07-2021
In the above data set I want to remove duplicates based on several condition- 1.If date is less than 08JUNE2021 AND Site is A then keep the id with most recent date
2.If date is greater than 08June2021 and site is A then keep id with most recent date
3.If date is less than 09JULY2021 and site is B then keep id with most recent date
4.If date is greater than 09JULY2021 and site is B keep id with most recent date
And after the condition data set which I want is given below
Id site Date
1 A 02-02-2021
1 A 09-06-2021
2 B 08-05-2021
2 B 15-07-2021
I had tried the code for the data
proc sql; create table want as select * from (select ,count() as count, case when count()>1 and Date>"07JUN2021"d then 1 when count()>1 and Date<"07JUN2021"d then 2 else count(*) end as flag from z group by id) where site="A"; quit; run;
Upvotes: 1
Views: 720
Reputation: 63424
One approach I like for this is to create a 'group' variable that allows you to identify groupings. Here I show doing that with an informat. This is a good way to do it if you always only have two groups ('before' date and 'after' date); if you have more than two groups, there are other options. Also note I assume you mean less than and (not less than); if you actually want to explicitly exclude June 8th records for Site A, then this will need some adaptation.
data have;
input Id Site $ Date :ddmmyy10.;
format date date9.;
datalines;
1 A 01-01-2021
1 A 02-02-2021
1 A 08-06-2021
1 A 09-06-2021
2 B 03-03-2021
2 B 08-05-2021
2 B 10-07-2021
2 B 15-07-2021
;;;;
run;
*Create an informat that maps each site to the pivot date;
proc format;
invalue site_datei
'A' = '08JUN2021'd
'B' = '09JUL2021'd
;
run;
* Now compare the date to the pivot date for that site, and assign group based on that result;
data for_sort;
set have;
group = ifn(date lt input(site,site_datei.),1,2);
run;
Now you can do many things to get the max value per group - here is a SQL solution since you seem to prefer that.
proc sql;
select id, site, group, max(date) format=date9. from for_sort group by id, site, group;
quit;
This could have been done in a single SQL query, but I recommend against that unless there's significant performance considerations (and in that case, I even MORE strongly recommend against SQL!).
Upvotes: 1