Sachin
Sachin

Reputation: 13

Removing duplicates using various condition in SAS

I have a following data set

Id  Site    Date
1   A   01-01-2021
1   A   02-02-2021
1   A   08-06-2021
1   A   09-06-2021
2   B   03-03-2021
2   B   08-05-2021
2   B   10-07-2021
2   B   15-07-2021

In the above data set I want to remove duplicates based on several condition- 1.If date is less than 08JUNE2021 AND Site is A then keep the id with most recent date

2.If date is greater than 08June2021 and site is A then keep id with most recent date

3.If date is less than 09JULY2021 and site is B then keep id with most recent date

4.If date is greater than 09JULY2021 and site is B keep id with most recent date

And after the condition data set which I want is given below

Id  site    Date
1   A   02-02-2021
1   A   09-06-2021
2   B   08-05-2021
2   B   15-07-2021

I had tried the code for the data

proc sql; create table want as select * from (select ,count() as count, case when count()>1 and Date>"07JUN2021"d then 1 when count()>1 and Date<"07JUN2021"d then 2 else count(*) end as flag from z group by id) where site="A"; quit; run;

Upvotes: 1

Views: 720

Answers (1)

Joe
Joe

Reputation: 63424

One approach I like for this is to create a 'group' variable that allows you to identify groupings. Here I show doing that with an informat. This is a good way to do it if you always only have two groups ('before' date and 'after' date); if you have more than two groups, there are other options. Also note I assume you mean less than and (not less than); if you actually want to explicitly exclude June 8th records for Site A, then this will need some adaptation.

data have;
input Id  Site $ Date :ddmmyy10.;
format date date9.;
datalines;
1   A   01-01-2021
1   A   02-02-2021
1   A   08-06-2021
1   A   09-06-2021
2   B   03-03-2021
2   B   08-05-2021
2   B   10-07-2021
2   B   15-07-2021
;;;;
run;

*Create an informat that maps each site to the pivot date;
proc format;
  invalue site_datei
  'A' = '08JUN2021'd
  'B' = '09JUL2021'd
  ;
run;

* Now compare the date to the pivot date for that site, and assign group based on that result;
data for_sort;
  set have;
  group = ifn(date lt input(site,site_datei.),1,2);
run;

Now you can do many things to get the max value per group - here is a SQL solution since you seem to prefer that.

proc sql;
  select id, site, group, max(date) format=date9. from for_sort group by id, site, group;
quit;

This could have been done in a single SQL query, but I recommend against that unless there's significant performance considerations (and in that case, I even MORE strongly recommend against SQL!).

Upvotes: 1

Related Questions