ckp
ckp

Reputation: 593

Removing rows between two values in SAS

For the following data I am trying to filter rows, of each group ID, based on these conditions:

Can we create the flag variable as shown in my want dataset? so that I can filter on Flag='Y'?

Have

  ID     Type     Date         Value
  001     F       1/2/2018      Y
  001     B       1/3/2018
  001     B       1/4/2018      Y
  001     B       1/5/2018
  001     B       1/6/2018
  001     F       1/6/2018      Y
  001     B       1/6/2018      
  001     B       1/7/2018
  001     B       1/8/2018      Y
  001     B       1/8/2018
  001     B       1/9/2018
  002     F       1/2/2018      Y
  002     B       1/3/2018
  002     B       1/4/2018

Want

  ID     Type     Date         Value   Flag
  001     F       1/2/2018      Y       Y
  001     B       1/3/2018              Y
  001     B       1/4/2018      Y       Y 
  001     B       1/5/2018
  001     B       1/6/2018
  001     F       1/6/2018      Y       Y
  001     B       1/6/2018              Y
  001     B       1/7/2018              Y
  001     B       1/8/2018      Y       Y 
  001     B       1/8/2018
  001     B       1/9/2018
  002     F       1/2/2018      Y       Y
  002     B       1/3/2018              Y
  002     B       1/4/2018              Y

I tried to do the following

data F;
set have;
where Type='F';run;

data B;
 set have;
 where Type='B';run;

 proc sql;
  create table all as select
  a.* from B as b
  inner join F as f
  on a.id=b.id
  and b.date >= a.date;
quit;

This includes all the rows from my have dataset. Any help is much appreciated.

Upvotes: 3

Views: 1523

Answers (3)

Richard
Richard

Reputation: 27526

The criteria for computing the state of a row as part of a contiguous sub-group (call it a 'run' of rows) within group ID are relatively simple, but a compromised state might occur or be indicated if some funny cases of data occur:

  • two or more B Y before a F Y (extra 'run ending')
  • two or more F Y before a B Y ('run starting' within a run)
  • first row in group not F Y ('run starting' not first in group)
data want(drop=run_:);
  SET have;
  BY id;

  run_first = (type='F' and value='Y');
  run_final = (type='B' and value='Y');

  * set flag state at criteria for start of contiguous sub-group criteria;
  run_flag + run_first;

  if first.id and NOT run_flag then
    put 'WARNING: first row in group ' id= ' is not F Y, this may be incorrect';

  if run_flag > 1 and run_first then 
    put 'WARNING: an additional F Y before a B Y at row ' _n_;

  if run_flag then
    OUTPUT;

  if run_flag = 0 and run_final then 
    put 'WARNING: an additional B Y before a F Y at row ' _n_;

  * reset flag at criteria for contiguous sub-group;
  if last.id or run_final then 
    run_flag = 0;
run;

Upvotes: 1

Pin
Pin

Reputation: 33

I have a solution but it is not the most elegant (and might not cover corner cases.) If anyone else has a better solution please share.

First, to create the dataset in-case anyone else want to try it out:

Data work.have;
  input @01 ID 3.
        @05 Type $1.
        @07 Date date7.
        @18 Value $1.;
  format ID 3.
         Type $1.
         Date date11.
         Value $1.; 
  datalines;
001 F '02Jan18'n Y
001 B '03Jan18'n  
001 B '04Jan18'n Y
001 B '05Jan18'n  
001 B '06Jan18'n 
001 F '06Jan18'n Y
001 B '06Jan18'n 
001 B '07Jan18'n 
001 B '08Jan18'n Y
001 B '08Jan18'n 
001 B '09Jan18'n 
002 F '02Jan18'n Y
002 B '03Jan18'n 
002 B '04Jan18'n 
;
run;

Solution: I based on your edited suggestion of creating a flag variable.

Data Flag;
  set work.have;

  if Type = 'B' and Value = 'Y' then
    flag + 1;
  if Type = 'F' then
    flag = 0;

  if Value ne 'Y' and flag = 1 then delete;
run;

The flag variable is 0 by default.

The first IF-Then condition identifies the Type B ='Y' rows and flag them as 1, as well as retaining this flag for the subsequent rows.

The second IF-Then condition identifies the type='F' row and resets the Flag to 0

The Last If-Then condition drops all rows with Flag=1 except the first occurrence which are the Type B ='Y' rows.

I hope this applies to your problem.

Upvotes: 1

Pin
Pin

Reputation: 33

Same as Richard, I don't quite understand what the filtering criteria are.

I could see one problem with your join. you used a.* in your select statement, but "b" and "f" as your dataset aliases. this would not work as no dataset have been assigned to alias "a".

Proper way would be as follow:

proc sql;
  create table all as 
  select b.* from B as b
  inner join F as f
  on b.id=f.id
  and b.date >= f.date;
quit;

However, even then, I don't believe inner join is the proper way to solve your problem. Do let us your filtering condition please?

Upvotes: 1

Related Questions