Ezio_Auditore
Ezio_Auditore

Reputation: 43

Comparing values among rows and then deleting the duplicate values in SAS

I have a dataset which has certain variables , now I have to compare those variables among rows to find the  duplicate , but it has few extra asks .

         - if  final_amt matches the final_amt from next row and final_amt_added matches final_amt_added from next row we have to consider them as duplicate and delete only the case where the tran_date is blank.

          - Also if we are deleting the duplicate row with the blank tran_date , the actual_amt data from that row should be retained as a new variable act_amt_extra in the undeleted row.

I have tried multiple scenarios using lag and retain but the data is just not coming correctly, any help will be appreciated.

DATA work.sample;
INPUT acct_num test_id tran_date:anydtdte. actual_amt final_amt final_amt_added ;
format tran_date date9.;
DATALINES; 
55203610 2542 12-jan-20 30 45 45
16124130 8062 .         56 78 78 
16124130 8062 14-dec-19 8  78 78
80479512 2062 19-mar-19 32 32 32
70321918 2062 20-dec-19 1  93 54
17312410 6712 .         45 90 90
17312410 6712 15-jun-18 0  90 90
74623123 2092 17-aug-18 34 87 87
24245321 2082 22-jan-17 22 56 67
;
run;

data that I want

data want;
input acct_num test_id tran_date:anydtdte. actual_amt final_amt final_amt_added act_amt_extra;
format tran_date date9.;
DATALINES; 
55203610 2542 12-jan-20 30 45 45
16124130 8062 14-dec-19 8  78 78 56
80479512 2062 19-mar-19 32 32 32
70321918 2062 20-dec-19 1  93 54
17312410 6712 15-jun-18 0  90 90 45
74623123 2092 17-aug-18 34 87 87
24245321 2082 22-jan-17 22 56 67
;
run;

Upvotes: 0

Views: 649

Answers (2)

Lawrence
Lawrence

Reputation: 81

Here's a pretty elegant way to get your want:

proc sort data=sample;
    by acct_num tran_date;
run;

data want;
    set sample;
    by acct_num tran_date;
    act_amt_extra = lag(actual_amt);
    if first.acct_num then act_amt_extra = .;
    if tran_date = . then delete; 
run;

Upvotes: 0

Stu Sztukowski
Stu Sztukowski

Reputation: 12944

This may not be the most elegant way of handling it, but it will work so long as you only have a single duplicate as shown in your example.

The data we are most interested in is when the date is not missing. First, sort your data by acct_num test_id and descending tran_date:

proc sort data=sample;
    by acct_num test_id descending tran_date;
run;

This ensures that the data we care about is first:

acct_num    test_id tran_date   actual_amt  final_amt   final_amt_added
16124130    8062    14DEC2019   8           78          78
16124130    8062    .           56          78          78

We want to output only at the last combination of acct_num test_id. When there are duplicates, we want to pull forward the following values:

  • tran_date
  • actual_amt

Note that when there are duplicates and we want to pull values forward, we know the following:

  • last.test_id is true
  • first.test_id is false
  • tran_date is missing
  • final_amt matches the prior row's value
  • final_amt_added matches the prior row's value

When that case occurs, we'll run our logic. Otherwise, we'll leave the row as-is and output.

data want;
    set sample;
    by acct_num test_id descending tran_date;

    /* Store prior values */
    lag_actual_amt      = lag(actual_amt);
    lag_final_amt_added = lag(final_amt_added);
    lag_final_amt       = lag(final_amt);
    lag_tran_date       = lag(tran_date);

    /* Bring forward data if conditions are met */
    if( missing(tran_date)
        AND last.test_id 
        AND NOT first.test_id
        AND final_amt       = lag_final_amt
        AND final_amt_added = lag_final_amt_added
      )
    then do;
        act_amt_extra = actual_amt;

        /* Bring forward the prior values */
        tran_date     = lag_tran_date;
        actual_amt    = lag_actual_amt;
    end;

    /* Only output for the last combination of acct_num, test_id */
    if(last.test_id);

    drop lag:;
run;

Upvotes: 0

Related Questions