Cypress
Cypress

Reputation: 357

Complex data restructure issue in SAS

I have a data set of card history as below. For each customer, they may have applied for one or multiple cards on the same day. However, due to various reasons, their cards get replaced. Card issue date is when a card is issued. New Card ID is the ID for the replaced card. For example, for customer A, his card was firstly issue on 2/1/2017 and the card ID is 1234. 3 days later, he lost his card and a new card (1235) is issued on 5/2/2017.

Customer ID First Issue Date    Card Issue Date Card ID New Card ID
A   2/1/2017    2/1/2017    1234    1235
A   2/1/2017    5/2/2017    1235     
B   5/2/2017    5/2/2017    1245    1248
B   5/2/2017    5/2/2017    1236    1249
B   5/2/2017    10/3/2017   1248    1250
B   5/2/2017    5/3/2017    1249    1251
B   5/2/2017    10/4/2017   1250     
B   5/2/2017    5/4/2017    1251    

 

What I want is to group the original card and all the replacement together. For example, customer B applied for two cards on 5/2/217. Card ID 1245, 1248 and 1250 are in the same group (Seq No 1) and Card ID 1236, 1249 and 1251 are in the same group (Seq No 2).

Customer ID Open Date   Card Issue Date Card ID Seq No
A   2/1/2017    2/1/2017    1234    1
A   2/1/2017    5/2/2017    1235    1
B   5/2/2017    5/2/2017    1245    1
B   5/2/2017    10/3/2017   1248    1
B   5/2/2017    10/4/2017   1250    1
B   5/2/2017    5/2/2017    1236    2
B   5/2/2017    5/3/2017    1249    2
B   5/2/2017    5/4/2017    1251    2

Please help me with this data transformation.

Here is the data step for the input file

data test;
infile datalines dsd truncover ;
input Customer:$1.
First_Issue_Date: ddmmyy10.
Card_Issue_Date: ddmmyy10.
Card_ID: $4.
New_Card_ID: $4. ;
format First_Issue_Date ddmmyy10. Card_Issue_Date ddmmyy10.;
datalines;
A,02/01/2017,02/01/2017,1234,1235,
A,02/01/2017,05/02/2017,1235,,
B,05/02/2017,05/02/2017,1245,1248,
B,05/02/2017,05/02/2017,1236,1249,
B,05/02/2017,10/03/2017,1248,1250,
B,05/02/2017,05/03/2017,1249,1251,
B,05/02/2017,10/04/2017,1250,,
B,05/02/2017,05/04/2017,1251,,
;

Upvotes: 1

Views: 65

Answers (2)

Richard
Richard

Reputation: 27498

The DATA Step hash object is very effective for traversing paths in identity tracked data. Presuming every Card_ID is unique over all customers, and each New_Card_ID value has a corresponding Card_ID value in the data set, then this code will find unique path ids amongst the myriad of reissues.

data paths(keep=card_id path_id);
  if 0 then set have; * prep pdv;

  call missing (Path_ID);

  * for tracking the tip of the card_id trail;
  DECLARE HASH currentCard(hashexp: 9);
  currentCard.defineKey ('Card_ID');
  currentCard.defineData ('Card_ID', 'Path_ID');
  currentCard.defineDone();

  * for tracking everything up to the tip (segment);
  DECLARE HASH replacedCard(hashexp:10);
  replacedCard.defineKey ('New_Card_ID');     
  replacedCard.defineData('Card_ID');
  replacedCard.defineDone();

  * fill the two hashes;
  do until (lastrow);
    set have (keep=Card_ID New_Card_ID) end=lastrow;

    if missing(New_Card_ID) then 
      Path_ID + 1;

    if missing(New_Card_ID)
      then currentCard.add();
      else replacedCard.add(); 
  end;

  * for each tip of a path output the tip and all its segments;
  declare hiter tipIter('currentCard');
  do while (tipIter.next() = 0);
    output; * tip;

    do while (replacedCard.find(key:Card_ID) = 0);
      output; * segment;
    end;
  end;

  stop;
run;

If you really need Seq = 1..N within Customer you will have to do additional sorting and merging.

My NESUG 2009 paper "Using HASH to find a sum over a transactional path" has a similar discussion about linked transactions.

Upvotes: 1

DomPazz
DomPazz

Reputation: 12465

What you are looking for is a Connected Component analysis. If you have it, PROC OPTNET can give you what you want.

Unfortunately, it doesn't support a BY statement, so you will have to generated the sequence number after you use it to group the cards.

First create the node, "to/from" data from your card data.

data nodes;
set test;
node = put(_n_,best12.);
from = card_id;
to = new_card_id;
if to = . then to=from;
run;

Then run the analysis.

proc optnet data_links=nodes out_nodes=nodes_out;
concomp;
run;

This generates a list of cards and their group (variable concomp).

Join that group back to the original data and sort it.

proc sql noprint;
create table want as
select a.customer,
       a.First_Issue_Date,
       a.Card_Issue_Date,
       a.Card_ID,
       b.concomp
    from test as a
      left join
         nodes_out as b
    on a.card_id = b.node
    order by customer, concomp, Card_Issue_Date;
quit;

Now the groups are just ordered 1, 2, ..., N. You want can use a Data Step to take that information and create the seq_no

data want(drop=concomp);
set want;
by customer concomp;
retain seq_no ;

if first.customer then
    seq_no = 0;
if first.concomp then
    seq_no = seq_no + 1;
run; 

Upvotes: 0

Related Questions