Reputation: 357
I have a data set of card history as below. For each customer, they may have applied for one or multiple cards on the same day. However, due to various reasons, their cards get replaced. Card issue date is when a card is issued. New Card ID is the ID for the replaced card. For example, for customer A, his card was firstly issue on 2/1/2017 and the card ID is 1234. 3 days later, he lost his card and a new card (1235) is issued on 5/2/2017.
Customer ID First Issue Date Card Issue Date Card ID New Card ID
A 2/1/2017 2/1/2017 1234 1235
A 2/1/2017 5/2/2017 1235
B 5/2/2017 5/2/2017 1245 1248
B 5/2/2017 5/2/2017 1236 1249
B 5/2/2017 10/3/2017 1248 1250
B 5/2/2017 5/3/2017 1249 1251
B 5/2/2017 10/4/2017 1250
B 5/2/2017 5/4/2017 1251
What I want is to group the original card and all the replacement together. For example, customer B applied for two cards on 5/2/217. Card ID 1245, 1248 and 1250 are in the same group (Seq No 1) and Card ID 1236, 1249 and 1251 are in the same group (Seq No 2).
Customer ID Open Date Card Issue Date Card ID Seq No
A 2/1/2017 2/1/2017 1234 1
A 2/1/2017 5/2/2017 1235 1
B 5/2/2017 5/2/2017 1245 1
B 5/2/2017 10/3/2017 1248 1
B 5/2/2017 10/4/2017 1250 1
B 5/2/2017 5/2/2017 1236 2
B 5/2/2017 5/3/2017 1249 2
B 5/2/2017 5/4/2017 1251 2
Please help me with this data transformation.
Here is the data step for the input file
data test;
infile datalines dsd truncover ;
input Customer:$1.
First_Issue_Date: ddmmyy10.
Card_Issue_Date: ddmmyy10.
Card_ID: $4.
New_Card_ID: $4. ;
format First_Issue_Date ddmmyy10. Card_Issue_Date ddmmyy10.;
datalines;
A,02/01/2017,02/01/2017,1234,1235,
A,02/01/2017,05/02/2017,1235,,
B,05/02/2017,05/02/2017,1245,1248,
B,05/02/2017,05/02/2017,1236,1249,
B,05/02/2017,10/03/2017,1248,1250,
B,05/02/2017,05/03/2017,1249,1251,
B,05/02/2017,10/04/2017,1250,,
B,05/02/2017,05/04/2017,1251,,
;
Upvotes: 1
Views: 65
Reputation: 27498
The DATA Step hash object is very effective for traversing paths in identity tracked data. Presuming every Card_ID is unique over all customers, and each New_Card_ID value has a corresponding Card_ID value in the data set, then this code will find unique path ids amongst the myriad of reissues.
data paths(keep=card_id path_id);
if 0 then set have; * prep pdv;
call missing (Path_ID);
* for tracking the tip of the card_id trail;
DECLARE HASH currentCard(hashexp: 9);
currentCard.defineKey ('Card_ID');
currentCard.defineData ('Card_ID', 'Path_ID');
currentCard.defineDone();
* for tracking everything up to the tip (segment);
DECLARE HASH replacedCard(hashexp:10);
replacedCard.defineKey ('New_Card_ID');
replacedCard.defineData('Card_ID');
replacedCard.defineDone();
* fill the two hashes;
do until (lastrow);
set have (keep=Card_ID New_Card_ID) end=lastrow;
if missing(New_Card_ID) then
Path_ID + 1;
if missing(New_Card_ID)
then currentCard.add();
else replacedCard.add();
end;
* for each tip of a path output the tip and all its segments;
declare hiter tipIter('currentCard');
do while (tipIter.next() = 0);
output; * tip;
do while (replacedCard.find(key:Card_ID) = 0);
output; * segment;
end;
end;
stop;
run;
If you really need Seq = 1..N
within Customer you will have to do additional sorting and merging.
My NESUG 2009 paper "Using HASH to find a sum over a transactional path" has a similar discussion about linked transactions.
Upvotes: 1
Reputation: 12465
What you are looking for is a Connected Component analysis. If you have it, PROC OPTNET
can give you what you want.
Unfortunately, it doesn't support a BY
statement, so you will have to generated the sequence number after you use it to group the cards.
First create the node, "to/from" data from your card data.
data nodes;
set test;
node = put(_n_,best12.);
from = card_id;
to = new_card_id;
if to = . then to=from;
run;
Then run the analysis.
proc optnet data_links=nodes out_nodes=nodes_out;
concomp;
run;
This generates a list of cards and their group (variable concomp
).
Join that group back to the original data and sort it.
proc sql noprint;
create table want as
select a.customer,
a.First_Issue_Date,
a.Card_Issue_Date,
a.Card_ID,
b.concomp
from test as a
left join
nodes_out as b
on a.card_id = b.node
order by customer, concomp, Card_Issue_Date;
quit;
Now the groups are just ordered 1, 2, ..., N. You want can use a Data Step to take that information and create the seq_no
data want(drop=concomp);
set want;
by customer concomp;
retain seq_no ;
if first.customer then
seq_no = 0;
if first.concomp then
seq_no = seq_no + 1;
run;
Upvotes: 0