Reputation: 11
I have hundreds of thousands of IDs in a large dataset.
Some records have the same ID but different data points, and some of these IDs need to be merged into a single ID: people who registered for the system more than once should appear as just one person in the database.
I also have a separate file that tells me which IDs need to be merged, but it's not always a one-to-one relationship. For example, in many cases I have x->y and then y->z because someone registered three times. I had a macro that was essentially the following set of if-then statements:
if ID='1111111' then do; ID='2222222'; end;
if ID='2222222' then do; ID='3333333'; end;
I believe SAS runs this one record at a time. My list of merged IDs is almost 15k long, so it takes forever to run and the list just gets longer. Is there a faster method of updating these IDs?
Thanks
EDIT: Here is an example of the situation, except the macro is over 15k lines long due to all the merges.
data one;
input ID $5. v1 $ v2 $;
cards;
11111 a b
11111 c d
22222 e f
33333 g h
44444 i j
55555 k l
66666 m n
66666 o p
;
run;
%macro ID_Change;
if ID='11111' then do; ID='77777'; end; *77777 is a brand new ID;
if ID='22222' then do; ID='88888'; end; *88888 is a new ID but is merged below;
if ID='88888' then do; ID='99999'; end; *99999 becomes the newer ID;
%mend;
data two; set one; %ID_Change; run;
Upvotes: 1
Views: 188
Reputation: 12849
A hash table will greatly speed up the process. Hash tables are one of the little-used, but highly effective, tools in SAS. They're a bit bizarre since the syntax is very different from standard SAS programming. For now, think of a hash table as a way to merge data together in-memory, which is a big reason why it's so fast.
First, create a dataset that has the conversions that you need. We want to match up by ID, then convert it to New_ID. Consider ID as your key column and New_ID as your data column.
dataset: translate

ID      New_ID
111111  222222
222222  333333
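For a runnable example, this lookup table could be created inline like the step below (in practice you would build translate by reading your merge file with infile or proc import):

data translate;
    input ID :$6. New_ID :$6.;
    cards;
111111 222222
222222 333333
;
run;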
In a hash table, you need to consider two things: the Key column, which is what you match on, and the Data column, which is what will replace values matched by the Key column. In other words, New_ID will be populated every time there's a match on ID.
Next, you'll want to do your hash merge. This is performed in the data step.
data want;
    set have;

    /* Only declare the hash object on the first iteration;
       otherwise it would be redeclared on every record. */
    if(_N_ = 1) then do;
        if(0) then set translate;                *Creates New_ID in the PDV with the correct type and length;
        declare hash id_h(dataset: 'translate'); *Declare a hash object called 'id_h';
        id_h.defineKey('ID');                    *Define key for matching;
        id_h.defineData('New_ID');               *The new ID after matching;
        id_h.defineDone();                       *Done declaring this hash object;
        call missing(New_ID);                    *Prevents an uninitialized-variable note in the log;
    end;

    /* If a customer has changed multiple times, keep iterating until
       there is no longer a match between tables */
    do while(id_h.Find() = 0);
        _loop_count+1; *Tells us how long we've been in the loop;

        /* If the while loop reaches 500 iterations, there is likely a
           cycle in the lookup table and you don't want the data step to get stuck */
        if(_loop_count > 500) then do;
            put 'WARNING: ' ID ' iterated 500 times. The loop will stop. Check observation ' _N_;
            leave;
        end;

        /* The ID of the hash table matched the ID of the dataset, so
           set ID to be New_ID from the hash object and look it up again */
        ID = New_ID;
    end;

    _loop_count = 0;

    drop New_ID _loop_count;
run;
This should run very quickly and provide the desired output, assuming that your lookup table is coded in the way that you need it to be.
Upvotes: 2
Reputation: 18995
Use PROC SQL or a MERGE step against your separate file (after you have created a separate dataset from it, using infile or proc import) to append this unique id to all records. If your separate file contains only the duplicates, you will need to create a dummy unique id for the non-duplicates.
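As a sketch of the MERGE approach, assume the signups are in a dataset called signups and the separate file has been read into id_map with character columns ID and unique_id (all of these names are hypothetical):

proc sort data=signups; by ID; run;
proc sort data=id_map;  by ID; run;

data signups_keyed;
    merge signups(in=a) id_map(in=b);
    by ID;
    if a;                         /* keep every signup record */
    if not b then unique_id = ID; /* dummy unique id for the non-duplicates */
run;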
Do PROC SORT with BY unique id and timestamp of signup.
Use a DATA step with the same BY variables. Depending on whether you want to keep the first or last signup, do if first.unique_id then output; (or last.unique_id, etc.), where unique_id is the first BY variable.
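Continuing that sketch with a hypothetical timestamp variable signup_ts, the sort and dedupe steps might look like this:

proc sort data=signups_keyed;
    by unique_id signup_ts;
run;

data deduped;
    set signups_keyed;
    by unique_id signup_ts;
    if first.unique_id; /* earliest signup per person; use last.unique_id for the latest */
run;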
Or you could do it all in one PROC SQL, using a left join to the separate file, a coalesce to return a dummy unique id if an ID is not contained in the separate file, a group by on the unique id, and a having max(timestamp) (or min). You can also coalesce any other variables you might want to preserve across signups -- for example, if the first signup contained a phone number and successive signups were missing that data point.
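A sketch of that all-in-one query, using the same hypothetical names as above (note that SAS will remerge the max(signup_ts) summary statistic back onto the detail rows):

proc sql;
    create table deduped as
    select coalesce(m.unique_id, s.ID) as uid, /* dummy unique id when there is no match */
           s.*
    from signups as s
    left join id_map as m
        on s.ID = m.ID
    group by coalesce(m.unique_id, s.ID)
    having s.signup_ts = max(s.signup_ts); /* or min(signup_ts) for the earliest signup */
quit;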
Without a reproducible example it's hard to be more specific.
Upvotes: 0