losman

Reputation: 11

SAS Updating records sequentially

I have hundreds of thousands of IDs in a large dataset.

Some records share the same ID but have different data points, and some distinct IDs need to be merged into a single ID: a person who registered for the system more than once should appear as just one person in the database.

I also have a separate file that tells me which IDs need to be merged, but it's not always a one-to-one relationship. For example, in many cases I have x->y and then y->z because they registered three times. I had a macro that essentially was the following set of if-then statements:

if ID='1111111' then do; ID='2222222'; end; 
if ID='2222222' then do; ID='3333333'; end;

I believe SAS runs this one record at a time. My list of merged IDs is almost 15k long, so it takes forever to run and the list just gets longer. Is there a faster method of updating these IDs?

Thanks

EDIT: Here is an example of the situation, except the macro is over 15k lines long due to all the merges.

data one; 
input ID $5. v1 $ v2 $;
cards;
11111 a b
11111 c d
22222 e f
33333 g h
44444 i j
55555 k l
66666 m n
66666 o p
;
run;

%macro ID_Change;
if ID='11111' then do; ID='77777'; end; *77777 is a brand new ID;
if ID='22222' then do; ID='88888'; end; *88888 is a new ID but is merged below;
if ID='88888' then do; ID='99999'; end; *99999 becomes the newer ID;
%mend;

data two; set one; %ID_Change; run;

Upvotes: 1

Views: 188

Answers (2)

Stu Sztukowski

Reputation: 12849

A hash table will greatly speed up the process. Hash tables are one of the little-used but highly effective tools in SAS. The syntax looks unusual at first because it differs from standard data step programming; for now, think of a hash table as a way to merge data together in memory, which is a big reason why it is so fast.

First, create a dataset that has the conversions that you need. We want to match up by ID, then convert it to New_ID. Consider ID as your key column, and New_ID as your data column.

dataset: translate

ID     New_ID
111111 222222
222222 333333
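
For the question's example data, one way to build this lookup table is with datalines; the three rows below mirror the three if-then statements in the %ID_Change macro, with the 88888 -> 99999 row creating the chain:

data translate;
    input ID :$5. New_ID :$5.;
    datalines;
11111 77777
22222 88888
88888 99999
;
run;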

In a hash table, you need to consider two things:

  1. The Key column(s)
  2. The Data column(s)

The data column holds the value that is retrieved whenever the key column matches. In other words, New_ID will be populated every time there is a match on ID.

Next, you'll want to do your hash merge. This is performed in the data step.

data want;
     set have;

     /* Only declare the hash object on the first iteration. 
        Otherwise it will do this every record. */
     if(_N_ = 1) then do;
           declare hash id_h(dataset: 'translate'); *Declare a hash object called 'id_h';
           id_h.defineKey('ID');                    *Define key for matching;
           id_h.defineData('New_ID');               *The new ID after matching;
           id_h.defineDone();                       *Done declaring this hash object;
           call missing(New_ID);                    *Prevents an uninitialized-variable note in the log;
     end;

    /* If a customer has changed multiple times, keep iterating until 
       there is no longer a match between tables */
    do while(id_h.Find() = 0);

        _loop_count+1; *Tells us how long we've been in the loop;

       /* Safety valve: if the while loop ever reaches 500 iterations, there is
          likely a cycle in the lookup table and you don't want the data step
          to get stuck in an infinite loop */
       if(_loop_count > 500) then do;
            put 'WARNING: ' ID ' iterated 500 times. The loop will stop. Check observation ' _N_;
            leave;
       end; 

        /* If the ID of the hash table matches the ID of the dataset, then
           we'll set ID to be New_ID from the hash object */
        ID = New_ID; 
    end;

    _loop_count = 0;

    drop _loop_count New_ID;
run;

This should run very quickly and provide the desired output, assuming that your lookup table is coded in the way that you need it to be.
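
As a quick check with the sample data from the question (using set one; in place of set have;) and the translate table sketched above, the output should show 11111 mapped to 77777 and 22222 chained through 88888 to 99999, with the remaining IDs passing through unchanged:

proc print data=want;
run;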

Upvotes: 2

C8H10N4O2

Reputation: 18995

  1. Use PROC SQL or a MERGE step against your separate file (after reading it into a SAS dataset with INFILE or PROC IMPORT) to append this unique ID to all records. If your separate file contains only the duplicates, you will need to create a dummy unique ID for the non-duplicates.

  2. Do PROC SORT with BY unique id and timestamp of signup.

  3. Use a DATA step with the same BY variables. Depending on whether you want to keep the first or last signup, use if first.unique_id then output; (or last.unique_id, etc.); a sketch of all three steps follows this list.
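
A minimal sketch of those three steps, assuming the main data sits in a dataset called signups with a signup timestamp variable signup_ts, and the separate file has already been read into lookup with ID and New_ID columns (all placeholder names, none of them from the question):

proc sql;
    create table tagged as
    select coalesce(x.New_ID, s.ID) as unique_id, /* dummy unique ID falls back to the original ID when there is no match */
           s.*
    from signups as s
    left join lookup as x
        on s.ID = x.ID;
quit;

proc sort data=tagged;
    by unique_id signup_ts;
run;

data want;
    set tagged;
    by unique_id signup_ts;
    if first.unique_id; *keep the earliest signup per person;
run;

Note that a single join resolves only one level of reassignment; chained merges like x->y->z would need either a pre-flattened lookup table or the iterative hash lookup from the other answer.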

Or you could do it all in one PROC SQL using a left join to the separate file, a coalesce step to return a dummy unique id if it is not contained in the separate file, a group by unique id, and a having max(timestamp) (or min). You can also coalesce any other variables you might want to try to preserve across signups -- for example, if the first signup contained a phone number and successive signups were missing that data point.
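
Under the same placeholder names, the one-step version might look like the following; the HAVING clause relies on PROC SQL's summary-statistic remerging to keep each person's latest signup:

proc sql;
    create table want as
    select coalesce(x.New_ID, s.ID) as unique_id,
           s.*
    from signups as s
    left join lookup as x
        on s.ID = x.ID
    group by coalesce(x.New_ID, s.ID)
    having s.signup_ts = max(s.signup_ts);
quit;

If two signups share the same maximum timestamp, both rows survive the HAVING filter, so a final dedup step may still be needed.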

Without a reproducible example it's hard to be more specific.

Upvotes: 0
