Betsy G.
Betsy G.

Reputation: 11

SAS 9.4 Replacing all values after current line based on current values

I am matching files base on IDs numbers. I need to format a data set with the IDs to be matched, so that the same ID number is not repeated in column a (because column b's ID is the surviving ID after the match is completed). My list of IDs has over 1 million observations, and the same ID may be repeated multiple times in either/both columns.

Here is an example of what I've got/need:

Sample Data

ID1 ID2

1 2    
3 4    
2 5
6 1 
1 7 
5 8    

The surviving IDs would be:

2    
4    
5

error - 1 no longer exists error - 1 no longer exists 8

WHAT I NEED

ID1 ID2

1 2    
3 4    
2 5    
6 5
5 7
7 8

I am, probably very obviously, a SAS novice, but here is what I have tried, re-running over and over again because I have some IDs that are repeated upward of 50 times or more.

Proc sort data=Have;    
    by ID1;    
run;

This sort makes the repeated ID1 values consecutive, so the I could use LAG to replace the destroyed ID1s with the surviving ID2 from the line above.

Data Want;
    set Have;
        by ID1;
    lagID1=LAG(ID1);  
    lagID2=LAG(ID2); 
    If NOT first. ID1 THEN DO;  
        If ID1=lagID1 THEN ID1=lagID2; 
        KEEP ID1 ID2;
        IF ID1=ID2 then delete;
   end;
run;

That sort of works, but I still end up with some that end up with duplicates that won't resolve no matter how many times I run (I would have looped it, but I don't know how), because they are just switching back and forth between IDs that have other duplicates (I can get down to about 2,000 of these).

I have figured out that instead of using LAG, I need replace all values after the current line with ID2 for each ID1 value, but I cannot figure out how to do that.

I want to read observation 1, find all later instances of the value of ID1, in both ID1 or ID2 columns, and replace that value with the current observation's ID2 value. Then I want to repeat that process with line 2 and so on.

For the example, I would want to look for any instances after line one of the value 1, and replace it with 2, since that is the surviving ID of that pair - 1 may appear further down multiple times in either of the columns, and I need all them to replaced. Line two would look for later values of 3 and replace them with 4, and so one. The end result should be that an ID number only appears once ever in the ID1 column (though it may appear multiple times in the ID2 column).

ID1 ID2

1 2    
3 4    
2 5
6 1 
1 7 
5 8 

After first line has been read, data set would look as follows: ID1 ID2

1 2    
3 4    
2 5
6 2 
2 7 
5 8 

Reading observation two would make no changes since 3 does not appear again; after observation 3, the set would be:

ID1 ID2

1 2    
3 4    
2 5
6 5 
5 7 
5 8 

Again, there would be not changes from observation four. but observation 5 would cause the final change:

ID1 ID2

1 2    
3 4    
2 5
6 5 
5 7 
7 8 

I have tried using the following statement but I can't even tell if I am on the complete wrong track or if I just can't get the syntax figured out.

Data want;
Set have;
      Do i=_n_;
          ID=ID2;
          Replace next var{EUID} where (EUID1=EUID1 AND EUID2=EUID1);
      End;
Run;

Thanks for your help!

Upvotes: 1

Views: 783

Answers (4)

Tom
Tom

Reputation: 51566

There is no need to work back and forth thru the data file. You just need to retain the replacement information so that you can process the file in a single pass.

One way to do that is to make a temporary array using the values of the ID variables as the index. That is easy to do for your simple example with small ID values.

So for example if all of the ID values are integers between 1 and 1000 then this step will do the job.

data want ;
  set have ;
  array xx (1000) _temporary_;
  do while (not missing(xx(id1))); id1=xx(id1); end;
  do while (not missing(xx(id2))); id2=xx(id2); end;
  output;
  xx(id1)=id2;
run;

You probably need to add a test to prevent cycles (1 -> 2 -> 1).

For a more general solution you should replace the array with a hash object instead. So something like this:

data want ;
  if _n_=1 then do;
    declare hash h();
    h.definekey('old');
    h.definedata('new');
    h.definedone();
    call missing(new,old);
  end;
  set have ;
  do while (not h.find(key:id1)); id1=new; end;
  do while (not h.find(key:id2)); id2=new; end;
  output;
  h.add(key: id1,data: id2);
  drop old new;
run;

Upvotes: 1

DCR
DCR

Reputation: 15657

this might be a little faster:

data have2;
input id1 id2;
datalines;
1 2    
3 4    
2 5
6 1 
1 7 
5 8 
;
run;

%macro test2();
   proc sql noprint;
      select count(*) into: cnt
      from have2;
   quit;

   %do i = 1 %to &cnt;
      proc sql noprint;
         select id1,id2 into: id1, :id2
         from have2
         where monotonic() = &i;

         update have2 set id1 = &id2         
         where monotonic() > &i
         and id1 = &id1;
      quit;
      proc sql noprint;
         update have2 set id2 = &id2         
         where monotonic() > &i
         and id2 = &id1;
      quit;
%end;
%mend test2;
%test2();

Upvotes: 0

DCR
DCR

Reputation: 15657

Seems like this should do the trick and is fairly straight forward. Let me know if it is what you are looking for:

data have;
input id1 id2;
datalines;
1 2    
3 4    
2 5
6 1 
1 7 
5 8 
;
run;

%macro test();
  proc sql noprint;
     select count(*) into: cnt
     from have;
  quit;

  %do i = 1 %to &cnt;
     proc sql noprint;
        select id1,id2 into: id1, :id2
        from have
        where monotonic() = &i;quit;

     data have;
     set have;
     if (_n_ > input("&i",8.))then do;
        if (id1 = input("&id1",8.))then id1 = input("&id2",8.);
        if (id2 = input("&id1",8.))then id2 = input("&id2",8.);
     end;
     run;        
  %end;
%mend test;
%test();

Upvotes: 0

user667489
user667489

Reputation: 9569

Here's an implementation of the algorithm you've suggested, using a modify statement to load and rewrite each row one at a time. It works with your trivial example but with messier data you might get duplicate values in ID1.

data have;
input ID1 ID2 ;
datalines;
1 2    
3 4    
2 5
6 1 
1 7 
5 8 
;
run;

title "Before making replacements";
proc print data = have;
run;

/*Optional - should improve performance at cost of increased memory usage*/
sasfile have load;

data have;
    do i = 1 to nobs;
        do j = i to nobs;
            modify have point = j nobs = nobs;
            /* Make copies of target and replacement value for this pass */
            if j = i then do;
                id1_ = id1;
                id2_ = id2;
            end;
            else do;
                flag = 0; /* Keep track of whether we made a change */
                if id1 = id1_ then do;
                    id1 = id2_;
                    flag = 1;
                end;
                if id2 = id1_ then do;
                    id2 = id2_;
                    flag = 1;
                end;
                if flag then replace; /* Only rewrite the row if we made a change */                
            end;
        end;
    end;
    stop;
run;

sasfile have close;

title "After making replacements";
proc print data = have;
run;

Please bear in mind that as this modifies the dataset in place, interrupting the data step while it is running could result in data loss. Make sure you have a backup first in case you need to roll your changes back.

Upvotes: 0

Related Questions