pinegulf
pinegulf

Reputation: 1396

SAS multiple dataset and Coalesce function

Good day,

I have old data with different format and updating it with new. Idea is that variable WantedName had been written with underscore Wanted_Name whereas new data is without this. No big deal I though.

EDIT: WantedName and Wanted_Name are SAS dates.

Original data:

year Wanted_Name 
2013  1234       
2013  4321       
2013  3241       

and from year 2015 forward:

year WantedName 
2015 5678     
2015 8765
....

I tried the logic:

%macro macro_env;
data _null_; 
    call symput ("curr_year", year(date()) ); 
run;

data long_data;
    set 
    %do year=2013 %to &curr_year.;
        data.history&year.
    %end;
    ;
    WantedName=COALESCE(WantedName, Wanted_Name);
run;
%mend macro_env;

Nope. Did not work. For some reason it took the first value of Wanted_Name and pasted that over the whole data range except for the new data:

year Wanted_Name WantedName
2013  1234        1234
2013  4321        1234
2013  3241        1234
....
2015  .           5678     
2015  .           8765

Now I managed to fix the issue by removing the coalesce function and adding a additional data statement:

data long_data;
    set 
    %do year=2013 %to &curr_year.;
        data.history&year.
    %end;
    ;
run;

data long_data;
    set long_data;
    WantedName=COALESCE(WantedName, Wanted_Name);
run;

Question: What happens or why isn't the original macro_env working?

I thought datasets in set were first loaded and then function applied. (Which is working in the latter data statement. Apparently not... Maybe?

Upvotes: 3

Views: 1581

Answers (3)

Joe
Joe

Reputation: 63424

I think a better way to do this is to handle it with rename in the set statement.

data long_data;
    set 
    %do year=2013 %to &curr_year.;

        data.history&year.
          %if &year lt 2015 %then %do;
           (rename=wanted_name=wantedname)
          %end;
    %end;
    ;
run;

Cheaper (rename is cheaper than coalesce function) and simpler.

Upvotes: 2

Tom
Tom

Reputation: 51566

As answer form @Allan Bowe explains the issue is that variables sourced for datasets are automatically retained. So you could add code to clear them.

data long_data;
  set data.history20: ; 
  WantedName=COALESCE(WantedName, Wanted_Name);
  output;
  call missing(wantedname,wanted_name);
run;

Or make a NEW variable.

data long_data;
  set data.history20: ; 
  new_WantedName=COALESCE(WantedName, Wanted_Name);
  format new_WantedName date9. ;
  drop WantedName Wanted_Name ;
  rename new_WantedName = WantedName ;
run;

Upvotes: 4

Allan Bowe
Allan Bowe

Reputation: 12691

The reason is that the pdv implicitly retains variables that have been "set", and wantedName is not being flushed from the pdv when the earlier datasets (pre 2015) are being loaded.

The compiler made a space for it (by reading all the datasets in the set statement) but the set command does not (initially) replace the value of wantedName.

So - when the first observation is read, wantedname is missing and the first value of wanted_name is applied.

WantedName=COALESCE(., 1234); /* obs 1 */

In the second iteration, the value of wantedname has been retained, and so is used for every subsequent iteration until a dataset is read in that contains wantedname.

WantedName=COALESCE(1234, 4321); /* obs 2 */

The second example uses a dataset from the beginning that has the wantedname variable.

Great question!

Upvotes: 2

Related Questions