RuneS
RuneS

Reputation: 217

Missing values and assignment with multiple datasets in set statement

I've got some weird results I don't quite understand. I create a data set in a data step, using several data sets in the set statement. There is a variable that is present in some of the datasets, but not in all of them. If this variable is missing in my new dataset, I want to give it some value. This creates a dangerously non-intuitive result and no warnings or errors.

In the example below, y is not present in test1. When creating test3, the behavior is as I would expect: z is assigned the x value from the same row for all the observations coming from test1. But test4 is not what i expect: the first value of x is repeated for all the rows from test1. Why is this?

data test1;
    x=1;
    output;
    x=2;
    output;
    x=3;
    output;
run;

data test2;
    x=1;
    y=2;
run;

data test3;
    set test1 test2;
    if missing(y) then z=x;
run;

data test4;
    set test1 test2;
    if missing(y) then y=x;
run;

Upvotes: 2

Views: 421

Answers (2)

Tom
Tom

Reputation: 51601

Variables that are created new by the data step, like the Z in your step that creates TEST3, are set to missing at the start of each iteration of the data step.

But variables that are coming from source datasets are "retained" (that is not set to missing automatically). So in the data step that creates TEST4 once a value is assigned to Y it is retained. Of course when the SET statement reads an observation from TEST2 the value of Y that had been retained from the previous iteration is overwritten.

Add some PUT statements so you can watch the values of X Y (and Z) as they change. First data step:

1234  data test3;
1235    put 'Before SET: ' (_n_ x y z) (=);
1236    set test1 test2;
1237    put ' After SET: ' (_n_ x y z) (=);
1238    if missing(y) then z=x;
1239    put ' After IF : ' (_n_ x y z) (=);
1240  run;

Before SET: _N_=1 x=. y=. z=.
 After SET: _N_=1 x=1 y=. z=.
 After IF : _N_=1 x=1 y=. z=1
Before SET: _N_=2 x=1 y=. z=.
 After SET: _N_=2 x=2 y=. z=.
 After IF : _N_=2 x=2 y=. z=2
Before SET: _N_=3 x=2 y=. z=.
 After SET: _N_=3 x=3 y=. z=.
 After IF : _N_=3 x=3 y=. z=3
Before SET: _N_=4 x=3 y=. z=.
 After SET: _N_=4 x=1 y=2 z=.
 After IF : _N_=4 x=1 y=2 z=.
Before SET: _N_=5 x=1 y=2 z=.
NOTE: There were 3 observations read from the data set WORK.TEST1.
NOTE: There were 1 observations read from the data set WORK.TEST2.
NOTE: The data set WORK.TEST3 has 4 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds

Second data step:

1241
1242  data test4;
1243    put 'Before SET : ' (_n_ x y) (=);
1244    set test1 test2;
1245    put ' After SET : ' (_n_ x y) (=);
1246    if missing(y) then y=x;
1247    put ' After IF  : ' (_n_ x y) (=);
1248  run;

Before SET : _N_=1 x=. y=.
 After SET : _N_=1 x=1 y=.
 After IF  : _N_=1 x=1 y=1
Before SET : _N_=2 x=1 y=1
 After SET : _N_=2 x=2 y=1
 After IF  : _N_=2 x=2 y=1
Before SET : _N_=3 x=2 y=1
 After SET : _N_=3 x=3 y=1
 After IF  : _N_=3 x=3 y=1
Before SET : _N_=4 x=3 y=1
 After SET : _N_=4 x=1 y=2
 After IF  : _N_=4 x=1 y=2
Before SET : _N_=5 x=1 y=2
NOTE: There were 3 observations read from the data set WORK.TEST1.
NOTE: There were 1 observations read from the data set WORK.TEST2.
NOTE: The data set WORK.TEST4 has 4 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

Upvotes: 1

PeterClemmensen
PeterClemmensen

Reputation: 4937

The answer is in the When Variable Values Are Automatically Set to Missing by SAS section of Missing Variable Values Doc :

When variables are read with a SET, MERGE, or UPDATE statement, SAS sets the values to missing only before the first iteration of the DATA step. (If you use a BY statement, the variable values are also set to missing when the BY group changes.) The variables retain their values until new values become available (for example, through an assignment statement or through the next execution of the SET, MERGE, or UPDATE statement). Variables created with options in the SET, MERGE, and UPDATE statements also retain their values from one iteration to the next.

Meaning that in the test4 data step, the if missing(y) is true only in the first iteration of your data step. Then, you set y = 1, which is retained in the PDV.

That is not an issue in test3, because you do not overwrite y.

Upvotes: 2

Related Questions