Reputation: 217
I've got some weird results I don't quite understand. I create a data set in a data step, using several data sets in the set statement. There is a variable that is present in some of the datasets, but not in all of them. If this variable is missing in my new dataset, I want to give it some value. This creates a dangerously non-intuitive result and no warnings or errors.
In the example below, y is not present in test1. When creating test3, the behavior is as I would expect: z is assigned the x value from the same row for all the observations coming from test1. But test4 is not what i expect: the first value of x is repeated for all the rows from test1. Why is this?
data test1;
x=1;
output;
x=2;
output;
x=3;
output;
run;
data test2;
x=1;
y=2;
run;
data test3;
set test1 test2;
if missing(y) then z=x;
run;
data test4;
set test1 test2;
if missing(y) then y=x;
run;
Upvotes: 2
Views: 421
Reputation: 51601
Variables that are created new by the data step, like the Z in your step that creates TEST3, are set to missing at the start of each iteration of the data step.
But variables that are coming from source datasets are "retained" (that is not set to missing automatically). So in the data step that creates TEST4 once a value is assigned to Y it is retained. Of course when the SET statement reads an observation from TEST2 the value of Y that had been retained from the previous iteration is overwritten.
Add some PUT statements so you can watch the values of X Y (and Z) as they change. First data step:
1234 data test3;
1235 put 'Before SET: ' (_n_ x y z) (=);
1236 set test1 test2;
1237 put ' After SET: ' (_n_ x y z) (=);
1238 if missing(y) then z=x;
1239 put ' After IF : ' (_n_ x y z) (=);
1240 run;
Before SET: _N_=1 x=. y=. z=.
After SET: _N_=1 x=1 y=. z=.
After IF : _N_=1 x=1 y=. z=1
Before SET: _N_=2 x=1 y=. z=.
After SET: _N_=2 x=2 y=. z=.
After IF : _N_=2 x=2 y=. z=2
Before SET: _N_=3 x=2 y=. z=.
After SET: _N_=3 x=3 y=. z=.
After IF : _N_=3 x=3 y=. z=3
Before SET: _N_=4 x=3 y=. z=.
After SET: _N_=4 x=1 y=2 z=.
After IF : _N_=4 x=1 y=2 z=.
Before SET: _N_=5 x=1 y=2 z=.
NOTE: There were 3 observations read from the data set WORK.TEST1.
NOTE: There were 1 observations read from the data set WORK.TEST2.
NOTE: The data set WORK.TEST3 has 4 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
Second data step:
1241
1242 data test4;
1243 put 'Before SET : ' (_n_ x y) (=);
1244 set test1 test2;
1245 put ' After SET : ' (_n_ x y) (=);
1246 if missing(y) then y=x;
1247 put ' After IF : ' (_n_ x y) (=);
1248 run;
Before SET : _N_=1 x=. y=.
After SET : _N_=1 x=1 y=.
After IF : _N_=1 x=1 y=1
Before SET : _N_=2 x=1 y=1
After SET : _N_=2 x=2 y=1
After IF : _N_=2 x=2 y=1
Before SET : _N_=3 x=2 y=1
After SET : _N_=3 x=3 y=1
After IF : _N_=3 x=3 y=1
Before SET : _N_=4 x=3 y=1
After SET : _N_=4 x=1 y=2
After IF : _N_=4 x=1 y=2
Before SET : _N_=5 x=1 y=2
NOTE: There were 3 observations read from the data set WORK.TEST1.
NOTE: There were 1 observations read from the data set WORK.TEST2.
NOTE: The data set WORK.TEST4 has 4 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
Upvotes: 1
Reputation: 4937
The answer is in the When Variable Values Are Automatically Set to Missing by SAS section of Missing Variable Values Doc :
When variables are read with a SET, MERGE, or UPDATE statement, SAS sets the values to missing only before the first iteration of the DATA step. (If you use a BY statement, the variable values are also set to missing when the BY group changes.) The variables retain their values until new values become available (for example, through an assignment statement or through the next execution of the SET, MERGE, or UPDATE statement). Variables created with options in the SET, MERGE, and UPDATE statements also retain their values from one iteration to the next.
Meaning that in the test4 data step, the if missing(y)
is true only in the first iteration of your data step. Then, you set y = 1, which is retained in the PDV.
That is not an issue in test3
, because you do not overwrite y
.
Upvotes: 2