Reputation: 1319
I'm trying to understand the behaviour of the code below. It seems wrong to me but I'd appreciate someone else taking a look in case I'm just going mad or something.
data testdata;
length var1 var2 $ 10;
var1 = "house"; var2 = ""; output;
var1 = "house"; var2 = "car"; output;
var1 = "house"; var2 = "house"; output;
run;
proc sql;
* Select all three obs- ok;
create table try1 as select * from testdata where indexw("house", var1);
* Selects one obs - ok;
create table try2 as select * from testdata where indexw("house", var2);
* Selects one obs - ok;
create table try3 as select * from testdata where indexw("car", var2);
* Selects all three obs - why?;
create table try4 as select * from testdata where indexw("house", var1) and indexw("house", var2);
* Selects one obs - ok;
create table try5 as select * from testdata where indexw("house", var1) and indexw("car", var2);
* Explicit comparison to zero - selects one obs - ok;
create table try6 as select * from testdata where indexw("house", var1) and (indexw("house", var2) ne 0);
* Compare to VAR2 first - selects one obs - ok;
create table try7 as select * from testdata where indexw("house", var2) and indexw("house", var1);
quit;
When I run this code, TRY1 has three observations and TRY2 has one - the ones with VAR1="house" and VAR2="house" respectively. This is what I would expect, and based on that, I would expect TRY4 to only contain the single observation where both VAR1 and VAR2 are "house". Instead TRY4 selects all three observations from the input.
Even more strangely, TRY6 uses an explicit compare to zero and only selects one observation as expected, as does TRY7, which reverses the order of the comparisons.
A similar thing happens in data step if I use a WHERE statement, but not if I use an IF statement:
114 data t;
115 set testdata;
116 where indexw("house", var1) and indexw("house", var2);
117 run;
NOTE: There were 3 observations read from the data set WORK.TESTDATA.
WHERE not (not INDEXW('house', var1));
NOTE: The data set WORK.T has 3 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
118 data t;
119 set testdata;
120 if indexw("house", var1) and indexw("house", var2);
121 run;
NOTE: There were 3 observations read from the data set WORK.TESTDATA.
NOTE: The data set WORK.T has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
Note that the first data step outputs all three observations, while the second only outputs one.
The log for the first step reveals a clue - it looks like SAS has re-interpreted the WHERE clause in a way that changes its meaning.
What's going on here?
Run in 64-bit SAS 9.4 (TS1M7) on Windows 10.
Upvotes: 1
Views: 96
Reputation: 51611
The bug is in SAS at least as far back as 9.4M5 also. Contact SAS directly to get details on this bug. You can raise a ticket at: http://support.sas.com
As I remember it is related to using the same function (possibly it also requires that both usages use the same first argument?) and also as your example shows treating the function result as a boolean.
The similarity in the two expressions is confusing the optimization logic. The work around of adding a explicit comparison operator that you showed works to avoid the bug.
where 0<indexw("house", var1) and 0<indexw("house", var2);
Upvotes: 2