Reputation: 35
From what I can tell both example code blocks will produce the same output, but the in the first statement the where is applied after the query inside is completed whereas in the second statement the where is applied during the running of the statement.
I am correct in thinking the second option will be more performant due to the filtering occurring earlier and reducing the data in the join? When would this "where" option for create table be the preferred option?
proc sql;
create table (where=(field1=1))
select a.field1, a.field2
from table1 a
join table2 b
on a.field1 = b.field1;
quit;
proc sql;
create table
select a.field1, a.field2
from table1 a
join table2 b
on a.field1 = b.field1;
where a.field1 = 1;
quit;
Upvotes: 2
Views: 147
Reputation: 3845
The data set option (where=(...))
can only put conditions on that table.
select man.name, woman.name
from man(age > 60), woman(age between 50 and 60)
where man.address = woman.address;
The SQL where
clause applies to all tables involved in the query, hence if two tables have fields with identical names you have to qualify that name.
select man.name, woman.name
from man, woman
where man.address = woman.address
and man.age > 60
and woman.age between 50 and 60;
Logically, the dataset option is applied to the data table before it is involved in the query and the result of that condition is stored in a temporary table, which could be huge. This would mean the dataset option is slower.
In practice however, all SQL goes through an "optimizer" before it is applied to the data.
So what is more efficient, depends on the library engine.
A particular case is, when one table is on a database and another one in SAS base, or both tables are in different databases. Then the data of each table is retrieved first, then joined in SAS. It might wel be, that then a condition in the dataset options is already applied on the database and a condition in the where clause is only applied in SAS. This would mean the dataset option is faster;
Upvotes: 3
Reputation: 12909
This is one of those situations that is great for real-world testing. Let's try it. We'll create a larger table of 500K rows so we can more easily see differences in performance, if any. We'll loop through the code 100 times and check the average time and standard deviation.
%macro bigloop(n);
%do i = 1 %to &n;
sashelp.cars
%end;
%mend;
data table1;
set %bigloop(1500);
if(_N_ LE 500000);
run;
data table2;
set sashelp.cars;
where make IN('BMW', 'Audi', 'Porsche');
run;
%macro timeit;
%do i = 1 %to 100;
%let start=%sysfunc(datetime());
proc sql noprint;
create table test(where=(make = 'BMW')) as
select t1.make, t2.model
from table1 as t1
join table2 as t2
on t1.make = t2.make
;
quit;
%let time1 = %sysevalf(%sysfunc(datetime())-&start);
%let start=%sysfunc(datetime());
proc sql noprint;
create table test2 as
select t1.make, t2.model
from table1 as t1
join table2 as t2
on t1.make = t2.make
where t1.make='BMW'
;
quit;
%let time2 = %sysevalf(%sysfunc(datetime())-&start);
data time;
time1 = &time1;
time2 = &time2;
run;
proc append base=times data=time;
run;
%end;
proc sql;
select mean(time1) as avg_time1 label='Avg: Where option'
, std(time1) as std_time1 label='Std: Where option'
, mean(time2) as avg_time2 label='Avg: Where clause'
, std(time2) as std_time2 label='Std: Where clause'
from times;
quit;
proc datasets lib=work nolist;
delete times;
quit;
%mend;
%timeit;
In my test, where
as a clause in SQL was 0.33 seconds vs. 0.61 seconds in create table
, and it even has a lower standard deviation so it is more consistent. That's nearly half the time! So, your hypothesis is correct: SQL in the where
clause is more efficient than as an output dataset option.
Upvotes: 2