Reputation: 1
I have 2 following datasets, first of them is a subset of the second:
data records;
input a $1. b $1. c $10.;
datalines;
111
333
;
run;`
data all_records;
input a $1. b $1. c $10.;
datalines;
111
222
333
443
553#3
666
;
run;
I want to get these strings from the second dataset as result:
111
333
443
553#3
It's is a result of the SQL-select:
proc sql;
create table result as
select t2.*
from
records t1
inner join all_records t2
on ((t1.a=t2.a and t1.b=t2.b) or
(t1.c=t2.c and t1.c^='') or
t2.c=catx('#',t1.a,t1.b) or
t1.c=catx('#',t2.a,t2.b)
)
;
quit;
The problem is that this SQL-select works for a long time (about 3.5 hours), if source datasets have more strings (records has about 50 000 strings, all_records has 2.5 million of strings), because a execution of this query involves performing a lot of Cartesian joins.
I want to optimize this SQL-select and write code in SAS.
Any suggestions on how to solve the problem. Thank you!
Upvotes: 0
Views: 146
Reputation: 1
Many thanks for help to all of you!! I've decided to use SQL variant, it works for 56 seconds with big data.
Upvotes: 0
Reputation: 107747
Another classic SQL performance debate: UNION vs OR. Consider using a UNION query, separating each OR condition in its own SELECT
statement in the WHERE
clause. The OR
is considered a conditional operator and UNION
a set operator and handles indexing differently. You will also notice I removed the explicit join for an implicit join mainly for readability since no true IDs are linking tables.
Depending on the query optimizer, performance of either operator can vary. According to SAS's log, comparing your posted data and query, real and cpu time processing was reduced by more than half:
proc sql;
create table result2 as
select t2.*
from
records t1, all_records t2
where(t1.a=t2.a and t1.b=t2.b)
union
select t2.*
from
records t1, all_records t2
where (t1.c=t2.c and t1.c^='')
union
select t2.*
from
records t1, all_records t2
where t2.c=catx('#',t1.a,t1.b)
union
select t2.*
from
records t1, all_records t2
where t1.c=catx('#',t2.a,t2.b);
quit;
Upvotes: 0
Reputation: 63434
I like the hash solution to this. In theory SQL should also be able to do this, but it can be hard for SQL to tell that it's possible when you have complicated join conditions.
I add a dummy z variable, because if you don't define anything for data, it apparently will use the hash keys as data - which is bad in the case of the a#b lookup.
If you want the a#b lookup to go the other direction (you don't have data showing that possible) then you need to reverse that - have a lookup in hC of key:catx('#',a,b).
Otherwise you have a pretty straightforward use of hash here (Except you have to load the hash twice, once for each set of unique keys).
data records;
input a $1. b $1. c $10.;
z=' ';
datalines;
111
333
;
run;
data all_records;
input a $1. b $1. c $10.;
datalines;
111
222
333
443
553#3
666
;
run;
data want;
if 0 then set records;
if _n_ = 1 then do;
declare hash hAB(dataset:'records');
hAB.defineKey('a','b');
hAB.defineData('z');
hAB.defineDone();
declare hash hC(dataset:'records');
hC.defineKey('c');
hC.defineData('z');
hC.defineDone();
end;
set all_Records;
rcAB = hAB.find();
rcC = hC.find();
if find(c,'#') then do;
rcAPB = hAB.find(key:scan(c,1,'#'),key:scan(c,2,'#'));
end;
if rcAB eq 0 or rcC eq 0 or rcAPB eq 0;
run;
Upvotes: 2