Reputation: 215
I have 2 datasets: t.a with 390K rows and 1 variable and t.b with 60 million rows and 350 variables. I need to join this datasets quickly, but my query is too slow.
How I can optimize query?
My query:
proc sql;
create table с as
select distinct a.REP_CLID, b.REP_DATE, &Score_Column, b.REP_AGE as AGE
from a (IDXWHERE =Yes) ,
&b (IDXWHERE =Yes)
where a.rep_clid = b.rep_clid
Upvotes: 0
Views: 3570
Reputation: 9569
Since you already have an index on rep_clid in your large table b, this seems like a good candidate for a data step key merge. Tweak as required so you're just keeping variables of interest:
data c;
set a;
set b key = rep_clid; /*requires unique index on rep_clid to work properly*/
if _IORC_ then do;
_ERROR_ = 0;
delete;
end;
run;
That will return only records with rep_clid present in both a and b. You can then deduplicate via proc sort with the nodupkey option.
If you have a non-unique index on b, it can still be made to work, but the syntax is a bit more complex:
data c;
set a;
do until(eof);
set b key = rep_clid end = eof; /*will work with non-unique index on rep_clid*/
if _IORC_ then do;
_ERROR_ = 0;
delete;
end;
else output;
end;
run;
Upvotes: 1
Reputation: 151
Performance can be a tricky issue as there a huge number of factors that can affect it. It is often a case of trying numerous ways to achieve the same thing until you find the best performing method.
Are both of these tables SAS tables or one or both third party DBMS tables? This opens up a whole world of performance issues I will leave until confirmed.
Assuming they are both SAS tables try re-writing your query like this if you only want columns from table B, assuming &Score_Column is in table B. If not then this will not work.
proc sql;
create table с as
select distinct b.REP_CLID, b.REP_DATE, &Score_Column, b.REP_AGE as AGE
from &b (IDXWHERE =Yes) as b
where b.rep_clid in
( select a.rep_clid
from a (IDXWHERE =Yes)
)
;
Quit;
Alternatively you could use a proc format as has been suggested. This example will work if &Score_Column is in table a but could be easily modified if it is not.
Proc sql;
create table rep_clid_fmt as
select distinct 'rep_clid_fmt' as fmtname
, rep_clid as start
/* If &Score_Column is in table a then use &Score_Column as the label */
, &Score_Column as label
else use a flag like...
, 'keep' as label
from a
;
Quit;
Proc format cntlin=rep_clid_fmt;
Run;
proc sql;
create table с as
select distinct b.REP_CLID
, b.REP_DATE
, put (b.REP_DATE,rep_clid_fmt) as &Score_Column
, b.REP_AGE as AGE
from &b (IDXWHERE =Yes) as b
where put (b.REP_DATE,rep_clid_fmt) ne substr (b.REP_DATE,1,length(put(b.REP_DATE,rep_clid_fmt))
;
Quit;
Good luck!
Upvotes: 0