user3306125
user3306125

Reputation: 215

How to optimize proc sql join in SAS?

I have 2 datasets: t.a with 390K rows and 1 variable and t.b with 60 million rows and 350 variables. I need to join this datasets quickly, but my query is too slow.

How I can optimize query?

My query:

    proc sql;
    create table с as 
    select distinct a.REP_CLID, b.REP_DATE, &Score_Column, b.REP_AGE as AGE 
    from a (IDXWHERE =Yes) ,
    &b (IDXWHERE =Yes)  
    where a.rep_clid = b.rep_clid 

Upvotes: 0

Views: 3570

Answers (2)

user667489
user667489

Reputation: 9569

Since you already have an index on rep_clid in your large table b, this seems like a good candidate for a data step key merge. Tweak as required so you're just keeping variables of interest:

data c;
  set a;
  set b key = rep_clid; /*requires unique index on rep_clid to work properly*/
  if _IORC_ then do;
    _ERROR_ = 0;
    delete;
  end;
run;

That will return only records with rep_clid present in both a and b. You can then deduplicate via proc sort with the nodupkey option.

If you have a non-unique index on b, it can still be made to work, but the syntax is a bit more complex:

data c;
  set a;
  do until(eof);
    set b key = rep_clid end = eof; /*will work with non-unique index on rep_clid*/
    if _IORC_ then do;
      _ERROR_ = 0;
      delete;
    end;
    else output;
  end;
run;

Upvotes: 1

Chris
Chris

Reputation: 151

Performance can be a tricky issue as there a huge number of factors that can affect it. It is often a case of trying numerous ways to achieve the same thing until you find the best performing method.

Are both of these tables SAS tables or one or both third party DBMS tables? This opens up a whole world of performance issues I will leave until confirmed.

Assuming they are both SAS tables try re-writing your query like this if you only want columns from table B, assuming &Score_Column is in table B. If not then this will not work.

proc sql; 
    create table с as 
    select distinct b.REP_CLID, b.REP_DATE, &Score_Column, b.REP_AGE as AGE 
    from   &b   (IDXWHERE =Yes) as      b   
    where   b.rep_clid  in 
            (   select  a.rep_clid 
                from    a (IDXWHERE =Yes) 
            ) 
    ; 
Quit; 

Alternatively you could use a proc format as has been suggested. This example will work if &Score_Column is in table a but could be easily modified if it is not.

Proc sql; 
        create table    rep_clid_fmt    as 
        select  distinct        'rep_clid_fmt'  as fmtname 
        ,       rep_clid                        as      start 

/*              If &Score_Column  is in table a then use &Score_Column  as the label */ 
        ,       &Score_Column                   as      label 

                else use a flag like... 
        ,       'keep'                          as      label 

        from    a 
        ; 
Quit; 

Proc format cntlin=rep_clid_fmt; 
Run; 


proc sql; 
    create table с as 
    select distinct b.REP_CLID 
        ,  b.REP_DATE 
        , put (b.REP_DATE,rep_clid_fmt) as      &Score_Column 
        , b.REP_AGE as AGE 
    from   &b   (IDXWHERE =Yes) as      b   
    where   put (b.REP_DATE,rep_clid_fmt)       ne substr (b.REP_DATE,1,length(put(b.REP_DATE,rep_clid_fmt)) 
    ; 
Quit; 

Good luck!

Upvotes: 0

Related Questions