Olga
Olga

Reputation: 1

Optimizing of SQL-select in SAS

I have 2 following datasets, first of them is a subset of the second:

data records;
input a $1. b $1. c $10.;
datalines;
111
333
;
run;`

data all_records;
input a $1. b $1. c $10.;
datalines;
111
222
333
443
553#3
666
;
run;

I want to get these strings from the second dataset as result:

111
333
443
553#3

It's is a result of the SQL-select:

proc sql;
    create table result as
    select t2.*
    from
    records t1
    inner join all_records t2
    on ((t1.a=t2.a and t1.b=t2.b) or 
        (t1.c=t2.c and t1.c^='') or 
        t2.c=catx('#',t1.a,t1.b) or 
        t1.c=catx('#',t2.a,t2.b)
       )
    ;
quit;

The problem is that this SQL-select works for a long time (about 3.5 hours), if source datasets have more strings (records has about 50 000 strings, all_records has 2.5 million of strings), because a execution of this query involves performing a lot of Cartesian joins.

I want to optimize this SQL-select and write code in SAS.

Any suggestions on how to solve the problem. Thank you!

Upvotes: 0

Views: 146

Answers (3)

Olga
Olga

Reputation: 1

Many thanks for help to all of you!! I've decided to use SQL variant, it works for 56 seconds with big data.

Upvotes: 0

Parfait
Parfait

Reputation: 107747

Another classic SQL performance debate: UNION vs OR. Consider using a UNION query, separating each OR condition in its own SELECT statement in the WHERE clause. The OR is considered a conditional operator and UNION a set operator and handles indexing differently. You will also notice I removed the explicit join for an implicit join mainly for readability since no true IDs are linking tables.

Depending on the query optimizer, performance of either operator can vary. According to SAS's log, comparing your posted data and query, real and cpu time processing was reduced by more than half:

proc sql;
    create table result2 as
    select t2.*
    from
    records t1, all_records t2
    where(t1.a=t2.a and t1.b=t2.b) 

    union

    select t2.*
    from
    records t1, all_records t2
    where (t1.c=t2.c and t1.c^='') 

    union

    select t2.*
    from
    records t1, all_records t2
    where t2.c=catx('#',t1.a,t1.b) 

    union

    select t2.*
    from
    records t1, all_records t2
    where t1.c=catx('#',t2.a,t2.b);
quit;

Upvotes: 0

Joe
Joe

Reputation: 63434

I like the hash solution to this. In theory SQL should also be able to do this, but it can be hard for SQL to tell that it's possible when you have complicated join conditions.

I add a dummy z variable, because if you don't define anything for data, it apparently will use the hash keys as data - which is bad in the case of the a#b lookup.

If you want the a#b lookup to go the other direction (you don't have data showing that possible) then you need to reverse that - have a lookup in hC of key:catx('#',a,b).

Otherwise you have a pretty straightforward use of hash here (Except you have to load the hash twice, once for each set of unique keys).

data records;
input a $1. b $1. c $10.;
z=' ';
datalines;
111
333
;
run;

data all_records;
input a $1. b $1. c $10.;
datalines;
111
222
333
443
553#3
666
;
run;

data want;
  if 0 then set records;
  if _n_ = 1 then do;
    declare hash hAB(dataset:'records');
    hAB.defineKey('a','b');
    hAB.defineData('z');
    hAB.defineDone();
    declare hash hC(dataset:'records');
    hC.defineKey('c');
    hC.defineData('z');
    hC.defineDone();  
  end;
  set all_Records;
  rcAB = hAB.find();
  rcC  = hC.find();
  if find(c,'#') then do;
    rcAPB = hAB.find(key:scan(c,1,'#'),key:scan(c,2,'#'));
  end;
  if rcAB eq 0 or rcC eq 0 or rcAPB eq 0;
run;

Upvotes: 2

Related Questions