Reputation: 4674
I have a question about the following 2 codes in SAS PROC SQL.
Code 1: (Standard Book version)
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV AS a
WHERE
a.SITEID = '0001'
AND a.CLAIMID IN (SELECT CLAIMID FROM WORK.INPUT)
Code 2: (The much faster way in practice)
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV AS a
WHERE
a.SITEID = '0001'
AND a.CLAIMID IN ('10001', '10002', '10003', ... '15000')
When I try to do it more elegantly by using subquery in #1, the run time blows up to 50 minutes +. But the same input returns within 3 minutes using Code 2. Why is that? Note, it's just as slow using INNER JOIN too (after reading this). The input is 5000+ CLAIMID, which I manually paste into the IN('...') block everyday.
PS: The CLAIMID are made up, in real life they are random.
The CLAIMID are indexed in DW.CLAIMS. I am using SAS PROC SQL to access an Oracle database. What is going on, and is there a better way? Thanks!
Upvotes: 2
Views: 990
Reputation: 95
You can also save the In() values in a table and just do a join.
PROC SQL;
/*CLAIM ID Table*/
CREATE TABLE WORK.OUTPUT1 AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV AS a
WHERE
a.SITEID = '0001';
/*ID Lookup Table*/
CREATE TABLE WORK.OUTPUT2 AS
SELECT
DISTINCT b.CLAIMID FROM WORK.INPUT AS b
;
/*Inner Join Table / AKA lookup join*/
CREATE TABLE WORK.Final AS
SELECT
a.SOURCE, a.CLAIMID, a.DXCODE
FROM WORK.OUTPUT1 AS a INNER JOIN WORK.OUTPUT2 AS b
ON a.CLAIMID = b.CLAIMID
;
QUIT;
Upvotes: 1
Reputation: 4792
Please be sure to use
option sastrace=',,,ds' sastraceloc=saslog nostsuffix;
to receive information on how your code is translated by SAS/Aceess engine to DB statements.
In order to give SAS a hint to dynamicly build IN (1,2,3, ..)
clause from your IN (SELECT ..
query
MULTI_DATASRC_OPT=IN_CLAUSE
to your libname DW ...
statement and dbmaster
dataset option to the "master" tablelike one of the following queries:
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV (dbmaster=yes) AS a
WHERE
a.SITEID = '0001'
AND a.CLAIMID IN (SELECT CLAIMID FROM WORK.INPUT)
or
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV (dbmaster=yes) AS a
inner join WORK.INPUT AS b
on a.CLAIMID = b.CLAIMID
WHERE
a.SITEID = '0001'
Upvotes: 2
Reputation: 1804
Using the In() without sub-querying is definitely faster, but other performance consideration to keep in mind is the network and compute server load/traffic at the time of running; assuming you are running on a client / server configuration.
If you plan to use the SQL select into macro variable solution; keep in mind the count of distinct values and the length of the string you are saving in the macro as there is a size limit.
Upvotes: 1
Reputation: 63424
I don't know that I can tell you why SAS is so slow at the first select; something's not optimized in that scenario clearly.
If I had to guess, I'd guess that SAS is deciding in the first case that it can't use pass-through SQL and so it's downloading the whole big table and then running this SAS-side, while in the second case it's passing the query up to the SQL database and only transporting the resulting rows back.
But there are several ways to work around this, anyway. Here's one: use a macro variable to do precisely the pasting you're doing!
proc sql;
select quote(strip(claimid)) into :claimlist separated by ','
from work.input
;
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV AS a
WHERE
a.SITEID = '0001'
AND a.CLAIMID IN (&claimlist.)
;
quit;
Tada, you don't have to touch this anymore, and it's identical to the copy/paste that you did.
A few extra notes given some comments:
If CLAIMID is ever less than 15, you may have space padding, so I added strip
to remove those. It doesn't matter for string comparisons - except insomuch as you might run out of macro language, and I worry that some DBMS may actually care about the padding. You can leave out strip
if the 15 is a constant length.
Macro variables run up to 64K in space. If you have 15 character variable plus " " two plus comma one, you have 18 characters; you have room for a bit over 3500 values. That's under 5000, unfortunately.
In this case, you can either split up the field into two macro variables (easy enough hopefully, use obs
and firstobs
) or you can do some other solution.
work.input
dataset into the DW
libname, then do the join in SQL there.%include
that file.call execute
to execute the whole proc SQL.Here's one example of CALL EXECUTE.
data _null_;
set work.input end=eof;
if _n_=1 then do;
call execute('CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM
DW.CLAIMS_BAV AS a
WHERE
a.SITEID = "0001"
AND a.CLAIMID IN ('); *the part of the SQL query before the list of IDs;
end;
call execute(quote(claimID) || ' ');
if EOF then do;
call execute('); QUIT;'); *the part of the SQL query after the list of IDs;
end;
run;
This would be nearly identical to the %INCLUDE
solution really, except there you put
that stuff to a text file instead of CALL EXECUTE
ing it, and then you %INCLUDE
that text file.
Upvotes: 4
Reputation: 21264
I think you're working both with local data and data on your server. When SAS is working with data from different sources (databases) it brings it all into SAS for processing which can be really, really slow.
Instead, you can make a macro variable and use that within your query. If it's 5000, it should fit into one macro variable, assuming the length is less than 13 chars each. A macro variable size limit is 64K characters, so it depends on the length of the variable. If not you could create a macro instead.
proc sql noprint;
select quote(claimID, "'") into : claim_list separated by ", " from input;
quit;
proc sql;
CREATE TABLE WORK.OUTPUT AS
SELECT
"CLAIM" AS SOURCE,
a.CLAIMID,
a.DXCODE
FROM DW.CLAIMS_BAV AS a
WHERE
a.SITEID = '0001'
AND a.CLAIMID IN (&claim_list.);
quit;
Upvotes: 3