Lovnlust

Reputation: 1537

Update table from other tables in SAS

I have a daily-updated dataset on the server named IOA.HAVE, which contains the columns Date, Area, LocnID, ATTR1, ATTR2, ... ATTR10. To simplify the problem, let's say

Date    Area LocnID A1 A2
01Nov14 AAA  100000 50 G
01Nov14 AAA  100001 30 G
01Nov14 AAA  100002 30 K
01Nov14 BBB  100003 20 K
02Nov14 CCC  100009 30 C  
02Nov14 AAA  100000 50 G

For every specific Date, LocnID is unique.

Another local file (.xlsx) named Adjustment is imported into SAS via PROC IMPORT every day.
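Roughly, the daily import step looks like this (the path and sheet name are just placeholders):

proc import datafile="/path/to/Adjustment.xlsx"   /* placeholder path */
            out=work.adjustment
            dbms=xlsx
            replace;
  sheet="Sheet1";                                 /* placeholder sheet name */
  getnames=yes;
run;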

Date    Area LocnID A1 A2 Type
02Nov14 BBB  100000 50 G  change
02Nov14 CCC  100009 30 C  close
03Nov14 DDD  200000 20    open

Its columns are the same as those in HAVE, except for the extra Type column.

If Type = change, then starting from that day, all attributes in HAVE for that LocnID should be replaced by those in Adjustment. If Type = close, then starting from that day, all records with that LocnID should be deleted from HAVE. If Type = open, then starting from that day, new records should be added to HAVE.

So after the adjustments, IOA.HAVE should be

Date    Area LocnID A1 A2
01Nov14 AAA  100000 50 G
01Nov14 AAA  100001 30 G
01Nov14 AAA  100002 30 K
01Nov14 BBB  100003 20 K
02Nov14 BBB  100000 50 G    /* change Area */
....
....
03Nov14 DDD  200000 20      /* open */

Currently I do it like this:

data t1 t2;
set adjustment;
if type in ('change','close') then output t1;
if type in ('change','open') then output t2;
run;

proc sql;
create table a1 as 
select * from IOA.HAVE as a
where not exists (select * from work.t1 where a.Date >= t1.Date and a.LocnID = t1.LocnID)
union
select * from t2 
where t2.Date <= today()
order by Date, LocnID;
quit;

But this is very inefficient. How can I optimize it (preferably in a more 'SAS' way rather than with SQL)?

Upvotes: 1

Views: 204

Answers (1)

Joe

Reputation: 63424

If your tables are both sorted by date and locnID, then your method is not a bad one. You might consider changing the EXISTS subquery to a join, but I think SQL should optimize them to the same query at the end of the day (check, though!). I'd say this is a fairly standard way of doing add/update/delete transactions: delete (updates + deletes), then insert (updates + adds). Actually performing the deletes, rather than recreating the table, will be more efficient, though.
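As a sketch only (reusing your T1/T2 split and the simplified column list), the delete-then-insert variant would look something like this:

proc sql;
  /* remove rows on or after the adjustment date for change/close locations */
  delete from IOA.HAVE as a
    where exists (select 1 from work.t1 as t
                  where a.Date >= t.Date and a.LocnID = t.LocnID);

  /* add the change/open rows that are effective today or earlier */
  insert into IOA.HAVE (Date, Area, LocnID, A1, A2)
    select Date, Area, LocnID, A1, A2
      from work.t2
      where Date <= today();
quit;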

Interestingly enough, adding an index is bad for this query - very bad. I ran your query on sorted tables without indexes in about 30 seconds (table sizes ~500k rows for HAVE, ~1k for ADJUSTMENT), and in about 3 minutes with an index. The problem is the date inequality (>=): it works badly with indexes, and it's very common to see indexes hurt queries like this.
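For reference, the kind of index in question is along these lines (the exact column choice here is an assumption, not necessarily the one tested):

proc sql;
  create index dtloc on have (date, locnid);   /* composite index - this is what hurts */
quit;

/* and to drop it again after testing */
proc sql;
  drop index dtloc from have;
quit;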

In SAS, you might be able to do this easily with a MERGE. I don't think MODIFY is the right route, because we're doing some things that would be hard to replicate in a MODIFY statement. It might work, but I find MERGE easier to code here.
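For comparison, a rough sketch of how the MODIFY route could be wired up (a sketch only; note that it touches only the exact matching Date/LocnID row, so it can't propagate a change forward the way the MERGE below does):

data ioa.have;
  modify ioa.have work.adjustment;
  by locnid date;
  if _iorc_ = %sysrc(_sok) then do;          /* transaction matched a master row */
    if type = 'close' then remove;           /* delete that row */
    else if type = 'change' then replace;    /* rewrite it with the adjusted values */
  end;
  else if _iorc_ = %sysrc(_dsenmr) then do;  /* no matching master row */
    _error_ = 0;                             /* clear the unmatched-transaction error flag */
    if type = 'open' then output;            /* append the new row */
  end;
run;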

Here is my example, and I include your SQL code below as a comparison. There is one major difference: in my SAS code I don't delete the changed records, which means I end up with more rows than you do - yours deletes all rows from the change onward. You can do that in the SAS method as well, but your English-language version of the spec sounds like that's not really what you want.

First, I just create the tables to test with. This is around half a million rows in HAVE and a thousand adjustments. You probably have more, or you wouldn't be asking, but this should give an idea of speed.

data have;
  array locnIDs[1000] _Temporary_;   *open/closed flag for each location;
  do _t = 1 to dim(locnIDs);
    locnIDs[_t] = 1;                 *start with every location open;
  end;
  call streaminit(7);
  do date = '01NOV2011'd to '01DEC2014'd;
    do _t = 1 to dim(locnIDs);
      if rand('Uniform') < 0.02 then locnIDs[_t]=not(locnIDs[_t]);  *~2% daily chance a location flips;
      if locnIDs[_t] then do;        *one row per open location per day;
        locnID = 100000+_t;
        A1=2**_t;
        A2=byte(mod(_t,26)+65);
        Area = repeat(byte(mod(_t,26)+65),2);
        output;
      end;
    end;
  end;
run;

data adjustment;
  call streaminit(7);
  set have;
  by date locnID;
  length type $6;
  retain lastlocnID;
  if (first.date) then lastlocnID=.;
  if lastlocnID gt 0 and locnID-lastlocnID gt 1 then do;
    if rand('Uniform') lt 0.001 then do;       
        locnID=locnID-1;
        type='open';
        output;
    end;
    else if rand('Uniform') lt 0.001 then do;
        type='close';
        output;
    end;
  end;
  else do;
    if rand('Uniform') lt 0.001 then do;
        type='change';
        Area = '###';
        output;
    end;
  end;
  lastlocnID=locnID;
run;

Next I split into t1/t2 and do the SQL method you have above.

data t1 t2;
  set adjustment;
  if type in ('change','close') then output t1;   *t1 is changes/deletes;
  if type in ('change','open') then output t2;    *t2 is all adds/changes;
run;
proc sql;
create table a1 as 
  select * from HAVE as a
    where not exists (select * from work.t1 where a.Date >= t1.Date and a.LocnID = t1.LocnID)
  union
  select * from t2 
    where t2.Date <= today()
  order by Date, LocnID;
quit;

Now here is the SAS MERGE - much faster. Basically, I sort by locnid and date so that changes can be propagated across dates for a location, then use temporary storage variables to hold the change values that need to be carried down, plus another to flag deletes. Note that a lot more records survive, because I'm propagating the change down rather than deleting all of the later records that are changed. If you do want to delete them, this method still works and is actually a lot easier: you can skip all of the __a1=a1 and a1=__a1 business and just assign __delflag=1 after the output statement.

proc sort data=have;
  by locnid date;
run;

proc sort data=adjustment;
  by locnid date;
run;

data have2;
  merge have(in=_h) adjustment(in=_a);
  by locnid date;
  retain __delflag __a1 __a2 __area;
  if first.locnid then do;   *clear the flags;
    call missing(of __delflag __a1 __a2 __area);
  end;
  if not _a and not missing(__a1) then do;  *if a previous change is pending;
    a1=__a1;
    a2=__a2;
    area=__area;
  end;
  if _a and _h and type='change' then do;  *if a change is implemented;
    __a1=a1;
    __a2=a2;
    __area=area;
    __delflag=0;  *Not sure if this should be possible, but in case;
  end;
  else if _a and _h and type='close' then do;  *if a delete is needed;
    __delflag=1;
  end;

  if __delflag=1 then delete;
  output;
run;

This runs in a fraction of a second, since SAS doesn't have to go back in time and check each record. I'm not sure it exactly replicates what you want, but it should come reasonably close.

The exact replica of a1 in SAS is this:

data have3;
  merge have(in=_h) adjustment(in=_a);
  by locnid date;
  retain __delflag;
  if first.locnid then do;
    call missing(__delflag);
  end;
  if _a and _h and type = 'close' then do;
    __delflag=1;
  end;
  if __delflag=1 and not (type in ('open','change')) then delete;
  output;
  if _a and _h and type='change' then __delflag=1;
  if last.locnid then do;
    call missing(of __:);
  end;
run;

I remove the propagating parts, delete the records that follow a change as well, and add a few conditions to the delete so that later open/change records don't get wiped out. (Those may not even be possible in your real data; this is a replica built against my fake data, which can have opens/changes after a close.)
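If you want to sanity-check the replica, one option is to re-sort both results the same way and let PROC COMPARE do the rest (the output dataset names here are just for the check; variables present in only one table are simply listed):

proc sort data=a1 out=a1_srt;
  by locnid date;
run;
proc sort data=have3 out=have3_srt;
  by locnid date;
run;

proc compare base=a1_srt compare=have3_srt listvar;
  id locnid date;
run;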

Upvotes: 2
