Alph
Alph

Reputation: 391

explanation of nested dow-loop in SAS

I am wondering how does the pseudo-code below works? I appreciate if there is a worked example.

data want;
  do until (last.var1);
    do until (last.var2);
      set have;

      * other sas statements;

    end;
 end;
run;

Upvotes: 2

Views: 424

Answers (1)

Joe
Joe

Reputation: 63424

Basically, a single DoW loop lets you perform particular actions after each by variable boundary, and have slightly different timings than a normal data step (Which can be helpful, or not). So given this set::

data have;
  input x y z;
  datalines;
1 1 1
1 1 2
1 2 1
1 2 2
2 1 1
2 1 2
2 2 1
2 2 2
;;;;
run;

Here is the normal data step:

data want;
  set have;
  by x;
  if first.x then do;
    put "First value of " x=;
  end;
  put _all_;
  if last.x then do;
    put "Last value of " x=;
  end;
run;

And here is the DoW:

data want_dow;
  put "First value of " x=;
  do _n_ = 1 by 1 until (last.x);
    set have;
    by x;
    put _all_;
  end;
  put "Last value of " x=;
run;

Notice that it has slightly different results - for the first iteration, and the final iteration, and it outputs different rows. That is because SAS is doing all of that stuff for us automatically in the first method, while DoW loop you have to do it yourself (you'd have to put an OUTPUT statement in there, for example, if you want all 8, and you'd have to test for EOF andSTOP` if true).

But perhaps this is what you want - you want there to be no value at first, then you want to do something. That's when DoW loop is useful.

A nested DoW loop is identical, just there are two different points you can take action at. Notice that it doesn't actually change how the rows are read in: every time you run into that set statement the next row is read (whatever that row is) from the dataset. Same order, just you have more stopping points to let you write code.

data want;
  set have;
  by x y;
  if first.x then do;
    put "First value of " x=;
  end;
  if first.y then do;
    put "First value of " y=;
  end;
  put _all_;
  if last.y then do;
    put "Last value of " y=;
  end;
  if last.x then do;
    put "Last value of " x=;
  end;

run;

data want_dow;
  put "First value of " x=;
  do _n_ = 1 by 1 until (last.x);
    put "First value of " y=;
    do _n_ = 1 by 1 until (last.y);
      set have;
      by x y;
      put _all_;
    end;
    put "Last value of " y=;
  end;
  put "Last value of " x=;
run;

Again, you have differences here because DoW loop "first" takes action before the first row is read - which again can be helpful, or not helpful, depending on your use case. I don't think I've ever had a use case for this, but it's certainly not impossible.

Here's a useful case for example where you are basically doing a PROC MEANS by hand. It can be done both ways of course; some will prefer each.

data want_dow;    
  do _n_ = 1 by 1 until (last.x);
    do _n_ = 1 by 1 until (last.y);
      set have;
      by x y;
      z_sum_y = sum(z_sum_y,z);
      z_sum_x = sum(z_sum_x,z);
    end;
    z_sum = z_sum_y;
    output;
    call missing(z_sum_y);
  end;
  call missing(y);
  z_sum = z_sum_x;
  output;
  drop z_sum_y z_sum_x;
run;

data want;
  set have;
  by x y;
  z_sum_y+z;
  z_sum_x+z;
  if last.y then do;
    z_sum = z_sum_y;
    output;
    z_sum_y=0;
  end;
  if last.x then do;
    z_sum = z_sum_x;
    call missing(y);
    output;
    z_sum_x=0;
  end;
  drop z_sum_y z_sum_x;
run;

Mostly, though, DoW loop is most useful for the Double DoW loop, which is useful for summarization and then reading that summarized value on the same data step iteration. This is the same summarization but allows you to see the value on the current row. If you want to see the differences, change the Z values in have to something else (I put them to 1/2 in patterns intentionally so it makes it easier to check things).

data want_ddow;
  array z_sum_ys[2] _temporary_;
  do _n_ = 1 by 1 until (last.x);
    do _n_ = 1 by 1 until (last.y);
      set have;
      by x y;
      z_sum_ys[y] = sum(z_sum_ys[y],z);
      z_sum_x = sum(z_sum_x,z);
    end;
  end;

  do _n_ = 1 by 1 until (last.x);  *do not need nesting here;
    set have;
    by x y;
    z_sum_y = z_sum_ys[y];
    output;
  end;
  call missing(of z_sum_ys[*] z_sum_x);
run;

To do that without the Double DoW loop, you'd have to merge the results from the first want back to have. That's not necessarily a big deal, but it is a second pass on the data; the Double DoW loop takes advantage of buffering to avoid actually re-doing the I/O of the second read in.

Upvotes: 7

Related Questions