EmShing
EmShing

Reputation: 29

Looping with PROC SQL

I have a dataset that looks like this (1 to 5 are months):

date acct seg_1 seg_2 seg_3 seg_4 seg_5 
1/20  A     x     x     x     x     y
1/20  B     x     y     x     x     y
1/20  C     y     y     x     x     x


date acct abc_1 abc_2 abc_3 abc_4 abc_5 
1/20  A     0     0     1     1     1
1/20  B     1     1     0     1     1
1/20  C     1     0     1     0     1

The idea is that I want to count the number of accounts if it satisfies the condition of abc_(t)=0 and abc_(t+1)=1 for each segment columns. Below is my code that gives me the results I wanted without looping:

proc sql;
    create table want_1 as
    select distinct seg_1 as segment, (count(acct)) as count_1
    from have
    where abc_2 = 0 and abc_3 = 1
    group by seg_1;

    create table want_2 as
    select distinct seg_2 as segment, (count(acct)) as count_2
    from have
    where abc_3 = 0 and abc_4 = 1
    group by seg_2;

    create table want_3 as
    select distinct seg_3 as segment, (count(acct)) as count_3
    from have
    where abc_4 = 0 and abc_5 = 1
    group by seg_3;
quit;

However, I would like to embed a macro as I have 84 months ie. code to run and to combine all months after. Would appreciate the help on how to fix my failed code below:

%macro loop(a,b);
    proc sql;
        %do x=&a. %to &b.;
            %do i=&a.+1 %to &b.+1;
                %do j=&a.+2 %to &b.+2;
                    create table want_&x. as
                    select distinct seg_&x. as segment, count(acct) as count_&x.
                    from have
                    where abc_&i. = 0 and abc_&j. = 1
                    group by seg_&x.;
                %end;
            %end;
        %end;
    quit;
%mend;
%loop(a=1,b=84);

Ideally, the combined results (using Segment as the unique identifier) should look like this:

Segment  Count_1 Count_2 Count_3
   x        1       0       1
   y        1       1       0

Note: tried transposing my data but it has over 41 million rows. Appreciate if someone could suggest a data step code as an alternative too!

Upvotes: 0

Views: 3036

Answers (2)

Richard
Richard

Reputation: 27498

Suppose the abc_* variable names are changed to flag_*.

Two data step approaches:

  1. Use arrays to organize segment and flag variables and compare flag(<i>) to flag(<i+1>) in a loop to detect each case of month to month flags transitioning from 0 to 1. In signal processing this is a rising edge and suggests the second way.

  2. Use many statements (macro generated) with bit mask test expressions to locate the rising edge at each place a 01 could occur when the month flags are organized as a collection of bits.

In both cases I would use a HASH to accumulate the distinct combinations of:

  • account (how ever many there are),
  • segment (how ever many there are) and
  • position (month 1 to 83).

and post process to get the account count per segment & position

Note: The worst case scenario of 41 rising edges in every row, combined with an approach that does not check for repeats, the output table for post processing would approach 1/2 the size of a TRANSPOSE output table. And that can be pretty big.

The execution time of the code will likely be quite high due to the 83 pairwise 'checks' per row and the amount of i/o needed save the data from the rising edge cases.

Examples:

* Simulate some data, segment values are numeric for example code;

%* These parameters create 128K rows;

%let N_ACCOUNTS      = 1500;
%let N_DATES         =  208;
%let N_SEGMENT_TYPES = 1000;
%let N_MONTHS        =   84;

data have(keep=account_id date segment_: flag_: seek);
  call streaminit(&SEED);

  do account_id = 1 to 600;
    date = '08JAN2015'd - 7;
    do dix = 1 to &N_DATES;   %* once a week, so &N_DATES weeks;
      date + 7;

      %* _flags is appropriate for up to &N_MONTHS = 128 ;

      _flags = put(rand('uniform',2**32), binary32.)
            || put(rand('uniform',2**32), binary32.)
            || put(rand('uniform',2**32), binary32.)
            || put(rand('uniform',2**32), binary32.)
      ;  
      %* NOTE: edges are in this random data and will need to be counted;

      array segment segment_1-segment_&N_MONTHS;
      array flag_(&N_MONTHS);

      do over segment;
        %* guarded simulation of a 1 of &N_SEGMENT_TYPES segment value;
        do seek = 1 to 1e5 until (1 <= segment <= &N_SEGMENT_TYPES);
          segment = ceil(rand('WEIBULL', 1.15, &N_SEGMENT_TYPES / 5));
        end;
        flag_[_i_] = substr(_flags,_i_,1) = '1';
      end;

      OUTPUT;
    end;
  end;

  format flag_: 1. date yymmdd10.;
run;

* Way 1;
* Use arrays to locate the rising edge condition at each position and a HASH to store the unique combinations;

data _null_;
  length segment account_id position 8;
  call missing (account_id, segment, position);

  declare hash combination(ordered:'A');
  combination.defineKey('account_id', 'segment', 'position');
  combination.defineDone();

  do until (last_row);
    set have end=last_row;

    array segments segment_:;
    array flags flag_:;

    do position = 1 to dim(flags) - 1;
      if flags(position) = 0 and flags(position+1) = 1 then do;
        segment = segments(position);
        %PUT always replace;
        combination.replace();
      end;
    end;
  end;

  combination.output(dataset:'combinations_array_way');
  stop;
run;

* Post process to get combination counts and transpose wide;
* Class data EACH_POSITION presumes no missing account_id in original data;

* data for a dummy account will ensure all 'positions' (months) are present in the desired order;
* the positions 1..N become column name suffixes in transposed data;
data each_position;
  retain account_id segment .;
  do position = 1 to &N_MONTHS-1;
    output;
  end;
run;

proc sql;
  create table summary_array_way as
  select segment, position, count(*) as count
  from 
    ( select * from combinations_array_way union all corresponding
      select * from each_position
    )
  group by segment, position;
quit;

proc transpose data=summary_array_way out=want_array_way(drop=_name_ where=(not missing(segment))) prefix=count_;
  by segment;
  id position;
  var count;
  format count 6.;
run;

* Way 2 (Novel);
* Use macro generated if statements that use bit mask tests to locate the rising edge condition at each position and a HASH to store the unique combinations;

data _null_;
  length segment account_id position 8;
  call missing (account_id, segment, position);

  %* track existance of a combination;
  %* code will add a hash key only at first occurrence of combination,
  %* thus ensuring unique account_ids over segment and position;

  declare hash combination(ordered:'A');
  combination.defineKey('account_id', 'segment', 'position');
  combination.defineDone();

  if 0 then set have; * prep PDV for arrays, and N_MONTHS verification;

  array segments segment_:;
  array flags flag_:;

  _n_ = dim(flags);
  if &N_MONTHS ne dim(flags) then do;
    put "ERROR: Macro symbol N_MONTHS=&N_MONTHS does not agree with the number of flag variables (" _n_ +(-1) ")";
    abort cancel;
  end;

  do until (last_row);
    set have end=last_row;

    %macro test_01(var,places,offset);
      %* code gen for if statements that use bit mask expressions;
      %local index mask;
      %do index = 1 %to &places;
        %let mask = %sysfunc(repeat(.,&index-1))01%sysfunc(repeat(.,&places));
        %let mask = %substr(&mask,2,&places+1);

        %* emit statement for mask test and hash update;

        if &var = "&mask."B then 
          rc = combination.replace(
            key: account_id, key: segment_%eval(&offset+&index), key:  %eval(&index+&offset),
            data:account_id, data:segment_%eval(&offset+&index), data: %eval(&index+&offset)
          );
      %end;
    %mend;

    options mprint nosymbolgen nomlogic;

    %macro test_flag_chunks();
      %* chunk concatenation of flag variables for bit mask evaluation;
      %local count L R B;

      %let L = 1;
      %let R = %sysfunc(MIN(64,&N_MONTHS));

      %do %while (&L < &N_MONTHS);
        %let B = %eval (&R - &L + 1); %* number of 0/1 flag variables to process as bits;

        %* concatenate up to 64 0/1 flag values to be a bit representation of a 8 byte character value;

        chunk = input ( cats ( of flag_&L - flag_&R ), $binary64. );

        %* code generate B if statements that utilize bit mask expressions
        % for testing for 01 at each position;

        %test_01 (chunk, %eval(&B-1), %eval(&L-1))

        %let L = %eval(&L + 63);
        %let R = %eval(&R + 63);
        %if &R > &N_MONTHS %then %let R = &N_MONTHS;
      %end;
    %mend;

    %test_flag_chunks
    *;

    %* Example of the code gen for 84 month use case;
    %* -------------------------------------------------;
    %* chunk = input(cats(of flag1-flag64),$binary64.);
    %* %test_01(chunk,63,0)
    %* chunk = input(cats(of flag64-flag84),$binary64.);    
    %* %test_01(chunk,20,63)                      
    %* -------------------------------------------------;

  end;

  combination.output(dataset:'combinations');
  stop;
run;

/* Partial log of code gen

MPRINT(TEST_FLAG_CHUNKS):   chunk = input ( cats ( of flag_64 - flag_84 ), $binary64. );
MPRINT(TEST_01):   if chunk = "01..................."B then rc = combination.replace( key: account_id, key: segment_64, key: 64, data:account_id, data:segment_64, data: 64 );
MPRINT(TEST_01):   if chunk = ".01.................."B then rc = combination.replace( key: account_id, key: segment_65, key: 65, data:account_id, data:segment_65, data: 65 );
MPRINT(TEST_01):   if chunk = "..01................."B then rc = combination.replace( key: account_id, key: segment_66, key: 66, data:account_id, data:segment_66, data: 66 );
MPRINT(TEST_01):   if chunk = "...01................"B then rc = combination.replace( key: account_id, key: segment_67, key: 67, data:account_id, data:segment_67, data: 67 );
MPRINT(TEST_01):   if chunk = "....01..............."B then rc = combination.replace( key: account_id, key: segment_68, key: 68, data:account_id, data:segment_68, data: 68 );
MPRINT(TEST_01):   if chunk = ".....01.............."B then rc = combination.replace( key: account_id, key: segment_69, key: 69, data:account_id, data:segment_69, data: 69 );
MPRINT(TEST_01):   if chunk = "......01............."B then rc = combination.replace( key: account_id, key: segment_70, key: 70, data:account_id, data:segment_70, data: 70 );
MPRINT(TEST_01):   if chunk = ".......01............"B then rc = combination.replace( key: account_id, key: segment_71, key: 71, data:account_id, data:segment_71, data: 71 );
MPRINT(TEST_01):   if chunk = "........01..........."B then rc = combination.replace( key: account_id, key: segment_72, key: 72, data:account_id, data:segment_72, data: 72 );
MPRINT(TEST_01):   if chunk = ".........01.........."B then rc = combination.replace( key: account_id, key: segment_73, key: 73, data:account_id, data:segment_73, data: 73 );
MPRINT(TEST_01):   if chunk = "..........01........."B then rc = combination.replace( key: account_id, key: segment_74, key: 74, data:account_id, data:segment_74, data: 74 );
MPRINT(TEST_01):   if chunk = "...........01........"B then rc = combination.replace( key: account_id, key: segment_75, key: 75, data:account_id, data:segment_75, data: 75 );
MPRINT(TEST_01):   if chunk = "............01......."B then rc = combination.replace( key: account_id, key: segment_76, key: 76, data:account_id, data:segment_76, data: 76 );
MPRINT(TEST_01):   if chunk = ".............01......"B then rc = combination.replace( key: account_id, key: segment_77, key: 77, data:account_id, data:segment_77, data: 77 );
MPRINT(TEST_01):   if chunk = "..............01....."B then rc = combination.replace( key: account_id, key: segment_78, key: 78, data:account_id, data:segment_78, data: 78 );
MPRINT(TEST_01):   if chunk = "...............01...."B then rc = combination.replace( key: account_id, key: segment_79, key: 79, data:account_id, data:segment_79, data: 79 );
MPRINT(TEST_01):   if chunk = "................01..."B then rc = combination.replace( key: account_id, key: segment_80, key: 80, data:account_id, data:segment_80, data: 80 );
MPRINT(TEST_01):   if chunk = ".................01.."B then rc = combination.replace( key: account_id, key: segment_81, key: 81, data:account_id, data:segment_81, data: 81 );
MPRINT(TEST_01):   if chunk = "..................01."B then rc = combination.replace( key: account_id, key: segment_82, key: 82, data:account_id, data:segment_82, data: 82 );
MPRINT(TEST_01):   if chunk = "...................01"B then rc = combination.replace( key: account_id, key: segment_83, key: 83, data:account_id, data:segment_83, data: 83 );

*/;

* Same post processing;

proc sql;
  create table summary_array_way as
  select segment, position, count(*) as count
  from 
    ( select * from combinations_array_way union all corresponding
      select * from each_position
    )
  group by segment, position;
quit;

proc transpose data=summary_array_way out=want_array_way(drop=_name_ where=(not missing(segment))) prefix=count_;
  by segment;
  id position;
  var count;
  format count 6.;
run;

Upvotes: 0

Tom
Tom

Reputation: 51566

It is still not at all clear what your algorithm is. So let's take a shot at it and see if this is what you mean. First let's convert your pasted listing into actual data.

data have ;
  input date $ actt $ seg1-seg5  abc1-abc5;
cards;
1/20 A X X X X Y 0 0 1 1 1
1/20 B X Y X X Y 1 1 0 1 1
1/20 C Y Y X X X 1 0 1 0 1
;

So there ate 5 months here. Since it looks like for month 3 you need to look at SEGMENT_3, ABC_4 and ABC_5 then you will get N-2 fewer months out than you have in the data. Let's convert this to a tall format. We can use a view so that we don't need to permanently store the vertical dataset.

data step1 / view=step1 ;
  set have ;
  array seg [5];
  array abc [5];
  do month=1 to dim(seg)-2;
   segment=seg[month];
   current=abc[month+1];
   next=abc[month+2];
   count_me=current=0 and next=1;
   output;
  end;
  keep date actt month segment current next count_me;
run;

Now we can add up how many COUNT_ME observations there are per SEGMENT*MONTH. For example by using PROC SQL.

proc sql ;
 create table step2 as
 select segment,month
     , sum(count_me) as method1
 from step1
 group by segment,month
;
quit;

Then to get a dataset in the format you show we just need to transpose that.

proc transpose data=step2 prefix=count_ out=want(drop=_name_);
  by segment ;
  id month;
  var method1 ;
run;

Results:

Obs    segment    count_1    count_2    count_3

 1        X          1          0          1
 2        Y          1          1          .

Notice how there is no value for COUNT_3 for SEGMENT=Y, since Y never appeared in SEG3 in the sample input.

Upvotes: 2

Related Questions