Reputation: 69
I am trying to find efficient, transpose-like code that can convert the Test table below into the Final table. I have code below that works for this example, but in practice the split data step can produce a very deep intermediate table (137 million records in my case, vs. roughly 2k once aggregated).
I was hoping there was some Proc or Data step trick I was missing that could skip that intermediate step and be more efficient.
Data test;
Input f1 $ f2 $ f3 $ f4 $ f5 $ f6 $ f7 $ f8 $;
DataLines;
a c f h k l o q
a c f h k l o q
a c g h k m o q
b c g h k m o q
b d g i k m o r
b d g i k n o r
b e g j k n o s
b e g j k n p s
;
Run;
Data final;
Input field $ values $ records;
DataLines;
f1 a 3
f1 b 5
f2 c 4
f2 d 2
f2 e 2
f3 f 2
f3 g 6
f4 h 4
f4 i 2
f4 j 2
f5 k 8
f6 l 2
f6 m 3
f6 n 3
f7 o 7
f7 p 1
f8 q 4
f8 r 2
f8 s 2
;
Run;
/*Working solution - could it be done more efficiently?*/
Data split;
Set test;
Array f{8} f1-f8;
Do i=1 To 8;
field = 'f'||strip(PUT(i,best2.));
values = f{i};
Output;
End;
Drop i f1-f8;
Run;
Proc SQL;
Create Table final As
Select
field
,values
,COUNT(*) As records Format=comma8.0
From split
Group By 1,2
Order By 1,2
;
Quit;
Upvotes: 1
Views: 705
Reputation: 27508
Massar:
Frequency counting can be done in a DATA Step hash object that uses the suminc: and keysum: features:
Data have;
Input f1 $ f2 $ f3 $ f4 $ f5 $ f6 $ f7 $ f8 $;
DataLines;
a c f h k l o q
a c f h k l o q
a c g h k m o q
b c g h k m o q
b d g i k m o r
b d g i k n o r
b e g j k n o s
b e g j k n p s
;
Run;
data want(keep=field value count);
  length field $32 value $8 count 8;
  one = 1;
  call missing (field, value, count);

  /* suminc:'one' accumulates a per-key sum of ONE (i.e., a frequency count); */
  /* the SUM method later retrieves that sum into COUNT */
  declare hash freq(suminc:'one', keysum:'count', ordered:'a', hashexp:20);
  freq.defineKey('field', 'value');
  freq.defineDone();

  /* single pass over the detail data: reference each field/value pair once per row */
  do while ( not end );
    set have end=end;
    array f f1-f8;
    do over f; field = vname(f); value=f; freq.ref(); end;  /* ref() adds the key if new and bumps its count */
  end;

  /* iterate the hash in key order and write one row per field/value combination */
  declare hiter iter("freq");
  rc = iter.first();
  do while(rc = 0);
    rc = freq.sum(sum: count);
    output;
    rc = iter.next();
  end;
  stop;
run;
Upvotes: 1
Reputation: 51566
Easy enough to do with PROC SUMMARY. Use your variable list as the CLASS variables. You can use the WAYS statement to limit it to one-way classification groups, and the CHARTYPE option to make it easy to transform the results. You probably will also want to add the MISSING option to prevent PROC SUMMARY from eliminating input observations that have a missing value for any of the variables.
%let varlist=f1-f8 ;
proc summary data=test chartype missing ;
class &varlist ;
ways 1 ;
output out=result ;
run;
data want ;
  set result ;
  length field $32 value $8 count 8 ;
  array fields &varlist ;
  /* with CHARTYPE and WAYS 1, _TYPE_ holds a single '1' in the position of the CLASS variable for this group */
  index=indexc(_type_,'1');
  field=vname(fields(index));
  value=fields(index);
  count=_freq_;
  keep field value count;
run;
You might want to reverse the order of the variables so that they come out in the original order. So use F8-F1 instead of F1-F8.
Note that all of the variables need to be of the same type because of the ARRAY statement in the data step.
Upvotes: 1
Reputation: 63424
The solution that's likely most time-efficient is to use a summarization procedure; PROC TABULATE seems best suited here.
proc tabulate data=test out=test_c;
class f1-f8;
tables (f1-f8),n;
run;
data want;
  set test_c;
  array f f1-f8;
  col = cats(of f[*]);            /* only one of f1-f8 is nonmissing per row, so this picks up that value */
  which_f = whichc(col,of f[*]);  /* position of the variable that holds that value */
  var = vname(f[which_f]);
  keep col var n;
run;
That final data step is very 'cheap', as it only runs against the summarized dataset, i.e. the roughly 2k-row table. As long as you don't have a huge number of columns this should be pretty simple.
If you do have a huge number of columns as well, then the data step based solution that you present above is probably best. You may be able to speed it up significantly by making the intermediate (split) data step into a view.
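A minimal sketch of that last idea, reusing the split step from the question (the only new piece is the VIEW= option on the DATA statement, so the 137-million-row intermediate is never written to disk; the rows are generated on the fly as PROC SQL reads the view):

Data split / View=split;  /* a view stores the program, not the rows */
  Set test;
  Array f{8} f1-f8;
  Do i=1 To 8;
    field = 'f'||strip(PUT(i,best2.));
    values = f{i};
    Output;
  End;
  Drop i f1-f8;
Run;
/* the existing PROC SQL step runs unchanged against the view */
Proc SQL;
  Create Table final As
  Select field, values, COUNT(*) As records Format=comma8.0
  From split
  Group By 1,2
  Order By 1,2;
Quit;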
Upvotes: 2