sswwqqaa
sswwqqaa

Reputation: 1616

Divide a dataset into subsets based on a column and perform a repeated operation for subsets

I need to perform the same operation on many different periods. In my sample data for two periods: 402 and 403.

I cannot understand the concept of how I can make a loop that will do it for me.

At the end, I'd like to have final1 for period 402, final2 for period 403 etc.

Sample data that I use for testing:

data one;
  input period $ a $ b $ c $ d e;
cards;
402 a . a 1 3
402 . b . 2 4
402 a a a . 5
402 . . b 3 5
403 a a a . 6
403 a a a . 7
403 a a a 2 8
;
run;

This is how I manually choose one period of one data:

data new;
 set one;
 where period='402';
run;

This is how I calculate different things for the given period e.g. number of missing data, non-missing, total:

1 - For numeric variables:

proc iml;
use new;
read all var _NUM_ into x[colname=nNames]; 
n = countn(x,"col");
nmiss = countmiss(x,"col");
ntotal = n + nmiss;

2 - and similarly for char variables:

read all var _CHAR_ into x[colname=cNames]; 
close nww;
c = countn(x,"col");
cmiss = countmiss(x,"col");
ctotal = c + cmiss;

Save numeric and char results:

create cnt1Data var {nNames n nmiss ntotal}; 
append;
close cnt1Data; 

create cnt2Data var {cNames c cmiss ctotal}; 
append;
close cnt2Data; 

Rename columns to be the same:

data cnt1Datatemp;
set cnt1Data;
rename nNames = Name n = nonMissing nmiss = missing ntotal = total;
run;
data cnt2Datatemp;
set cnt2Data;
rename cNames = Name c = nonMissing cmiss = missing ctotal = total;
run;

and merge data into the final set:

data final;
set cnt1Datatemp cnt2Datatemp;
run;

Final data for period 402 should look like:

a b c d e
2 2 1 1 0   -  missing
2 2 3 3 4   -  non-missing
4 4 4 4 4   -  total

and respectively for period 403:

a b c d e
0 0 0 2 0   -  missing
3 3 3 1 3   -  non-missing
3 3 3 3 3   -  total

Upvotes: 0

Views: 269

Answers (3)

Tom
Tom

Reputation: 51566

You can make something similar with simple SQL query.

create table miss_count as select period
           , sum(missing(A)) as A
           , sum(missing(B)) as B
           ...
from have
group by period
;

Results:

period   a    b    c    d    e

 402     2    2    1    1    0
 403     0    0    0    2    0

It you add in

, count(*) as nobs

then you have all the information you need to calculate all of the counts you wanted.

If the number of variables is short enough you can even generate the code into a macro variable (limit of 64K bytes in a macro variable)

proc sql noprint;
select catx(' ','sum(missing(',nliteral(name),')) as',nliteral(name))
  into :varlist separated by ','
  from dictionary.columns 
  where libname='WORK' and memname='ONE' and lowcase(name) ne 'period'
;
create table miss_count as select period,count(*) as nobs,&varlist
  from one
  group by period
;
quit;

Results:

period    nobs    a    b    c    d    e

 402        4     2    2    1    1    0
 403        3     0    0    0    2    0

Upvotes: 1

Dirk Horsten
Dirk Horsten

Reputation: 3845

You don't need iml to summarize data over observations. You can do that with a retain statement too. Moreover, using by processing with first and last, you can process all periods in one go.

data final;
    set one;
    by period;
    if first.period then do;
        mis_a = 0;
        total = 0;
    end;
    retain mis_a;
    if missing(a) then mis_a +=1; else fil_a += 1;
    total += 1;
    if last.period;
    fil_a = total - mis_a;
end;

This is by far the fastest way to handle a big dataset if the data is sorted by period. To make it work for a set of variables not known upfront, you can apply the same techniques as in my other solution.

Upvotes: 1

Dirk Horsten
Dirk Horsten

Reputation: 3845

It is much easier to find this information in sql;

proc sql;
    select  sum(a is not missing) as fil_a
    ,   sum(a is missing) as mis_a
    ,   count(*) as tot_a
    from one
    where period eq 402;
quit;

You can even 0handle all periods at once using group by.

There are a few ways to make this work for all variables in a dataset (except for some group by variables). For instance:

%macro count_missing();
proc sql;
    select  count(*), name
    into    :no_var, :var_list separated by ' '
    from sasHelp.vcolumn 
    where libName eq 'WORK' and memName eq 'ONE' and upcase(name) ne 'PERIOD';

    create view count_missing as 
    select  count(*) as total
    %do var_nr = 1 %to &no_var;
        %let var = %scan(&var_list, &var_nr);
        , sum(&var is missing) as mis_&var
    %end;
    from work.one
    group by period;
quit;
data report_missing;
    set count_missing;
    format count_of $32.;
    
    count_of = 'missing';
    %do var_nr = 1 %to &no_var;
        %let var = %scan(&var_list, &var_nr);
        &var = mis_&var;
    %end;
    output;

    count_of = 'non missing';
    %do var_nr = 1 %to &no_var;
        %let var = %scan(&var_list, &var_nr);
        &var = total - mis_&var;
    %end;
    output;

    count_of = 'total';
    %do var_nr = 1 %to &no_var;
        %let var = %scan(&var_list, &var_nr);
        &var = total;
    %end;
    output;
end;
%mend;
%count_missing();

Upvotes: 1

Related Questions