user3658367
user3658367

Reputation: 641

Percentage calculation based on multiple columns SAS

I have a data with patientID and the next column with illness, which has more than one category seperated by commas. I need to find the total number of patients per each illness category and the percentage of patients per category. I tried the normal way, it gives the frequency correct but not the percent.

The data looks like this.

ID    Type_of_illness  
4   lf13  
5   lf5,lf11    
63      
13  lf12    
85      
80      
15      
20      
131 lf6,lf7,lf12  
22      
24      
55  lf12  
150 lf12  
34  lf12  
49  lf12  
151 lf12  
60      
74      
88      
64      
82  lf13  
5   lf5,lf7  
112     
87  lf17  
78      
79  lf16  
83  lf11    

where the empty spaces represent no illness. I first separated the illnesses into separate columns, but then got stuck there not knowing how to process to find out the percent.

The code I wrote is as below:

    Data new;
    set old;
    array P(3) L1 L2 L3;
    do i to dim(p);
    p(i)=scan(type_of_illness,i,',');
    end;
    run;

Then I created a new column to copy all the illnesses to it so I thought it would give me the correct frequency, but it did not give me the correct percent.

data new;
set new;
L=L1;output;
L=L2;output;
L=L3;output;
run;
proc freq data=new;
tables L;run;

I have to create a table something like

*Total numer of patients    Percent*  
.......................................   
lf5         
lf7         
lf6         
lf11            
lf12            
lf13            

Please help.

Upvotes: 2

Views: 1653

Answers (3)

Joe
Joe

Reputation: 63424

The multilabel format solution is interesting, so I present it separately.

Using the same have, we create a format that takes every combination of illnesses and outputs a row for each illness in it, ie, if you have "1,2,3", it outputs rows

1,2,3 = 1
1,2,3 = 2
1,2,3 = 3

Enabling multilabel formats and using a class-enabled proc like proc tabulate, you can then use this to allow each respondent to count in each of the label values, but not be counted more than once against the total.

data for_procformat; 
set have;
start=type_of_illness;                     *start is the input to the format;
hlo=' m';                                  *m means multilabel, adding a space 
                                            here to leave room for the o later;
type='c';                                  *character format - n is numeric;
fmtname='$ILLF';                           *whatever name you like;
do _t = 1 to countw(type_of_illness,',');  *for each 'word' do this once;
  label=scan(type_of_illness,_t,',');      *label is the 'result' of the format;
  if not missing(label) then output;       
end;
if _n_=1 then do;                          *this block adds a row to deal with values;
  hlo='om';                                *not defined (in this case, just missings);
  label='No Illness';                      *the o means 'other';
  output;
end;
run;

proc sort data=for_procformat nodupkey;    *remove duplicates (which there will be many);
by start label;
run;

proc format cntlin=for_procformat;         *import the formats;
quit;

proc tabulate data=have;
class type_of_illness/mlf missing ;        *mlf means multilabel formats;
format type_of_illness $ILLF.;             *apply said format;
tables type_of_illness,n pctn;             *and your table;
run;

Upvotes: 0

rambles
rambles

Reputation: 706

You're trying to output percentages on non-mutually exclusive groups (each illness). It isn't obvious in SAS how to do this.

The following takes Joe's input code but takes an alternative route in determining percentages from event data (a 'long' dataset, if you will). I prefer this to creating a binary variable for each illness at the patient level (a 'wide' dataset) as, for me, this soon gets unwieldy. That said, if you then go on to do some modelling then a 'wide' dataset is usually more useful.

The following code produces output as follows:-


|                       |  Pats  |  Pats  |        |        |  Mean  |        |        |
|                       | with 0 |with 1+ | % with |  Num   | events |        |        |
|                       |records | record | record | Events |per pat |Std Dev | Median |
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf11                   |      24|       2|       8|       2|     1.0|    0.00|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf12                   |      19|       7|      27|       7|     1.0|    0.00|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf13                   |      24|       2|       8|       2|     1.0|    0.00|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf16                   |      25|       1|       4|       1|     1.0|       .|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf17                   |      25|       1|       4|       1|     1.0|       .|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf5                    |      25|       1|       4|       1|     1.0|       .|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf6                    |      25|       1|       4|       1|     1.0|       .|       1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf7                    |      24|       2|       8|       2|     1.0|    0.00|       1|
---------------------------------------------------------------------------------------|

Note that patient 5 is repeated in your data for illness lf5. My code only counts this record once. This is fine if a chronic illness but not if acute. Also, my code includes patients in the denominator who do not have an event.

Finally, you can see another example of this code using dates - with test data - here at the mycodestock.com code sharing site => https://mycodestock.com/public/snippet/11251

Here's the code for the table above:-

options nodate nonumber nocenter pageno=1 obs=max nofmterr ps=52 ls=100 formchar="|----||---|-/\<>*";

data have;
  format type_of_illness $30.;
  infile datalines truncover;
  input ID Type_of_illness $;
  datalines;
  4 lf13
  5 lf5,lf11
  63
  13 lf12
  85
  80
  15
  20
  131 lf6,lf7,lf12
  22
  24
  55 lf12
  150 lf12
  34 lf12
  49 lf12
  151 lf12
  60
  74
  88
  64
  82 lf13
  5 lf5,lf7
  112
  87 lf17
  78
  79 lf16
  83 lf11
  ;;;;
proc sort;
  by id;
run;

**  Create patient level data;
proc sort data = have(keep = id) out = pat_data nodupkey;
  by id;
run;

**  Create event table (1 row per patient*event);
**  NOTE: Patients without events are dropped (as is usual in events data);
data events(drop = i type_of_illness);
  set have;
  attrib grp length = $5 label = 'Illness';

  do i = 1 to countc(type_of_illness, ',') + 1;
    grp = scan(type_of_illness, i, ',');
    if grp ne '' then output;
  end;
run;

**  Count the number of events each patient had for each grp;
**  NOTE: The NODUPKEY in the PROC SORT remove duplicate records (within PAT & GRP);
**  NOTE: The use of CLASSDATA and COMPLETETYPES ensures zero counts for all patients and grps;
proc sort in = events out = perc2_summ_grp_pat nodupkey;
  by grp id;
proc summary data = perc2_summ_grp_pat nway missing classdata = pat_data completetypes;
  by grp;
  class id;
  output out = perc2_summ_grp_pat(rename=(_freq_ = num_events) drop=_type_);
run;

**  Add a denominator variable - value '1' for each row.;
**  Ensure when num_events = 0 the value is set to missing;  
**  Create a flag variable - set to 1 - if a patient has a record (no matter how many);  
data perc2_summ_grp_pat;
  set perc2_summ_grp_pat;
  denom = 1;
  if num_events = 0 then num_events = .;
  flg_scripts = ifn(num_events, 1, .);
run;

proc tabulate data = perc2_summ_grp_pat format=comma8.;
  title1 bold "Table 1: N, % and basic statistics of events within non-mutually exclusive groups";
  title2 "Units: Patients - within each group level";
  title3 "The statistics summarises the number of events (not whether a patient had at least 1 event)";
  title4 "This means, for the statistics, only patients with 1+ record are included in the denominator";

  class grp;
  var denom flg_scripts num_events;
  table grp='', flg_scripts=''*(nmiss='Pats with 0 records' n='Pats with 1+ record' pctsum<denom>='% with record') 
                num_events=''*(sum='Num Events' mean='Mean events per pat'*f=8.1 stddev='Std Dev'*f=8.2 p50='Median');
run; title; footnote;

Upvotes: 1

Joe
Joe

Reputation: 63424

You're going about this right, but you need to pick percent differently. Normally percent is 'percent of whole dataset', which means that it is going to triplicate your base. You want the percent based to the illness. This means you need a 1/0 for each illness.

The one downside is you have the 0's in your automatic tables; you would have to output the table to a dataset and remove them, then proc print/report the resulting dataset to get the 1's only - or use PROC SQL to generate the table.

data have;
format type_of_illness $30.;
infile datalines truncover;
input ID Type_of_illness $;
datalines;
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
;;;;
run;

data want;
set have;
array L[8] lf5-lf7 lf11-lf13 lf16 lf17;
do _t = 1 to dim(L);
  if find(type_of_illness,trim(vname(L[_t]))) then L[_t]=1;
  else L[_t]=0;
end;
run;

proc tabulate data=want;
class lf:;
tables lf:,n pctn;
run;

Upvotes: 0

Related Questions