Pam G
Pam G

Reputation: 17

PROC FORMAT does not work with BY statement in other procedures

I want to get distribution of a variable that is categorized using PROC FORMAT. However I do not get the frequency distribution based on the new groups using BY statement. I discovered this while using PHREG on a larger data. I have given a sample code below.

data p;
input v1 $ v2;
datalines;
A 1
A 2
A 1
A 2
B 3
B 2
C 1
D 1
;
RUN;

proc format;invalue $ v1f 'A','C'='Grp-1' 'B','D'='Grp-2'; run;

proc freq;tables v1; format v1 $v1f.;run;

proc sort;by v1; run;
proc freq;tables v2; by v1;format v1 $v1f.;run;

Not sure why the last PROC FREQ is not working as expected.

I need to keep changing these categories for iterative analysis and so I find PROC FORMAT easy to code but I am very confused as to why it is not working.

Any tips would be appreciated.

Upvotes: 0

Views: 181

Answers (2)

Richard
Richard

Reputation: 27498

Use the Proc FORMAT VALUE Statement to define a custom format.

Proc SQL and PUT() can be used to sort data in formatted order.

Proc FREQ BY processing will honor a formatted value when the contiguous underlying values in the data map to the same formatted value.

proc format; 
  value $v1f 
    'A','C'='Grp-1' 
    'B','D'='Grp-2';
run;
proc sql;
  create table two as
  select * 
  from have 
  order by put(v1,$v1f.), v1  /* ensure order is by formatted value, and then unerlying value within (for good measure in case data is viewed rawly) */
;
proc freq;
  tables v2; 
  by v1;
  format v1 $v1f.;
run;

enter image description here

Upvotes: 0

Tom
Tom

Reputation: 51566

To FORMAT a variable you need to use a FORMAT. The INVALUE statement is for defining an INFORMAT. To define a FORMAT you need to use the VALUE statement instead.

FORMATs are used to convert values to text. INFORMATs are used to convert text to values. You use a FORMAT with the FORMAT and PUT statements and the PUT() function. You use an INFORMAT with the INFORMAT and INPUT statements and the INPUT() function.

BY groups are done by the actual values, not the formatted values. If you want the frequencies of V1 crossed with V2 specify that in the TABLES statement.

proc freq;
  tables v1*v2;
  format v1 $v1f.;
run;

Results

The FREQ Procedure

Table of v1 by v2

v1        v2

Frequency|
Percent  |
Row Pct  |
Col Pct  |       1|       2|       3|  Total
---------+--------+--------+--------+
Grp-1    |      3 |      2 |      0 |      5
         |  37.50 |  25.00 |   0.00 |  62.50
         |  60.00 |  40.00 |   0.00 |
         |  75.00 |  66.67 |   0.00 |
---------+--------+--------+--------+
Grp-2    |      1 |      1 |      1 |      3
         |  12.50 |  12.50 |  12.50 |  37.50
         |  33.33 |  33.33 |  33.33 |
         |  25.00 |  33.33 | 100.00 |
---------+--------+--------+--------+
Total           4        3        1        8
            50.00    37.50    12.50   100.00

If you want to sort by the formatted value then use the PUT() function to make a new variable.

 data by_group;
    set p ;
    group = put(v1,$v1f.);
 run;
 proc sort data=by_group;
   by group;
 run;

Upvotes: 3

Related Questions