SlyGrogger
SlyGrogger

Reputation: 317

SAS: Creating dummy variables from categorical variable

I would like to turn the following long dataset:

data test;
input Id Injury $;
datalines;
1         Ankle
1         Shoulder 
2         Ankle
2         Head
3         Head
3         Shoulder
;
run;

Into a wide dataset that looks like this:

ID  Ankle Shoulder Head
1   1     1        0
2   1     0        1
3   0     1        1'

This answer seemed the most relevant but was falling over at the proc freq stage (my real dataset is around 1 million records, and has around 30 injury types): Creating dummy variables from multiple strings in the same row

Additional help: https://communities.sas.com/t5/SAS-Statistical-Procedures/Possible-to-create-dummy-variables-with-proc-transpose/td-p/235140

Thanks for the help!

Upvotes: 2

Views: 4779

Answers (2)

Dominic Comtois
Dominic Comtois

Reputation: 10431

Here's a solution involving only two steps... Just make sure your data is sorted by id first (the injury column doesn't need to be sorted).

First, create a macro variable containing the list of injuries

proc sql noprint;  
  select distinct injury  
    into :injuries separated by " "  
    from have  
    order by injury;  
quit;  

Then, let RETAIN do the magic -- no transposition needed!

data want(drop=i injury);
  set have;
  by id;

  format &injuries 1.;
  retain &injuries;
  array injuries(*) &injuries;

  if first.id then do i = 1 to dim(injuries);
    injuries(i) = 0;
  end;

  do i = 1 to dim(injuries); 
    if injury = scan("&injuries",i) then injuries(i) = 1;
  end;

  if last.id then output;
run;

EDIT

Following OP's question in the comments, here's how we could use codes and labels for injuries. It could be done directly in the last data step with a label statement, but to minimize hard-coding, I'll assume the labels are entered into a sas dataset.

1 - Define Labels:

data myLabels;
  infile datalines dlm="|" truncover;
  informat injury $12. labl $24.;
  input injury labl;
  datalines;
S460|Acute meniscal tear, medial
S520|Head trauma
;

2 - Add a new query to the existing proc sql step to prepare the label assignment.

proc sql noprint;  

  /* Existing query */
  select distinct injury  
    into :injuries separated by " "  
    from have  
    order by injury;

  /* New query */
  select catx("=",injury,quote(trim(labl)))
    into :labls separated by " "
    from myLabels;
quit;

3 - Then, at the end of the data want step, just add a label statement.

data want(drop=i injury);
  set have;
  by id;

  /* ...same as before... */

  * Add labels;
  label &labls;
run;

And that should do it!

Upvotes: 1

Reeza
Reeza

Reputation: 21294

Here's a basic method that should work easily, even with several million records. First you sort the data, then add in a count to create the 1 variable. Next you use PROC TRANSPOSE to flip the data from long to wide. Then fill in the missing values with a 0. This is a fully dynamic method, it doesn't matter how many different Injury types you have or how many records per person. There are other methods that are probably shorter code, but I think this is simple and easy to understand and modify if required.

data test;
input Id Injury $;
datalines;
1         Ankle
1         Shoulder 
2         Ankle
2         Head
3         Head
3         Shoulder
;
run;

proc sort data=test;
by id injury;
run;

data test2;
set test;
count=1;
run;

proc transpose data=test2 out=want prefix=Injury_;
by id;
var count;
id injury;
idlabel injury;
run;

data want;
set want;
array inj(*) injury_:;

do i=1 to dim(inj);
    if inj(i)=. then inj(i) = 0;
end;

drop _name_ i;
run;

Upvotes: 2

Related Questions