Justin
Justin

Reputation: 55

In SAS, how do you collapse multiple rows into one row based on some ID variable?

The data I am working with is currently in the form of:

 ID     Sex      Race         Drug         Dose          FillDate  
 1      M        White        ziprosidone  100mg         10/01/98     
 1      M        White        ziprosidone  100mg         10/15/98
 1      M        White        ziprosidone  100mg         10/29/98
 1      M        White        ambien       20mg          01/07/99
 1      M        White        ambien       20mg          01/14/99
 2      F        Asian        telaprevir   500mg         03/08/92
 2      F        Asian        telaprevir   500mg         03/20/92
 2      F        Asian        telaprevir   500mg         04/01/92

And I would like to write SQL code to get the data in the form of:

 ID     Sex    Race      Drug1        DrugDose1     FillDate1_1     FillDate1_2     FillDate1_3    Drug2     DrugDose2   FillDate2_1     FillDate2_2     FillDate2_3     
 1      M      White     ziprosidone  100mg         10/01/98        10/15/98        10/29/98       ambien    20mg        01/07/99        01/14/99        null
 2      F      Asian     telaprevir   500mg         03/08/92        03/20/92        04/01/92       null      null        null            null            null

I need just one row for each unique ID with all of the unique drug/dose/fill info in columns, not rows. I suppose it can be done using PROC TRANSPOSE, but I am not sure of the most efficient way of doing the multiple transposes. I should note that I have over 50,000 unique IDs, each with varying amounts of drugs, doses, and corresponding fill dates. I would like to return null/empty values for those columns that do not have data to fill in. Thanks in advance.

Upvotes: 0

Views: 9023

Answers (3)

user667489
user667489

Reputation: 9569

Here's my attempt at an array-based solution:

/*  Import data */
 data have; 
 input @2 ID  @9 Sex $1. @18 Race $5. @31 Drug $11. @44 Dose $5. @58 FillDate mmddyy8.;
 format filldate yymmdd10.;
 cards;
 1      M        White        ziprosidone  100mg         10/01/98     
 1      M        White        ziprosidone  100mg         10/15/98
 1      M        White        ziprosidone  100mg         10/29/98
 1      M        White        ambien       20mg          01/07/99
 1      M        White        ambien       20mg          01/14/99
 2      F        Asian        telaprevir   500mg         03/08/92
 2      F        Asian        telaprevir   500mg         03/20/92
 2      F        Asian        telaprevir   500mg         04/01/92
 ;
 run;


/* Calculate array bounds - SQL version  */
proc sql _method noprint;
    select DATES into :MAX_DATES_PER_DRUG trimmed from 
        (select count(ID) as DATES from have group by ID, drug, dose)
        having DATES = max(DATES);
    select max(DRUGS) into :MAX_DRUGS_PER_ID trimmed from 
        (select count(DRUG) as DRUGS from 
            (select distinct DRUG, ID from have)
            group by ID
        )
    ;       
quit;

/* Calculate array bounds - data step version */
data _null_;
    set have(keep = id drug) end = eof;
    by notsorted id drug;
    retain max_drugs_per_id max_dates_per_drug;
    if first.id   then drug_count = 0;
    if first.drug then do;
        drug_count + 1;
        date_count = 0;
    end;
    date_count + 1;
    if last.id      then max_drugs_per_id   = max(max_drugs_per_id,     drug_count);
    if last.drug    then max_dates_per_drug = max(max_dates_per_drug,   date_count);
    if eof then do;
        call symput("max_drugs_per_id"  ,cats(max_drugs_per_id));
        call symput("max_dates_per_drug",cats(max_dates_per_drug));     
    end;
run;


/* Check macro vars */
%put MAX_DATES_PER_DRUG = "&MAX_DATES_PER_DRUG";
%put MAX_DRUGS_PER_ID   = "&MAX_DRUGS_PER_ID";

/* Transpose */
data want;
    if 0 then set have;
    array filldates[&MAX_DRUGS_PER_ID,&MAX_DATES_PER_DRUG] 
    %macro arraydef;
        %local i;
        %do i = 1 %to &MAX_DRUGS_PER_ID;
            filldates&i._1-filldates&i._&MAX_DATES_PER_DRUG
        %end;
    %mend arraydef;
    %arraydef;
    array drugs[&MAX_DRUGS_PER_ID] $11;
    array doses[&MAX_DRUGS_PER_ID] $5;
    drug_count = 0;
    do until(last.id);
        set have;
        by ID drug dose notsorted;
        if first.drug then do;
            date_count = 0;
            drug_count + 1;
            drugs[drug_count] = drug;
            doses[drug_count] = dose;
        end;
        date_count + 1;
        filldates[drug_count,date_count] = filldate;
    end;
    drop drug dose filldate drug_count date_count;
    format filldates: yymmdd10.;
run;

The data step code for calculating the array bounds is probably more efficient than the SQL version, but it's also bit more verbose. On the other hand, with the SQL version you also have to trim whitespace from the macro vars. Fixed - thanks Tom!

The transposing data step is probably also at the more efficient end of the scale compared to the proc transpose / proc sql options in the other answers, as it makes only 1 further pass through the dataset, but again it's also fairly complex.

Upvotes: 0

Parfait
Parfait

Reputation: 107707

Consider the following query using two derived tables (inner and outer) that establishes an ordinal row count by the FillDate order. Then, using the row count, if/then or case/when logic is used for iterated columns. Outer query takes the max values grouped by id, sex, race.

The only caveat is knowing ahead how many expected or max number of rows per ID (i.e., another query our table browse). Hence, fill in ellipsis (...) as needed. Do note, missings will generate for columns that do not apply to a particular ID. And of course please adjust to actual dataset name.

proc sql;
CREATE TABLE DrugTableFlat AS ( 
SELECT id, sex, race,
       Max(Drug_1) As Drug1, Max(Drug_2) As Drug2, Max(Drug_3) As Drug3, ...
       Max(Dose_1) As Dose1, Max(Dose_2) As Dose2, Max(Dose_3) As Dose3, ...
       Max(FillDate_1) As FillDate1, Max(FillDate_2) As FillDate2, 
       Max(FillDate_3) As FillDate3 ...
FROM 
   (SELECT id, sex, race,
       CASE WHEN RowCount=1 THEN Drug END AS Drug_1,
       CASE WHEN RowCount=2 THEN Drug END AS Drug_2,
       CASE WHEN RowCount=3 THEN Drug END AS Drug_3,
       ...
       CASE WHEN RowCount=1 THEN Dose END AS Dose_1,
       CASE WHEN RowCount=2 THEN Dose END AS Dose_2,
       CASE WHEN RowCount=3 THEN Dose END AS Dose_3,
       ...
       CASE WHEN RowCount=1 THEN FillDate END AS FillDate_1,
       CASE WHEN RowCount=2 THEN FillDate END AS FillDate_2,
       CASE WHEN RowCount=3 THEN FillDate END AS FillDate_3,
       ...
    FROM
       (SELECT t1.id, t1.sex, t1.race, t1.drug, t1.dose, t1.filldate,
          (SELECT Count(*) FROM DrugTable t2 
           WHERE t1.filldate >= t2.filldate AND t1.id = t2.id) As RowCount
        FROM DrugTable t1) AS dT1
    ) As dT2
GROUP BY id, sex, race);

Upvotes: 0

Joe
Joe

Reputation: 63434

To some extent, the desired efficiency of this determines the best solution.

For example, assuming you know the maximum reasonable number of fill dates, you could use the following to very quickly get a transposed table - likely the fastest way to do that - but at the cost of needing a large amount of post-processing, as it will output a lot of data you don't really want.

proc summary data=have nway;
class id sex race;
output out=want (drop=_:) 
        idgroup(out[5] (drug dose filldate)=) / autoname;
run;

On the other side of things, the vertical-and-transpose is the "best" solution in terms of not requiring additional steps; though it might be slow.

data have_t;
  set have;
  by id sex race drug dose notsorted;
  length varname value $64; *some reasonable maximum, particularly for the drug name;
  if first.ID then do;
    drugcounter=0;
  end;     
  if first.dose then do;
    drugcounter+1; 
    fillcounter=0;
    varname = cats('Drug',drugcounter);
    value   = drug;
    output;
    varname = cats('DrugDose',drugcounter);
    value = dose;
    output;
  end;
  call missing(value);
  fillcounter+1;
  varname=cats('Filldate',drugcounter,'_',fillcounter);
  value_n = filldate;
  output;
run;
proc transpose data=have_t(where=(not missing(value))) out=want_c;
  by id sex race ;
  id varname;
  var value;
run;
proc transpose data=have_t(where=(not missing(value_n))) out=want_n;
  by id sex race ;
  id varname;
  var value_n;
run;

data want;
  merge want_c want_n;
  by id sex race;
run;

It's not crazy slow, really, and odds are it's fine for your 50k IDs (though you don't say how many drugs). 1 or 2 GB of data will work fine here, especially if you don't need to sort them.

Finally, there are some other solutions that are in between. You could do the transpose entirely using arrays in the data step, for one, which might be the best compromise; you have to determine in advance the maximum bounds for the arrays, but that's not the end of the world.

It all depends on your data, though, which is really the best. I would probably try the data step/transpose first: that's the most straightforward, and the one most other programmers will have seen before, so it's most likely the best solution unless it's prohibitively slow.

Upvotes: 2

Related Questions