Makoto
Makoto

Reputation: 111

sort and output records with SAS and R

I have the following data set

PatientName BVAID   Rank    TreatmentCode   TreatmentID DoseID  
Tim Stuart  BVA-027 3   OP_TBC            1             1  
Tim Stuart  BVA-041 4   OP_TBC            1             1  
Tim Stuart  BVA-021 7   OP_TBC            1             1  
Tim Stuart  BVA-048 10  OP_TBC            1             1  
Tim Stuart  BVA-020 14  OP_TBC            1             1  
Tim Stuart  BVA-024 15  OP_TBC            1             1  
Tim Stuart  BVA-001 16  OP_TBC            1             1  
Tim Stuart  BVA-013 27  OP_TBC            1             1  
Tim Stuart  BVA-018 28  OP_TBC            1             1  
Tim Stuart  BVA-051 29  OP_TBC            1             1  
Tim Stuart  BVA-027 3   OP_TC             2             1  
Tim Stuart  BVA-041 4   OP_TC             2             1  
Tim Stuart  BVA-048 10  OP_TC             2             1  
Tim Stuart  BVA-020 14  OP_TC             2             1    
Tim Stuart  BVA-001 16  OP_TC             2             1  
Tim Stuart  BVA-002 17  OP_TC             2             1    
Tim Stuart  BVA-019 18  OP_TC             2             1  
Tim Stuart  BVA-044 22  OP_TC             2             1  
Tim Stuart  BVA-025 23  OP_TC             2             1  
Tim Stuart  BVA-016 26  OP_TC             2             1  
Tim Stuart  BVA-013 27  OP_TC             2             1  
Tim Stuart  BVA-001 16  OP_SICO           3             1  
Tim Stuart  BVA-002 17  OP_SICO           3             1  
Tim Stuart  BVA-013 27  OP_SICO           3             1  

I need to output the records with the smallest rank in each TreatmentID group however if the record was outputted in the previous TreatmentID group I need to select the next smallest rank and output the record for the TreamtmentID group - I only need one record per TreatmentID group. This needs to be a scalable solution that I can automate. My output file will only have tree unique records i.e. one per each group and each records is unique in BVAID and will have the smallest rank in that group.

PatientName BVAID   Rank    TreatmentCode   TreatmentID DoseID  
Tim Stuart  BVA-027 3   OP_TBC            1             1  
Tim Stuart  BVA-041 4   OP_TC             2             1  
Tim Stuart  BVA-001 16  OP_SICO           3             1  

which program can handle this better SAS or R

Upvotes: 11

Views: 924

Answers (5)

Matt Dowle
Matt Dowle

Reputation: 59612

Compact, scalable and readable R solution :

require(data.table)
DT = as.data.table(dat)   # dat input from Brian's answer
r = 0
DT[,{r<<-min(Rank[Rank>r]); .SD[Rank==r]}, by=TreatmentID]

     TreatmentID PatientName   BVAID Rank TreatmentCode DoseID
[1,]           1  Tim Stuart BVA-027    3        OP_TBC      1
[2,]           2  Tim Stuart BVA-041    4         OP_TC      1
[3,]           3  Tim Stuart BVA-001   16       OP_SICO      1

Upvotes: 13

Josh O&#39;Brien
Josh O&#39;Brien

Reputation: 162401

Here's an R solution. I'd be really interested to know if there's a method that's much more compact than this.

library(plyr)

df <- df[order(df$PatientName, df$TreatmentID),]

ddply(df, .(PatientName), function(DF) {
    # For each Treatment, find the value of Rank to be kept    
    splitRanks <- split(DF$Rank, DF$TreatmentID)
    minRanks <- Reduce(f = function(X, Y) min(Y[Y>min(X)]), 
                       x = splitRanks[-1], 
                       init = min(splitRanks[[1]]), accumulate = TRUE)
    # For each Treatment, extract row w/ Rank determined by the calculation above
    splitDF <- split(DF, DF$TreatmentID)
    rows <- mapply(FUN = function(X, Y) X[X$Rank==Y,], splitDF, minRanks, 
                   SIMPLIFY = FALSE)
    # Bind the extracted rows back together in a data frame
    do.call("rbind", rows)
})

#   PatientName   BVAID Rank TreatmentCode TreatmentID DoseID
# 1  Tim Stuart BVA-027    3        OP_TBC           1      1
# 2  Tim Stuart BVA-041    4         OP_TC           2      1
# 3  Tim Stuart BVA-001   16       OP_SICO           3      1

Upvotes: 5

Robbie Liu
Robbie Liu

Reputation: 1511

my sas solution.

assume that you have data set (test) and sort it like what you have done here (by patientname, treatmentid then rank). This code fits in multiple patientnames situation and assume that these steps are performed for every one of patientname (delete all patientname related if you don't want this level )

%macro m1();
%begin: proc append base=new data=test(firstobs=1 obs=1);

data _null_;
set test(firstobs=1 obs=1);
call symput('r', rank);
call symput('id',Treatmentid);
call symput('name',patientname);

data test;
set test;
if (rank=&r or Treatmentid=&id) and patientname=symget('name') then delete;

%let dsid=%sysfunc(open(test));
%let nobs=%sysfunc(attrn(&dsid,nobs)); 
%let rc=%sysfunc(close(&dsid));

%if &nobs^=0 %then %goto begin;
%mend;

%m1(); run;

Upvotes: 2

Robert Penridge
Robert Penridge

Reputation: 8513

My SAS solution. All steps are scalable:

data test;
  input  PatientName $ 1-10
         BVAID $ 
         Rank    
         TreatmentCode $  
         TreatmentID 
         DoseID 
        ;
datalines;
Tim Stuart  BVA-027 3   OP_TBC            1             1  
Tim Stuart  BVA-041 4   OP_TBC            1             1  
Tim Stuart  BVA-021 7   OP_TBC            1             1  
Tim Stuart  BVA-048 10  OP_TBC            1             1  
Tim Stuart  BVA-020 14  OP_TBC            1             1  
Tim Stuart  BVA-024 15  OP_TBC            1             1  
Tim Stuart  BVA-001 16  OP_TBC            1             1  
Tim Stuart  BVA-013 27  OP_TBC            1             1  
Tim Stuart  BVA-018 28  OP_TBC            1             1  
Tim Stuart  BVA-051 29  OP_TBC            1             1  
Tim Stuart  BVA-027 3   OP_TC             2             1  
Tim Stuart  BVA-041 4   OP_TC             2             1  
Tim Stuart  BVA-048 10  OP_TC             2             1  
Tim Stuart  BVA-020 14  OP_TC             2             1    
Tim Stuart  BVA-001 16  OP_TC             2             1  
Tim Stuart  BVA-002 17  OP_TC             2             1    
Tim Stuart  BVA-019 18  OP_TC             2             1  
Tim Stuart  BVA-044 22  OP_TC             2             1  
Tim Stuart  BVA-025 23  OP_TC             2             1  
Tim Stuart  BVA-016 26  OP_TC             2             1  
Tim Stuart  BVA-013 27  OP_TC             2             1  
Tim Stuart  BVA-001 16  OP_SICO           3             1  
Tim Stuart  BVA-002 17  OP_SICO           3             1  
Tim Stuart  BVA-013 27  OP_SICO           3             1  
;
run;

proc sort data=test;
  by treatmentid;
run;

data test2;
  set test;
  by treatmentid;
  retain smallest;

  **
  ** CREATE AN EMPTY HASH TABLE THAT WE CAN STORE A LIST OF 
  ** RANKS IN THAT HAVE ALREADY BEEN USED. DONE THIS WAY FOR
  ** SCALABILITY.
  *;
  if _n_ eq 1 then do;
    declare hash ht();
    ht.definekey ('rank');
    ht.definedone();
  end;

  if first.treatmentid then do;
    smallest = .;
  end;

  **
  ** IF THE CURRENT RANK HAS NOT ALREADY BEEN USED THEN 
  ** EVALUATE IT TO SEE IF ITS THE SMALLEST VALUE.
  *;
  if ht.find() ne 0 then do;
    smallest = min(smallest,rank);
  end;

  **
  ** SAVE THE SMALLEST UNUSED RANK BACK TO THE RANK VALUE.
  ** THEN ADD IT TO THE HASH TABLE AND FINALLY OUTPUT THE
  ** OBSERVATION.
  *;
  if last.treatmentid then do;
    rank = smallest;
    ht.add();    
    output;
  end;

  drop smallest;
run;

Does SAS win? jk! ;-)

Upvotes: 5

Brian Diggs
Brian Diggs

Reputation: 58845

Here is another R solution. What makes this a harder problem than most is that it can not be treated as a split-apply-combine problem since the row to be selected depends not only on all rows with a given TreatmentID, but also the result of what was determined by the previous (assuming that means the next smallest) TreatmentID.

First, the data in a pasteable form (in case anyone else wants to take a crack at it):

dat <-
structure(list(PatientName = c("Tim Stuart", "Tim Stuart", "Tim Stuart", 
"Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", 
"Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", 
"Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", 
"Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", 
"Tim Stuart"), BVAID = c("BVA-027", "BVA-041", "BVA-021", "BVA-048", 
"BVA-020", "BVA-024", "BVA-001", "BVA-013", "BVA-018", "BVA-051", 
"BVA-027", "BVA-041", "BVA-048", "BVA-020", "BVA-001", "BVA-002", 
"BVA-019", "BVA-044", "BVA-025", "BVA-016", "BVA-013", "BVA-001", 
"BVA-002", "BVA-013"), Rank = c(3L, 4L, 7L, 10L, 14L, 15L, 16L, 
27L, 28L, 29L, 3L, 4L, 10L, 14L, 16L, 17L, 18L, 22L, 23L, 26L, 
27L, 16L, 17L, 27L), TreatmentCode = c("OP_TBC", "OP_TBC", "OP_TBC", 
"OP_TBC", "OP_TBC", "OP_TBC", "OP_TBC", "OP_TBC", "OP_TBC", "OP_TBC", 
"OP_TC", "OP_TC", "OP_TC", "OP_TC", "OP_TC", "OP_TC", "OP_TC", 
"OP_TC", "OP_TC", "OP_TC", "OP_TC", "OP_SICO", "OP_SICO", "OP_SICO"
), TreatmentID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), DoseID = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("PatientName", "BVAID", 
"Rank", "TreatmentCode", "TreatmentID", "DoseID"), class = "data.frame", 
row.names = c(NA, -24L))

Now my solution

matches <- dat[0,]
TreatmentIDs <- sort(unique(dat$TreatmentID))
for (TreatmentIDidx in seq_along(TreatmentIDs)) {
    TreatmentID <- TreatmentIDs[TreatmentIDidx]
    treat.flg <- dat$TreatmentID == TreatmentID
    match <- dat[treat.flg &
                 dat$Rank == min(setdiff(dat$Rank[treat.flg],
                                         matches$Rank[matches$TreatmentID==
                                                      TreatmentIDs[TreatmentIDidx-1]])),]
    matches <- rbind(matches, match)
}

which gives the desired result:

> matches
   PatientName   BVAID Rank TreatmentCode TreatmentID DoseID
1   Tim Stuart BVA-027    3        OP_TBC           1      1
12  Tim Stuart BVA-041    4         OP_TC           2      1
22  Tim Stuart BVA-001   16       OP_SICO           3      1

My SAS is rusty, and I don't have a copy to try things with right now, so I'll leave it to someone else to make a SAS solution to compare with.

Upvotes: 3

Related Questions