Reputation: 111
I have the following data set
PatientName BVAID Rank TreatmentCode TreatmentID DoseID
Tim Stuart BVA-027 3 OP_TBC 1 1
Tim Stuart BVA-041 4 OP_TBC 1 1
Tim Stuart BVA-021 7 OP_TBC 1 1
Tim Stuart BVA-048 10 OP_TBC 1 1
Tim Stuart BVA-020 14 OP_TBC 1 1
Tim Stuart BVA-024 15 OP_TBC 1 1
Tim Stuart BVA-001 16 OP_TBC 1 1
Tim Stuart BVA-013 27 OP_TBC 1 1
Tim Stuart BVA-018 28 OP_TBC 1 1
Tim Stuart BVA-051 29 OP_TBC 1 1
Tim Stuart BVA-027 3 OP_TC 2 1
Tim Stuart BVA-041 4 OP_TC 2 1
Tim Stuart BVA-048 10 OP_TC 2 1
Tim Stuart BVA-020 14 OP_TC 2 1
Tim Stuart BVA-001 16 OP_TC 2 1
Tim Stuart BVA-002 17 OP_TC 2 1
Tim Stuart BVA-019 18 OP_TC 2 1
Tim Stuart BVA-044 22 OP_TC 2 1
Tim Stuart BVA-025 23 OP_TC 2 1
Tim Stuart BVA-016 26 OP_TC 2 1
Tim Stuart BVA-013 27 OP_TC 2 1
Tim Stuart BVA-001 16 OP_SICO 3 1
Tim Stuart BVA-002 17 OP_SICO 3 1
Tim Stuart BVA-013 27 OP_SICO 3 1
I need to output the records with the smallest rank
in each TreatmentID
group however if the record was outputted in the previous TreatmentID
group I need to select the next smallest rank
and output the record for the TreamtmentID
group - I only need one record per TreatmentID
group.
This needs to be a scalable solution that I can automate.
My output file will only have tree unique records i.e. one per each group and each records is unique in BVAID
and will have the smallest rank in that group.
PatientName BVAID Rank TreatmentCode TreatmentID DoseID
Tim Stuart BVA-027 3 OP_TBC 1 1
Tim Stuart BVA-041 4 OP_TC 2 1
Tim Stuart BVA-001 16 OP_SICO 3 1
which program can handle this better SAS or R
Upvotes: 11
Views: 924
Reputation: 59612
Compact, scalable and readable R solution :
require(data.table)
DT = as.data.table(dat) # dat input from Brian's answer
r = 0
DT[,{r<<-min(Rank[Rank>r]); .SD[Rank==r]}, by=TreatmentID]
TreatmentID PatientName BVAID Rank TreatmentCode DoseID
[1,] 1 Tim Stuart BVA-027 3 OP_TBC 1
[2,] 2 Tim Stuart BVA-041 4 OP_TC 1
[3,] 3 Tim Stuart BVA-001 16 OP_SICO 1
Upvotes: 13
Reputation: 162401
Here's an R solution. I'd be really interested to know if there's a method that's much more compact than this.
library(plyr)
df <- df[order(df$PatientName, df$TreatmentID),]
ddply(df, .(PatientName), function(DF) {
# For each Treatment, find the value of Rank to be kept
splitRanks <- split(DF$Rank, DF$TreatmentID)
minRanks <- Reduce(f = function(X, Y) min(Y[Y>min(X)]),
x = splitRanks[-1],
init = min(splitRanks[[1]]), accumulate = TRUE)
# For each Treatment, extract row w/ Rank determined by the calculation above
splitDF <- split(DF, DF$TreatmentID)
rows <- mapply(FUN = function(X, Y) X[X$Rank==Y,], splitDF, minRanks,
SIMPLIFY = FALSE)
# Bind the extracted rows back together in a data frame
do.call("rbind", rows)
})
# PatientName BVAID Rank TreatmentCode TreatmentID DoseID
# 1 Tim Stuart BVA-027 3 OP_TBC 1 1
# 2 Tim Stuart BVA-041 4 OP_TC 2 1
# 3 Tim Stuart BVA-001 16 OP_SICO 3 1
Upvotes: 5
Reputation: 1511
my sas solution.
assume that you have data set (test) and sort it like what you have done here (by patientname, treatmentid then rank). This code fits in multiple patientnames situation and assume that these steps are performed for every one of patientname (delete all patientname related if you don't want this level )
%macro m1();
%begin: proc append base=new data=test(firstobs=1 obs=1);
data _null_;
set test(firstobs=1 obs=1);
call symput('r', rank);
call symput('id',Treatmentid);
call symput('name',patientname);
data test;
set test;
if (rank=&r or Treatmentid=&id) and patientname=symget('name') then delete;
%let dsid=%sysfunc(open(test));
%let nobs=%sysfunc(attrn(&dsid,nobs));
%let rc=%sysfunc(close(&dsid));
%if &nobs^=0 %then %goto begin;
%mend;
%m1(); run;
Upvotes: 2
Reputation: 8513
My SAS solution. All steps are scalable:
data test;
input PatientName $ 1-10
BVAID $
Rank
TreatmentCode $
TreatmentID
DoseID
;
datalines;
Tim Stuart BVA-027 3 OP_TBC 1 1
Tim Stuart BVA-041 4 OP_TBC 1 1
Tim Stuart BVA-021 7 OP_TBC 1 1
Tim Stuart BVA-048 10 OP_TBC 1 1
Tim Stuart BVA-020 14 OP_TBC 1 1
Tim Stuart BVA-024 15 OP_TBC 1 1
Tim Stuart BVA-001 16 OP_TBC 1 1
Tim Stuart BVA-013 27 OP_TBC 1 1
Tim Stuart BVA-018 28 OP_TBC 1 1
Tim Stuart BVA-051 29 OP_TBC 1 1
Tim Stuart BVA-027 3 OP_TC 2 1
Tim Stuart BVA-041 4 OP_TC 2 1
Tim Stuart BVA-048 10 OP_TC 2 1
Tim Stuart BVA-020 14 OP_TC 2 1
Tim Stuart BVA-001 16 OP_TC 2 1
Tim Stuart BVA-002 17 OP_TC 2 1
Tim Stuart BVA-019 18 OP_TC 2 1
Tim Stuart BVA-044 22 OP_TC 2 1
Tim Stuart BVA-025 23 OP_TC 2 1
Tim Stuart BVA-016 26 OP_TC 2 1
Tim Stuart BVA-013 27 OP_TC 2 1
Tim Stuart BVA-001 16 OP_SICO 3 1
Tim Stuart BVA-002 17 OP_SICO 3 1
Tim Stuart BVA-013 27 OP_SICO 3 1
;
run;
proc sort data=test;
by treatmentid;
run;
data test2;
set test;
by treatmentid;
retain smallest;
**
** CREATE AN EMPTY HASH TABLE THAT WE CAN STORE A LIST OF
** RANKS IN THAT HAVE ALREADY BEEN USED. DONE THIS WAY FOR
** SCALABILITY.
*;
if _n_ eq 1 then do;
declare hash ht();
ht.definekey ('rank');
ht.definedone();
end;
if first.treatmentid then do;
smallest = .;
end;
**
** IF THE CURRENT RANK HAS NOT ALREADY BEEN USED THEN
** EVALUATE IT TO SEE IF ITS THE SMALLEST VALUE.
*;
if ht.find() ne 0 then do;
smallest = min(smallest,rank);
end;
**
** SAVE THE SMALLEST UNUSED RANK BACK TO THE RANK VALUE.
** THEN ADD IT TO THE HASH TABLE AND FINALLY OUTPUT THE
** OBSERVATION.
*;
if last.treatmentid then do;
rank = smallest;
ht.add();
output;
end;
drop smallest;
run;
Does SAS win? jk! ;-)
Upvotes: 5
Reputation: 58845
Here is another R
solution. What makes this a harder problem than most is that it can not be treated as a split-apply-combine problem since the row to be selected depends not only on all rows with a given TreatmentID
, but also the result of what was determined by the previous (assuming that means the next smallest) TreatmentID
.
First, the data in a pasteable form (in case anyone else wants to take a crack at it):
dat <-
structure(list(PatientName = c("Tim Stuart", "Tim Stuart", "Tim Stuart",
"Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart",
"Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart",
"Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart",
"Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart", "Tim Stuart",
"Tim Stuart"), BVAID = c("BVA-027", "BVA-041", "BVA-021", "BVA-048",
"BVA-020", "BVA-024", "BVA-001", "BVA-013", "BVA-018", "BVA-051",
"BVA-027", "BVA-041", "BVA-048", "BVA-020", "BVA-001", "BVA-002",
"BVA-019", "BVA-044", "BVA-025", "BVA-016", "BVA-013", "BVA-001",
"BVA-002", "BVA-013"), Rank = c(3L, 4L, 7L, 10L, 14L, 15L, 16L,
27L, 28L, 29L, 3L, 4L, 10L, 14L, 16L, 17L, 18L, 22L, 23L, 26L,
27L, 16L, 17L, 27L), TreatmentCode = c("OP_TBC", "OP_TBC", "OP_TBC",
"OP_TBC", "OP_TBC", "OP_TBC", "OP_TBC", "OP_TBC", "OP_TBC", "OP_TBC",
"OP_TC", "OP_TC", "OP_TC", "OP_TC", "OP_TC", "OP_TC", "OP_TC",
"OP_TC", "OP_TC", "OP_TC", "OP_TC", "OP_SICO", "OP_SICO", "OP_SICO"
), TreatmentID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), DoseID = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("PatientName", "BVAID",
"Rank", "TreatmentCode", "TreatmentID", "DoseID"), class = "data.frame",
row.names = c(NA, -24L))
Now my solution
matches <- dat[0,]
TreatmentIDs <- sort(unique(dat$TreatmentID))
for (TreatmentIDidx in seq_along(TreatmentIDs)) {
TreatmentID <- TreatmentIDs[TreatmentIDidx]
treat.flg <- dat$TreatmentID == TreatmentID
match <- dat[treat.flg &
dat$Rank == min(setdiff(dat$Rank[treat.flg],
matches$Rank[matches$TreatmentID==
TreatmentIDs[TreatmentIDidx-1]])),]
matches <- rbind(matches, match)
}
which gives the desired result:
> matches
PatientName BVAID Rank TreatmentCode TreatmentID DoseID
1 Tim Stuart BVA-027 3 OP_TBC 1 1
12 Tim Stuart BVA-041 4 OP_TC 2 1
22 Tim Stuart BVA-001 16 OP_SICO 3 1
My SAS
is rusty, and I don't have a copy to try things with right now, so I'll leave it to someone else to make a SAS
solution to compare with.
Upvotes: 3