Reputation: 271
In the code below, does anyone have insight on how to use plyr or data.table to elegantly compute med.val2?
library(plyr)
Set up the example data:
data <- data.frame(id1 = 1:20, id2 = rep(letters[1:4], 5), vals=rnorm(20))
First, calculate group medians with a loop:
data$med.val <- rep(0, 20)
for (ind in 1:20) data$med.val[ind] <- median(data$vals[data$id2==data$id2[ind]])
Now using plyr
data <- ddply(data, .(id2), mutate, med.val.plyr=median(vals))
Should be equal
all.equal(data$med.val, data$med.val.plyr)
Median of rows with the same id2, excluding the focal row
# Median of values corresponding to
# data$id1!=data$id1[ind] & data$id2==data$id2[ind]
data$med.val2 <- rep(0, 20)
for (ind in 1:20) data$med.val2[ind] <- median(data$vals[data$id1!=data$id1[ind] & data$id2==data$id2[ind]])
In R, I typically use plyr or data.table to calculate values by group efficiently. My actual data and functions are more complicated, but the structure is identical: I need to apply a function to the rows sharing an identifier, excluding the focal row, and I cannot figure out a way to do that efficiently and elegantly.
Upvotes: 1
Views: 1489
Reputation: 15163
One possible solution with data.table:
library(data.table)
dt = data.table(data)
dt[,med.val3 := sapply(.SD$id1, function(x) median(.SD[id1!=x,vals])), by=id2]
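As a quick sanity check (assuming data as built in the question, so it already carries the med.val2 column from the loop):
all.equal(dt$med.val2, dt$med.val3)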
On edit: this solution and the ones by @shadow are relatively concise and elegant. They are also going to be about as efficient as you can hope for from this kind of approach. However, computing any leave-one-out statistic is an O(n^2) operation (or worse) unless you can code up a more efficient implementation. For statistics like the mean and median, this is pretty straightforward; for example:
looMedian <- function(x) {
  # Fast leave-one-out medians. Dropping any value below the median
  # yields the same result, which equals the median of x with one extra
  # element appended above the maximum (and symmetrically, dropping any
  # value above the median equals the median of x with an extra element
  # appended below the minimum).
  rng <- range(x)
  bigMedian <- median(c(x, rng[2] + 1))    # result when a below-median value is dropped
  smallMedian <- median(c(x, rng[1] - 1))  # result when an above-median value is dropped
  med <- median(x)
  ret <- ifelse(x < med, bigMedian, smallMedian)
  # Values exactly equal to the median need an explicit computation;
  # dropping any one of the tied values gives the same result.
  wm <- which(x == med)
  if (length(wm) == 0)
    return(ret)
  ret[wm] <- median(x[-wm[1]])
  ret
}
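The leave-one-out mean mentioned above is even simpler; a minimal sketch (looMean is a name introduced here for illustration, not from the original answer):
looMean <- function(x) {
  # Drop each element from the group sum and rescale by the reduced count
  (sum(x) - x) / (length(x) - 1)
}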
looMedian is much more efficient than the naive solution (timings below use the microbenchmark package):
looMedianSlow <- function(x) {
  # Naive O(n^2) approach: recompute the median with each element removed
  sapply(seq_along(x), function(z) median(x[-z]))
}
> xx<-rnorm(100)
> all.equal(looMedianSlow(xx),looMedian(xx))
[1] TRUE
> xx<-rnorm(101)
> all.equal(looMedianSlow(xx),looMedian(xx))
[1] TRUE
> microbenchmark(looMedianSlow(xx),looMedian(xx))
Unit: microseconds
              expr      min       lq    median        uq       max neval
 looMedianSlow(xx) 5174.193 5264.951 5308.5075 5398.6950 44771.062   100
     looMedian(xx)  241.462  248.513  260.0685  278.3615  3495.796   100
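To plug this into the grouped setting from the question, a sketch (reusing the dt built above, which carries med.val2 from the question's loop):
dt[, med.val4 := looMedian(vals), by = id2]
all.equal(dt$med.val2, dt$med.val4)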
Whether something like this is possible in your case will depend on what statistic you are trying to compute.
Upvotes: 2
Reputation: 22343
I would go for an auxiliary function and then use the same method as with median.
med2 <- function(x) sapply(seq_along(x), function(ind) median(x[-ind]))
data <- ddply(data, .(id2), mutate, med.val2.plyr=med2(vals))
all.equal(data$med.val2, data$med.val2.plyr)
Or the data.table way of doing the same:
library(data.table)
dt <- data.table(data, key="id2")
med2 <- function(x) sapply(seq_along(x), function(ind) median(x[-ind]))
dt[, med.val2.dt:=med2(vals), by=id2]
all.equal(dt$med.val2, dt$med.val2.dt)
Upvotes: 1