Anirban

Reputation: 271

Medians by group in R

In the code below, I was wondering if anyone would have any insight on how to use plyr or data.table to elegantly compute med.val2.

library(plyr)

Set up the example data

data <- data.frame(id1 = 1:20, id2 = rep(letters[1:4], 5), vals=rnorm(20))

First, calculate the group medians with a loop

data$med.val <- rep(0, 20)
for (ind in 1:20) data$med.val[ind] <- median(data$vals[data$id2==data$id2[ind]])

Now using plyr

data <- ddply(data, .(id2), mutate, med.val.plyr=median(vals))

Should be equal

all.equal(data$med.val, data$med.val.plyr)

Median of rows with the same id2, excluding the focal row

# Median of values corresponding to 
# data$id1!=data$id1[ind] & data$id2==data$id2[ind]
data$med.val2 <- rep(0, 20)
for (ind in 1:20) data$med.val2[ind] <- median(data$vals[data$id1!=data$id1[ind] & data$id2==data$id2[ind]])

In R, I typically use plyr or data.table to calculate values by group efficiently. My actual data and functions are more complicated but identical in structure: I need to apply a function to the rows that share an identifier with the focal row, excluding the focal row itself, and I cannot figure out an efficient, elegant way to do that.
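In other words, the pattern is the loop above with median swapped for an arbitrary function, sketched below (f and the out column are placeholders, not part of my real code):

f <- median  # stand-in for an arbitrary summary function

data$out <- rep(NA_real_, nrow(data))  # 'out' is a hypothetical result column
for (ind in 1:nrow(data)) {
  peers <- data$vals[data$id2 == data$id2[ind] & data$id1 != data$id1[ind]]
  data$out[ind] <- f(peers)
}
# with f = median this reproduces med.val2 above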

Upvotes: 1

Views: 1489

Answers (2)

mrip

Reputation: 15163

One possible solution with data.table:

library(data.table)

dt <- data.table(data)
# within each id2 group, take each row's median of the other rows' vals
dt[, med.val3 := sapply(.SD$id1, function(x) median(.SD[id1 != x, vals])), by = id2]
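As a quick sanity check against the loop result from the question (assuming dt was built from the question's data as above, so the row order is unchanged):

all.equal(dt$med.val2, dt$med.val3)  # TRUE if dt came from the question's data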

On edit: this solution, along with the ones by @shadow, is relatively concise and elegant, and about as efficient as you can hope for from this kind of approach. However, computing any leave-one-out statistic this way is an O(n^2) operation (or worse) unless you can code up a more efficient implementation. For things like the average and the median this is pretty straightforward, for example:

looMedian <- function(x){
  # Key trick: for x[i] below the overall median, median(x[-i]) equals
  # the median of x with one extra value appended above the maximum
  # (and symmetrically for x[i] above the median), so two medians of
  # n+1 values cover every row except exact ties with the median.
  rng <- range(x)
  bigMedian <- median(c(x, rng[2] + 1))
  smallMedian <- median(c(x, rng[1] - 1))
  med <- median(x)
  ret <- ifelse(x < med, bigMedian, smallMedian)
  # Ties with the median all leave the same multiset behind, so one
  # explicit leave-one-out median handles all of them.
  wm <- which(x == med)
  if (length(wm) == 0)
    return(ret)
  ret[wm] <- median(x[-wm[1]])
  ret
}

This is much more efficient than the naive solution:

looMedianSlow<-function(x){
  sapply(seq_along(x),function(z) median(x[-z]))
}


> xx<-rnorm(100)
> all.equal(looMedianSlow(xx),looMedian(xx))
[1] TRUE
> xx<-rnorm(101)
> all.equal(looMedianSlow(xx),looMedian(xx))
[1] TRUE
> library(microbenchmark)
> microbenchmark(looMedianSlow(xx),looMedian(xx))
Unit: microseconds
              expr      min       lq    median        uq       max neval
 looMedianSlow(xx) 5174.193 5264.951 5308.5075 5398.6950 44771.062   100
     looMedian(xx)  241.462  248.513  260.0685  278.3615  3495.796   100
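The fast version also drops straight into the grouped data.table call from earlier (a sketch assuming the same dt as above; med.val4 is a hypothetical column name):

dt[, med.val4 := looMedian(vals), by = id2]  # med.val4: hypothetical column
all.equal(dt$med.val2, dt$med.val4)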

Whether something like this is possible in your case will depend on what statistic you are trying to compute.
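For the mean, for instance, an O(n) closed form exists (a minimal sketch; looMean is a hypothetical helper, not from the answer):

looMean <- function(x) (sum(x) - x) / (length(x) - 1)  # leave-one-out means

xx <- rnorm(100)
all.equal(sapply(seq_along(xx), function(z) mean(xx[-z])), looMean(xx))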

Upvotes: 2

shadow

Reputation: 22343

I would go for an auxiliary function and then use the same by-group method as for the plain median.

med2 <- function(x) sapply(seq_along(x), function(ind) median(x[-ind]))
data <- ddply(data, .(id2), mutate, med.val2.plyr=med2(vals))
all.equal(data$med.val2, data$med.val2.plyr)

Or the data.table way of doing the same:

dt <- data.table(data, key="id2")
med2 <- function(x) sapply(seq_along(x), function(ind) median(x[-ind]))
dt[, med.val2.dt:=med2(vals), by=id2]
all.equal(dt$med.val2, dt$med.val2.dt)

Upvotes: 1
