Ernest A

Reputation: 7839

A faster implementation of merge.data.frame() in R

Let's say a and b are two data frames. The goal is to write a function f(a,b) that produces a merged data frame, in the same way as merge(a, b, all=TRUE) would, that is, filling variables missing from a or b with NAs. (The problem is that merge() appears to be very slow.)

This can be done as follows (pseudo-code):

for each variable `var` found in either `a` or `b`, do:
    unlist(list(a.srcvar, b.srcvar), recursive=FALSE, use.names=FALSE)

where:
x.srcvar is x$var if x$var exists, or else
            rep(NA, nrow(x)) if y$var is not a factor, or else
            as.factor(rep(NA, nrow(x)))

and then wrap everything in a data frame.
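
For concreteness, here is a hypothetical pair of inputs (the column names x, y, z are made up for illustration) and the output the above specification calls for:

a <- data.frame(x = 1:2, y = c(10, 20))
b <- data.frame(x = 3:4, z = c(TRUE, FALSE))
# f(a, b) should stack the rows, filling absent columns with NA:
#   x  y     z
# 1 1 10    NA
# 2 2 20    NA
# 3 3 NA  TRUE
# 4 4 NA FALSE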

Here's a "naive" implementation:

merge.datasets1 <- function(a, b) {
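  # NA fill vectors (plain and factor), reused for every column missing from a or b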
  a.fill <- rep(NA, nrow(a))
  b.fill <- rep(NA, nrow(b))
  a.fill.factor <- as.factor(a.fill)
  b.fill.factor <- as.factor(b.fill)
  out <- list()
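  # walk over every column name appearing in either input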
  for (v in union(names(a), names(b))) {
    if (!v %in% names(a)) {
      b.srcvar <- b[[v]]
      if (is.factor(b.srcvar))
        a.srcvar <- a.fill.factor
      else
        a.srcvar <- a.fill
    } else {
      a.srcvar <- a[[v]]
      if (v %in% names(b))
        b.srcvar <- b[[v]]
      else if (is.factor(a.srcvar))
        b.srcvar <- b.fill.factor
      else
        b.srcvar <- b.fill
    }
    out[[v]] <- unlist(list(a.srcvar, b.srcvar),
                       recursive=FALSE, use.names=FALSE)
  }
  data.frame(out)
}

Here's a different implementation that uses "vectorized" functions:

merge.datasets2 <- function(a, b) {
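  # build a lookup: for each output column, which column to take from a and
  # from b ('fill' or 'fill.factor' when the column is absent)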
  srcvar <- within(list(var=union(names(a), names(b))), {
    a.exists <- var %in% names(a)
    b.exists <- var %in% names(b)
    a.isfactor <- unlist(lapply(var, function(v) is.factor(a[[v]])))
    b.isfactor <- unlist(lapply(var, function(v) is.factor(b[[v]])))
    a <- ifelse(a.exists, var, ifelse(b.isfactor, 'fill.factor', 'fill'))
    b <- ifelse(b.exists, var, ifelse(a.isfactor, 'fill.factor', 'fill'))
  })
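  # attach the NA fill columns to each input so they can be looked up by name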
  a <- within(a, {
    fill <- NA
    fill.factor <- factor(fill)
  })
  b <- within(b, {
    fill <- NA
    fill.factor <- factor(fill)
  })
  out <- mapply(function(x,y) unlist(list(a[[x]], b[[y]]),
                                     recursive=FALSE, use.names=FALSE),
                srcvar$a, srcvar$b, SIMPLIFY=FALSE, USE.NAMES=FALSE)
  out <- data.frame(out)
  names(out) <- srcvar$var
  out
}

Now we can test:

sample.datasets <- lapply(1:50, function(i) iris[,sample(names(iris), 4)])

system.time(invisible(Reduce(merge.datasets1, sample.datasets)))
>>   user  system elapsed 
>>  0.192   0.000   0.190 
system.time(invisible(Reduce(merge.datasets2, sample.datasets)))
>>   user  system elapsed 
>>  2.292   0.000   2.293 

So the naive version is an order of magnitude faster than the vectorized one. How can this be? I always thought that for loops were slow and that one should use lapply and friends and steer clear of explicit loops in R. I would welcome any ideas on how to make my function faster.

Upvotes: 0

Views: 1938

Answers (1)

mnel

Reputation: 115435

In fact, you are not trying to replicate merge(a, b, all = TRUE) at all, as you are not merging on any of the columns. Instead you are simply stacking the data, filling with NA where a column does not exist.

# note that this is not what you want
dim(merge(sample.datasets[[1]], sample.datasets[[2]], all = TRUE))
## [1] 314   5

The reason merge(a, b, all = TRUE) is slow is that it defaults to merging on the intersection of the column names. If you convert to data.tables, the merge.data.table method is lightning fast, but with your test data it would create an exponentially growing dataset on each successive merge (not the 7500 by 5 you want your result to be).
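
For reference, merge.data.frame's documented default is by = intersect(names(x), names(y)), so the following sketch (using the first two sample datasets) just makes that default explicit:

a <- sample.datasets[[1]]
b <- sample.datasets[[2]]
# spelling out the default join columns reproduces merge(a, b, all = TRUE)
identical(merge(a, b, all = TRUE),
          merge(a, b, by = intersect(names(a), names(b)), all = TRUE))
# TRUE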

An easy solution is to use rbind.fill from the plyr package.

library(plyr)
system.time({.x <- Reduce(rbind.fill, sample.datasets)})
## user  system elapsed 
## 0.16    0.00    0.15 
# which is almost identical to
system.time(.old <- Reduce(merge.datasets1, sample.datasets))
##   user  system elapsed 
##   0.14    0.00    0.14 

EDIT 2-11-2012

On further consideration, it is worth noting that you can pass a list of data.frames to rbind.fill, so:

system.time(super_fast <- rbind.fill(sample.datasets))
##  user  system elapsed 
##  0.02    0.00    0.02 

identical(super_fast, .old)
## [1] TRUE

The majority of the time is spent in the overhead of Reduce, which the single rbind.fill call does not incur.
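
To see where that overhead comes from, here is a sketch (counted is a hypothetical wrapper, not part of plyr) that counts the pairwise calls Reduce makes:

n_calls <- 0
counted <- function(x, y) { n_calls <<- n_calls + 1; rbind.fill(x, y) }
invisible(Reduce(counted, sample.datasets))
n_calls
# 49 pairwise calls, each one re-copying the accumulated result so far,
# versus a single pass over the whole list for rbind.fill(sample.datasets)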

Upvotes: 3
