Reputation: 7839
Let's say `a` and `b` are two data frames. The goal is to write a function `f(a, b)` that produces a merged data frame, in the same way as `merge(a, b, all=TRUE)` would, that is, filling missing variables in `a` or `b` with NAs. (The problem is that `merge()` appears to be very slow.)
This can be done as follows (pseudo-code):
for each variable `var` found in either `a` or `b`, do:
    unlist(list(a.srcvar, b.srcvar), recursive=FALSE, use.names=FALSE)
where:
    x.srcvar is x$var if x$var exists, or else
    rep(NA, nrow(x)) if y$var is not a factor, or else
    as.factor(rep(NA, nrow(x)))
and then wrap everything in a data frame.
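As a side note, the concatenation trick above works because `unlist()` treats a list made up entirely of factors specially: the result is a single factor whose levels are the union of the elements' levels. That is why the NA fill must itself be a factor whenever the source column is one. A minimal illustration (my own, not part of the original post):
f1 <- factor(c("a", "b"))
f2 <- factor(c("b", "c"))
# combines into one factor: values a b b c, levels a b c
unlist(list(f1, f2), recursive=FALSE, use.names=FALSE)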
Here's a "naive" implementation:
merge.datasets1 <- function(a, b) {
  # pre-built NA fill columns, one plain and one factor, for each frame
  a.fill <- rep(NA, nrow(a))
  b.fill <- rep(NA, nrow(b))
  a.fill.factor <- as.factor(a.fill)
  b.fill.factor <- as.factor(b.fill)
  out <- list()
  for (v in union(names(a), names(b))) {
    if (!v %in% names(a)) {
      # variable only in b: fill a's side with NAs of the matching type
      b.srcvar <- b[[v]]
      if (is.factor(b.srcvar))
        a.srcvar <- a.fill.factor
      else
        a.srcvar <- a.fill
    } else {
      a.srcvar <- a[[v]]
      if (v %in% names(b))
        b.srcvar <- b[[v]]
      else if (is.factor(a.srcvar))
        b.srcvar <- b.fill.factor
      else
        b.srcvar <- b.fill
    }
    out[[v]] <- unlist(list(a.srcvar, b.srcvar),
                       recursive=FALSE, use.names=FALSE)
  }
  data.frame(out)
}
Here's a different implementation that uses "vectorized" functions:
merge.datasets2 <- function(a, b) {
  # for each variable, record which column (or NA fill) to pull from a and b
  srcvar <- within(list(var=union(names(a), names(b))), {
    a.exists <- var %in% names(a)
    b.exists <- var %in% names(b)
    a.isfactor <- unlist(lapply(var, function(v) is.factor(a[[v]])))
    b.isfactor <- unlist(lapply(var, function(v) is.factor(b[[v]])))
    a <- ifelse(a.exists, var, ifelse(b.isfactor, 'fill.factor', 'fill'))
    b <- ifelse(b.exists, var, ifelse(a.isfactor, 'fill.factor', 'fill'))
  })
  # add the NA fill columns to both frames so they can be looked up by name
  a <- within(a, {
    fill <- NA
    fill.factor <- factor(fill)
  })
  b <- within(b, {
    fill <- NA
    fill.factor <- factor(fill)
  })
  out <- mapply(function(x, y) unlist(list(a[[x]], b[[y]]),
                                      recursive=FALSE, use.names=FALSE),
                srcvar$a, srcvar$b, SIMPLIFY=FALSE, USE.NAMES=FALSE)
  out <- data.frame(out)
  names(out) <- srcvar$var
  out
}
Now we can test:
sample.datasets <- lapply(1:50, function(i) iris[,sample(names(iris), 4)])
system.time(invisible(Reduce(merge.datasets1, sample.datasets)))
>> user system elapsed
>> 0.192 0.000 0.190
system.time(invisible(Reduce(merge.datasets2, sample.datasets)))
>> user system elapsed
>> 2.292 0.000 2.293
So, the naive version is about an order of magnitude faster than the other. How can this be? I always thought that `for` loops are slow and that one should rather use `lapply` and friends and steer clear of loops in R. I would welcome any ideas on how to improve my function in terms of speed.
Upvotes: 0
Views: 1938
Reputation: 115435
In fact, you are not trying to replicate `merge(a, b, all = TRUE)` at all, as you are not trying to merge on any of the columns. Instead you are simply stacking the data, filling with NA where a column does not exist.
# note that this is not what you want
dim(merge(sample.datasets[[1]], sample.datasets[[2]], all = TRUE))
[1] 314 5
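For contrast (a quick check added here, not in the original answer), stacking the same two frames with the question's `merge.datasets1` keeps all 150 + 150 rows, with a column for every name occurring in either frame and NA where a column is absent:
dim(merge.datasets1(sample.datasets[[1]], sample.datasets[[2]]))
# 300 rows; the column count depends on which iris columns were sampled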
The reason `merge(a, b, all = TRUE)` will be slow is that it defaults to merging by the intersection of the names. If you convert to `data.table`s then the `merge.data.table` method is lightning fast, but with your test data it would create an exponentially growing dataset on each successive merge (not the 7500 by 5 you want your results to be).
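To make that concrete, here is a sketch (my addition, not from the original answer; it assumes the data.table package is available) of such a merge. Each call is quick, but on these overlapping iris columns it reproduces the same row explosion seen above rather than a simple stack:
library(data.table)
a <- as.data.table(sample.datasets[[1]])
b <- as.data.table(sample.datasets[[2]])
# merging on the shared columns gives many-to-many matches, not a 300-row stack
dim(merge(a, b, by = intersect(names(a), names(b)), all = TRUE, allow.cartesian = TRUE))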
An easy solution is to use `rbind.fill` from the plyr package.
library(plyr)
system.time({.x <- Reduce(rbind.fill, sample.datasets)})
## user system elapsed
## 0.16 0.00 0.15
# which is almost identical to
system.time(.old <- Reduce(merge.datasets1, sample.datasets))
## user system elapsed
## 0.14 0.00 0.14
On further consideration, it is really useful to note that you can pass a list of data.frames to `rbind.fill`, so:
system.time(super_fast <- rbind.fill(sample.datasets))
## user system elapsed
## 0.02 0.00 0.02
identical(super_fast, .old)
[1] TRUE
The majority of the time is spent in the overhead of `Reduce`, which `rbind.fill` does not require.
Upvotes: 3