Gui Vaccaro

Reputation: 23

R - efficient comparison of subsets of rows between data frames

Thank you for any help.

I need to count the total number of matches between the elements of each row of a data frame (df1) and the rows of another data frame (df2).

The data frames have different numbers of columns (5 in the first one versus 6 in the second one, for instance), and there is no exact formation rule for the rows, so I cannot find a way of doing this through combinatorial analysis.

The routine must check every row of the first data frame against every row of the second data frame, producing the total number of occurrences for a given number of hits.

Not all possible totals are of interest; I am only looking for a specific one (which I call "hits" in this text).

In other words: for each row of df2, how many rows of df1 share exactly "hits" elements with it.

Here is an example:

> ### Example
> ### df1 and df2 here are regularly formed just for illustration purposes
>  
> require(combinat)
> 
> df1 <- as.data.frame(t(combn(6,5)))
> df2 <- as.data.frame(t(combn(7,6)))
> 
> df1
  V1 V2 V3 V4 V5
1  1  2  3  4  5
2  1  2  3  4  6
3  1  2  3  5  6
4  1  2  4  5  6
5  1  3  4  5  6
6  2  3  4  5  6
> 
> df2
  V1 V2 V3 V4 V5 V6
1  1  2  3  4  5  6
2  1  2  3  4  5  7
3  1  2  3  4  6  7
4  1  2  3  5  6  7
5  1  2  4  5  6  7
6  1  3  4  5  6  7
7  2  3  4  5  6  7
> 

In this example, note that subsets of size 5 from row #1 of df2 can be found 6 times in the rows of df1 (every row of df1 is a 5-element subset of row #1), and so on.
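That count can be verified directly for row #1 (a quick sanity check in the same console style; unlist() just turns the data frame row into a plain vector):

> ### Sanity check: rows of df1 sharing exactly 5 elements with row #1 of df2
> sum(apply(df1, 1, function(x) sum(x %in% unlist(df2[1, ]))) == 5)
[1] 6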

I tried something like this:

> ### Check how many times subsets of size "hits" from rows from df2 are found in rows of df1
> 
> myfn <- function(dfa,dfb,hits) {
+       sapply(c(1:dim(dfb)[1]),function(y) { sum(c(apply(dfa,1,function(x,i) { sum(x %in% dfb[i,]) },i=y))==hits) })
+   }
>   
> r1 <- myfn(df1,df2,5)
> 
> cbind(df2,"hits.eq.5" = r1)
  V1 V2 V3 V4 V5 V6 hits.eq.5
1  1  2  3  4  5  6         6
2  1  2  3  4  5  7         1
3  1  2  3  4  6  7         1
4  1  2  3  5  6  7         1
5  1  2  4  5  6  7         1
6  1  3  4  5  6  7         1
7  2  3  4  5  6  7         1

This seems to do what I need, but it is too slow! I need to use this routine on large data frames (about 200K rows).

I am currently using R 3.1.2 GUI 1.65 Mavericks build (6833).

Can anyone suggest a faster or more clever way of doing this? Thank you again.

Best regards, Vaccaro

Upvotes: 2

Views: 786

Answers (1)

jlhoward

Reputation: 59425

Using apply(...) on data frames is very inefficient, because apply(...) takes a matrix as its argument: if you pass a data frame, it is coerced to a matrix first. In your example you convert df1 to a matrix on every call to apply(...), which is nrow(df2) times.

Also, by using sapply(1:nrow(df2),...) and dfb[i,] you are using data frame row indexing, which is also very inefficient. You are much better off converting everything to matrix class at the beginning and then using apply(...) twice.

Finally, there is no reason to use a call to c(...). apply(...) already returns a vector (in this case), so you are just incurring the overhead of another function call to no effect.
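Both costs are easy to see in isolation (a minimal timing sketch; the object names here are demo-only and exact timings will vary by machine):

df <- data.frame(matrix(runif(500), ncol=5))
m <- as.matrix(df)

# apply() re-coerces the data frame to a matrix on every call
system.time(for (i in 1:1000) apply(df, 1, sum))
system.time(for (i in 1:1000) apply(m, 1, sum))

# data frame row extraction vs. matrix row extraction
system.time(for (i in 1:10000) df[(i %% 100) + 1, ])
system.time(for (i in 1:10000) m[(i %% 100) + 1, ])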

Doing these things alone speeds up your code by about a factor of 20.

set.seed(1)
nrows <- 100
df1 <- data.frame(matrix(sample(1:5,5*nrows,replace=TRUE),ncol=5))
df2 <- data.frame(matrix(sample(1:6,6*nrows,replace=TRUE),ncol=6))

myfn <- function(dfa,dfb,hits) {
  sapply(c(1:dim(dfb)[1]),function(y) { sum(c(apply(dfa,1,function(x,i) { sum(x %in% dfb[i,]) },i=y))==hits) })
}
myfn.2 <- function(dfa,dfb,hits) {
  ma <- as.matrix(dfa)
  mb <- as.matrix(dfb)
  apply(mb,1,function(y) { sum(apply(ma,1,function(x) { sum(x %in% y) })==hits) })
}

system.time(r1<-myfn(df1,df2,3))
#    user  system elapsed 
#    1.99    0.00    2.00 
system.time(r2<-myfn.2(df1,df2,3))
#    user  system elapsed 
#    0.09    0.00    0.10 
identical(r1,r2)
# [1] TRUE

There is another approach which takes advantage of the fact that R is extremely efficient at manipulating lists. Since a data frame is just a list of vectors, we can improve performance by putting your rows into data frame columns and then using sapply(...) on that. This is faster than myfn.2(...) above, but only by about 20%.

myfn.3 <-function(dfa,dfb,hits) {
  df1.t <- data.frame(t(dfa))   # rows into columns
  df2.t <- data.frame(t(dfb))
  sapply(df2.t,function(col2)sum(sapply(df1.t,function(col1)sum(col1 %in% col2)==hits)))
}
library(microbenchmark)
microbenchmark(myfn.2(df1,df2,5),myfn.3(df1,df2,5),times=10)
# Unit: milliseconds
#                 expr      min       lq   median       uq      max neval
#  myfn.2(df1, df2, 5) 92.84713 94.06418 96.41835 98.44738 99.88179    10
#  myfn.3(df1, df2, 5) 75.53468 77.44348 79.24123 82.28033 84.12457    10

If you really have data frames with hundreds of thousands of rows, then I think you need to rethink this problem. I have no idea what you are trying to accomplish, but comparing every row against every row (200K × 200K is 40 billion pairs) is a brute force approach.
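If the entries really are small positive integers, as in your example, one way to avoid the row-by-row scan entirely is to encode each row as a 0/1 indicator vector over the value range and let a single matrix product count the shared values for every pair of rows at once. A sketch under those assumptions (myfn.4 is just an illustrative name, not tested against your real data):

myfn.4 <- function(dfa, dfb, hits) {
  vals <- max(unlist(dfa), unlist(dfb))   # assumes positive integer entries
  ind <- function(df) {
    # 0/1 matrix: m[i, v] == 1 iff value v occurs in row i of df
    m <- matrix(0L, nrow(df), vals)
    m[cbind(rep(seq_len(nrow(df)), ncol(df)), unlist(df))] <- 1L
    m
  }
  # entry (i, j) = number of values row i of dfa shares with row j of dfb
  shared <- tcrossprod(ind(dfa), ind(dfb))
  colSums(shared == hits)
}

This counts distinct shared values, so it assumes no repeated values within a row; for a wide value range the dense indicator matrices become wasteful, and Matrix::sparseMatrix would be the sparse analogue of the same idea.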

Upvotes: 1
