jbryer

Reputation: 1837

Is there an efficient way of comparing two data frames?

I have two data frames with different numbers of rows but the same number of columns. In the example below, data frame 1 is 4 x 2 and data frame 2 is 3 x 2. I need a 4 x 3 logical matrix where TRUE indicates that the corresponding rows of the two data frames match. The example below works, but it takes a very long time to run with larger data frames (I'm testing with two data frames of about 5,000 rows each, still with just two columns). Is there a more efficient way of doing this?

> df1 <- data.frame(row.names=1:4, var1=c(TRUE, TRUE, FALSE, FALSE), var2=c(1,2,3,4))
> df2 <- data.frame(row.names=5:7, var1=c(FALSE, TRUE, FALSE), var2=c(5,2,3))
> 
> m1 <- t(as.matrix(df1))
> m2 <- as.matrix(df2)
> 
> apply(m2, 1, FUN=function(x) { apply(m1, 2, FUN=function(y) { all(x==y) } ) })
      5     6     7
1 FALSE FALSE FALSE
2 FALSE  TRUE FALSE
3 FALSE FALSE  TRUE
4 FALSE FALSE FALSE

Thanks in advance for any help.

Upvotes: 3

Views: 2515

Answers (3)

Frank

Reputation: 66819

I suspect that the optimal solution depends on how many unique rows and how many total rows you have.

For the example on your blog, where there are 1000-1500 rows but only 20 unique values (for the seed you set there), I think it's faster to do this:

  1. assign ids to each unique row and then
  2. run outer on the vector of ids seen in each data.frame.

Here's the performance I got. @flodel's approach does about the same on my computer; it's the third one below. Disclaimer: I don't know much about running these kinds of tests.

> set.seed(2112)
> df1 <- data.frame(row.names=1:1000, 
+   var1=sample(c(TRUE,FALSE), 1000, replace=TRUE), 
+   var2=sample(1:10, 1000, replace=TRUE) )
> df2 <- data.frame(row.names=1001:2500, 
+   var1=sample(c(TRUE,FALSE), 1500, replace=TRUE),
+   var2=sample(1:10, 1500, replace=TRUE))
> 
> # candidate method on blog  
> system.time({
+  df1$var3 <- apply(df1, 1, paste, collapse='.')
+  df2$var3 <- apply(df2, 1, paste, collapse='.')
+  df6 <- sapply(df2$var3, FUN=function(x) { x == df1$var3 })
+  dimnames(df6) <- list(row.names(df1), row.names(df2))
+ })  
   user  system elapsed 
   1.13    0.00    1.14 
> 
> rownames(df1) <- NULL # in case something weird happens to rownames on merge
> rownames(df2) <- NULL
> # id method  
> system.time({  
+ df12 <- unique(rbind(df1,df2))
+ df12$id <- rownames(df12)
+ 
+ id1 <- merge(df12,df1)$id
+ id2 <- merge(df12,df2)$id
+ 
+ x <- outer(id1,id2,`==`)
+ })
   user  system elapsed 
   0.11    0.02    0.13 
> 
> library(fields)
> # rdist from fields method
> system.time({  
+ mat1 <- as.matrix(sapply(df1, as.integer))
+ mat2 <- as.matrix(sapply(df2, as.integer))
+ rdist(mat1, mat2) < 1e-9
+ })
   user  system elapsed 
   0.15    0.00    0.16 

I'd guess the rbind and the two merge calls are what would make this solution relatively more costly on data with more unique rows.
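
If the merges turn out to be the bottleneck, here's a rough, untimed sketch of the same id idea built with match() on a pasted key instead of merge; treat it as an illustration rather than a measured improvement:

key1 <- do.call(paste, c(df1, sep = "\r"))  # one key string per row of df1
key2 <- do.call(paste, c(df2, sep = "\r"))
keys <- unique(c(key1, key2))               # every distinct row gets one id
id1  <- match(key1, keys)
id2  <- match(key2, keys)
x    <- outer(id1, id2, `==`)               # same logical matrix as above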

Upvotes: 0

flodel

Reputation: 89057

I was drawn here by your post on R-bloggers: http://jason.bryer.org/posts/2013-01-24/Comparing_Two_Data_Frames.html

If, like you say, your data has no numeric vectors, then I think I can suggest a faster approach. It consists of two steps:

  1. turn your two data.frames into two matrices of integers
  2. compute the Euclidean distance between the rows of your two data sets.

Quick example using your data:

mat1 <- as.matrix(sapply(df1, as.integer))
mat2 <- as.matrix(sapply(df2, as.integer))
library(fields)
rdist(mat1, mat2) < 1e-9
#       [,1]  [,2]  [,3]
# [1,] FALSE FALSE FALSE
# [2,] FALSE  TRUE FALSE
# [3,] FALSE FALSE  TRUE
# [4,] FALSE FALSE FALSE

A few comments:

  1. if your data contained vectors of characters, you would have to convert them into factors and make sure that they share the same factor levels (see the sketch after this list).
  2. I used the fields package to compute the Euclidean distance. It uses a Fortran implementation and is as far as I know the fastest R package around for the task (and I have tested many, trust me.)
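
For point 1, a minimal sketch of what I mean, using a hypothetical character column grp present in both data frames:

# pool the levels from both data frames before converting to integer codes
lev <- sort(unique(c(df1$grp, df2$grp)))   # grp is a made-up example column
df1$grp <- factor(df1$grp, levels = lev)
df2$grp <- factor(df2$grp, levels = lev)
# as.integer(df1$grp) and as.integer(df2$grp) now use comparable codes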

Upvotes: 1

joran

Reputation: 173577

I'm honestly not sure if this will be faster, but you might try:

> foo <- Vectorize(function(x,y) {all(df1[x,] == df2[y,])})
> outer(1:4,1:3,FUN = foo)
      [,1]  [,2]  [,3]
[1,] FALSE FALSE FALSE
[2,] FALSE  TRUE FALSE
[3,] FALSE FALSE  TRUE
[4,] FALSE FALSE FALSE

I feel compelled to at least mention the danger in using == for comparisons as opposed to all.equal or identical. I'm presuming that you're comfortable enough with the data types involved that this won't be a problem.
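
For example, the usual floating point caveat (nothing to do with your data specifically, just an illustration):

> 0.1 + 0.2 == 0.3                   # exact comparison trips on representation error
[1] FALSE
> isTRUE(all.equal(0.1 + 0.2, 0.3))  # tolerance-based comparison
[1] TRUE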

Upvotes: 0
