Reputation: 1837
I have two data frames with different numbers of rows but the same number of columns. In the example below, data frame 1 is 4 x 2 and data frame 2 is 3 x 2. I need a 4 x 3 logical matrix where TRUE indicates that the corresponding rows of the two data frames match. This example works, but it takes a very long time to run with larger data frames (I'm trying two data frames with about 5,000 rows each, still just two columns). Is there a more efficient way of doing this?
> df1 <- data.frame(row.names=1:4, var1=c(TRUE, TRUE, FALSE, FALSE), var2=c(1,2,3,4))
> df2 <- data.frame(row.names=5:7, var1=c(FALSE, TRUE, FALSE), var2=c(5,2,3))
>
> m1 <- t(as.matrix(df1))
> m2 <- as.matrix(df2)
>
> apply(m2, 1, FUN=function(x) { apply(m1, 2, FUN=function(y) { all(x==y) } ) })
5 6 7
1 FALSE FALSE FALSE
2 FALSE TRUE FALSE
3 FALSE FALSE TRUE
4 FALSE FALSE FALSE
Thanks in advance for any help.
Upvotes: 3
Views: 2515
Reputation: 66819
I suspect that the optimal solution depends on how many unique rows and how many total rows you have.
For the example on your blog, where there are 1000-1500 rows but only 20 unique values (for the seed you set there), I think it's faster to give each unique row an id and compare the ids, as in the "id method" timed below.
Here's the performance I got. @flodel's approach does about the same on my computer; it's the third one below. Disclaimer: I don't know much about running these kinds of tests.
> set.seed(2112)
> df1 <- data.frame(row.names=1:1000,
+ var1=sample(c(TRUE,FALSE), 1000, replace=TRUE),
+ var2=sample(1:10, 1000, replace=TRUE) )
> df2 <- data.frame(row.names=1001:2500,
+ var1=sample(c(TRUE,FALSE), 1500, replace=TRUE),
+ var2=sample(1:10, 1500, replace=TRUE))
>
> # candidate method on blog
> system.time({
+ df1$var3 <- apply(df1, 1, paste, collapse='.')
+ df2$var3 <- apply(df2, 1, paste, collapse='.')
+ df6 <- sapply(df2$var3, FUN=function(x) { x == df1$var3 })
+ dimnames(df6) <- list(row.names(df1), row.names(df2))
+ })
user system elapsed
1.13 0.00 1.14
>
> rownames(df1) <- NULL # in case something weird happens to rownames on merge
> rownames(df2) <- NULL
> # id method
> system.time({
+ df12 <- unique(rbind(df1,df2))
+ df12$id <- rownames(df12)
+
+ id1 <- merge(df12,df1)$id
+ id2 <- merge(df12,df2)$id
+
+ x <- outer(id1,id2,`==`)
+ })
user system elapsed
0.11 0.02 0.13
>
> library(fields)
> # rdist from fields method
> system.time({
+ mat1 <- as.matrix(sapply(df1, as.integer))
+ mat2 <- as.matrix(sapply(df2, as.integer))
+ rdist(mat1, mat2) < 1e-9
+ })
user system elapsed
0.15 0.00 0.16
I guess the rbind and the merges would make this solution relatively more costly with different data.
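A merge-free variant is also possible: build one key string per row, match both sets of keys into the shared unique keys, and compare the integer ids with outer. This is an untimed sketch, assuming df1 and df2 as originally defined (before the var3 helper column was added):
> # hypothetical merge-free id method (untimed sketch)
> k1 <- do.call(paste, c(df1, sep = "\r"))   # one key string per row
> k2 <- do.call(paste, c(df2, sep = "\r"))
> keys <- unique(c(k1, k2))                  # shared id space
> x <- outer(match(k1, keys), match(k2, keys), `==`)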
Upvotes: 0
Reputation: 89057
I was drawn here by your post on R-bloggers: http://jason.bryer.org/posts/2013-01-24/Comparing_Two_Data_Frames.html
If, as you say, your data has no numeric vectors, then I think I can suggest a faster approach. It consists in casting every column to integer and computing all pairwise Euclidean distances between rows; two rows are identical exactly when their distance is zero.
Quick example using your data:
mat1 <- as.matrix(sapply(df1, as.integer))
mat2 <- as.matrix(sapply(df2, as.integer))
library(fields)
rdist(mat1, mat2) < 1e-9
# [,1] [,2] [,3]
# [1,] FALSE FALSE FALSE
# [2,] FALSE TRUE FALSE
# [3,] FALSE FALSE TRUE
# [4,] FALSE FALSE FALSE
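If you want the question's row and column names on the result, they can be re-attached afterwards. A minimal sketch, assuming mat1 and mat2 from above:
# same comparison, with the original row names restored
res <- rdist(mat1, mat2) < 1e-9
dimnames(res) <- list(rownames(df1), rownames(df2))
res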
A few comments: I use the fields package to compute the Euclidean distance. It uses a Fortran implementation and is, as far as I know, the fastest R package around for the task (and I have tested many, trust me).
Upvotes: 1
Reputation: 173577
I'm honestly not sure if this will be faster, but you might try:
foo <- Vectorize(function(x,y) {all(df1[x,] == df2[y,])})
> outer(1:4,1:3,FUN = foo)
[,1] [,2] [,3]
[1,] FALSE FALSE FALSE
[2,] FALSE TRUE FALSE
[3,] FALSE FALSE TRUE
[4,] FALSE FALSE FALSE
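The hard-coded 1:4 and 1:3 generalize by deriving the index ranges from the data; a small sketch, assuming foo as defined above:
# works for data frames of any size
outer(seq_len(nrow(df1)), seq_len(nrow(df2)), FUN = foo)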
I feel compelled to at least mention the danger in using == for comparisons, as opposed to all.equal or identical. I'm presuming that you're comfortable enough with the data types involved that this won't be a problem.
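As a quick illustration (my own toy example, not from your data) of where == misleads with doubles while all.equal and identical do not:
0.1 + 0.2 == 0.3                    # FALSE: binary floating-point round-off
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal within a tolerance
identical(1L, 1)                    # FALSE: integer vs. double type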
Upvotes: 0