user6528780

R - Compare 2 matrices to find which rows aren't in both

I have two large matrices in R of differing sizes, 371 x 1502 (A) and 371 x 1207 (B).

All of matrix B is included in A. A also contains many other rows mixed in. I am looking for a way to create a new matrix, C, which contains all the rows in A not found in B.

I am sure there is a way to do this using data.tables and keys but I can't for the life of me figure it out.

example data:

a = t(matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3))
b = t(matrix(c(1,2,3,7,8,9), nrow = 3))

Any help is appreciated,

Thanks.

Upvotes: 1

Views: 228

Answers (2)

Jealie

Reputation: 6277

I would do it in base R:

a[!duplicated(rbind(b,a))[(nrow(b)+1):(nrow(a)+nrow(b))], , drop = FALSE]

... But a data.table solution might be more elegant and/or quicker.
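To make the one-liner easier to follow, here is the same idea split into steps on the sample data from the question. The intermediate names (stacked, dup, dup_a, c_mat) are only illustrative, and this is a sketch rather than a tuned implementation:

```r
# Sample data from the question
a <- t(matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3))
b <- t(matrix(c(1, 2, 3, 7, 8, 9), nrow = 3))

# Stack b on top of a, so that any row of a that also occurs in b
# is flagged as a duplicate of the earlier (b) copy.
stacked <- rbind(b, a)
dup <- duplicated(stacked)

# Keep only the flags that belong to the rows of a
# (the last nrow(a) entries of the vector).
dup_a <- dup[(nrow(b) + 1):(nrow(b) + nrow(a))]

# Rows of a not found in b; drop = FALSE keeps a matrix
# even when only a single row is left.
c_mat <- a[!dup_a, , drop = FALSE]
print(c_mat)
```

One caveat of this approach: if a itself contains repeated rows, every repeat after the first is also flagged by duplicated() and dropped from the result.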


Thanks to @thelatemail, here is the data.table version (a and b must first be converted to data.tables, e.g. with as.data.table()):

a[!b, on=names(a)]

And here is a benchmark of all solutions proposed so far, here and by Maurits Evers:

require('data.table')
require('dplyr')  # anti_join() comes from dplyr, not plyr
require('microbenchmark')

n = 371
n1 = 1502
n2 = 1207

b = matrix(0 + sample.int(n * n2), ncol = n)
a = rbind(matrix(n*n2+1 + sample.int(n * (n1 - n2)), ncol = n), b)
b = b[sample.int(nrow(b)), ]

# preparing the data.table and data.frame versions of the data:
a.dt = data.table(a)
b.dt = data.table(b)
a.df = as.data.frame(a)
b.df = as.data.frame(b)

microbenchmark(
  BASE_R = a[!duplicated(rbind(b,a))[(nrow(b)+1):(nrow(a)+nrow(b))], ],
  DATA.TABLE = a.dt[!b.dt, on=names(a.dt)],
  DATA.TABLE2 = fsetdiff(a.dt, b.dt),
  PLYR = anti_join(a.df, b.df),
  times = 100
)

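For this problem, the dplyr solution proposed in the other answer (labelled PLYR in the output below) is the fastest. The two data.table solutions are within roughly a factor of two of it, and the base R version is more than an order of magnitude slower (median about 1.58 s, against 54 to 97 ms for the others).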

    Unit: milliseconds
        expr        min         lq       mean     median        uq       max neval cld
      BASE_R 1125.05968 1412.13170 1555.82674 1577.81665 1703.3674 1927.1632   100   c
  DATA.TABLE   54.68581   83.99182  117.90571   91.86808  123.8300  318.3788   100  b 
 DATA.TABLE2   58.44053   86.90981  127.11152   97.39086  138.8306  328.1396   100  b 
        PLYR   30.87235   49.32260   61.02968   53.66639   59.6925  278.6965   100 a  
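If you need to stay in base R, one alternative to the duplicated(rbind(...)) approach is to collapse every row into a single string key and compare the keys with %in%. This variant was not part of the benchmark above, so I make no claim about its speed; row_key is a hypothetical helper name, shown here on the sample data from the question:

```r
# Sample data from the question
a <- t(matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3))
b <- t(matrix(c(1, 2, 3, 7, 8, 9), nrow = 3))

# Hypothetical helper: paste the columns of each row into one string.
# The "\r" separator is an assumption that it never occurs in the data.
row_key <- function(m) do.call(paste, c(as.data.frame(m), sep = "\r"))

key_a <- row_key(a)
key_b <- row_key(b)

# Keep the rows of a whose key does not appear among the keys of b.
c_mat <- a[!(key_a %in% key_b), , drop = FALSE]
print(c_mat)
```

Note that the keys go through number-to-string conversion, so floating-point values that differ only beyond the printed precision could collide. Unlike the duplicated() version, this keeps repeated rows of a that are not in b.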

Upvotes: 2

Maurits Evers

Reputation: 50718

Is this what you want?

Using dplyr::anti_join:

require(dplyr)
anti_join(as.data.frame(a), as.data.frame(b))
#  V1 V2 V3
#1  4  5  6

Using data.table::fsetdiff:

require(data.table)
fsetdiff(as.data.table(a), as.data.table(b))
#   V1 V2 V3
#1:  4  5  6

Sample data

# Your sample data
a = t(matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3))
b = t(matrix(c(1,2,3,7,8,9), nrow = 3))

Upvotes: 1
