I have two large matrices in R of differing sizes, 371 x 1502 (A) and 371 x 1207 (B).
All of matrix B is included in A. A also contains many other rows mixed in. I am looking for a way to create a new matrix, C, which contains all the rows in A not found in B.
I am sure there is a way to do this using data.tables and keys but I can't for the life of me figure it out.
example data:
a = t(matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3))
b = t(matrix(c(1,2,3,7,8,9), nrow = 3))
Any help is appreciated,
Thanks.
Upvotes: 1
Views: 228
Reputation: 6277
I would do it in base R:
a[!duplicated(rbind(b,a))[(nrow(b)+1):(nrow(a)+nrow(b))], ]
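To make the one-liner above easier to follow, here is a step-by-step sketch on the sample data from the question. The intermediate names (stacked, dup, in_b, c_mat) are mine, and drop = FALSE is an addition so that the result stays a matrix when only one row is left. Note that duplicated() also flags rows that are repeated within a itself, so those would be dropped too.

```r
# Sample matrices from the question
a <- t(matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3))
b <- t(matrix(c(1,2,3,7,8,9), nrow = 3))

stacked <- rbind(b, a)                      # rows of b first, then rows of a
dup     <- duplicated(stacked)              # TRUE where a row was already seen above
in_b    <- dup[(nrow(b) + 1):nrow(stacked)] # keep only the flags belonging to a's rows
c_mat   <- a[!in_b, , drop = FALSE]         # rows of a not flagged; stays a matrix
c_mat
```

Without drop = FALSE, a single remaining row would be returned as a plain vector rather than a 1 x 3 matrix.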
... But a data.table solution might be more elegant and/or quicker.
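Thanks to @thelatemail, here is the data.table version. It assumes a and b have already been converted to data.tables (for example with as.data.table()), since the anti-join syntax does not work on plain matrices:
a[!b, on=names(a)]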
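As a minimal sketch of that anti-join on the sample data from the question, including the conversion from and back to a matrix (the names a.dt, b.dt, c.dt and c_mat are only illustrative):

```r
library(data.table)

# Sample matrices from the question
a <- t(matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3))
b <- t(matrix(c(1,2,3,7,8,9), nrow = 3))

a.dt <- as.data.table(a)   # columns are named V1, V2, V3 by default
b.dt <- as.data.table(b)

# Anti-join: keep the rows of a.dt that have no match in b.dt on all columns
c.dt  <- a.dt[!b.dt, on = names(a.dt)]
c_mat <- as.matrix(c.dt)   # back to a matrix if that is what is needed
c_mat
```

Joining on names(a.dt) means every column takes part in the match, so two rows count as equal only if all of their values agree.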
And here is a benchmark of all solutions proposed so far, here and by Maurits Evers:
require('data.table')
require('dplyr')  # anti_join() comes from dplyr, not plyr
require('microbenchmark')
n = 371
n1 = 1502
n2 = 1207
b = matrix(0 + sample.int(n * n2), ncol = n)
a = rbind(matrix(n*n2+1 + sample.int(n * (n1 - n2)), ncol = n), b)
b = b[sample.int(nrow(b)), ]
# preparing the data.table and data.frame versions of the data:
a.dt = data.table(a)
b.dt = data.table(b)
a.df = as.data.frame(a)
b.df = as.data.frame(b)
microbenchmark(
BASE_R = a[!duplicated(rbind(b,a))[(nrow(b)+1):(nrow(a)+nrow(b))], ],
DATA.TABLE = a.dt[!b.dt, on=names(a.dt)],
DATA.TABLE2 = fsetdiff(a.dt, b.dt),
PLYR = anti_join(a.df, b.df),
times = 100
)
For this problem, the dplyr solution proposed in the other answer (labelled PLYR in the benchmark) is the fastest. The two data.table solutions trail it closely, and the base R version is much slower, with a median of about 1.58 s against roughly 54 ms for anti_join.
Unit: milliseconds
        expr        min         lq       mean     median        uq       max neval cld
      BASE_R 1125.05968 1412.13170 1555.82674 1577.81665 1703.3674 1927.1632   100   c
  DATA.TABLE   54.68581   83.99182  117.90571   91.86808  123.8300  318.3788   100   b
 DATA.TABLE2   58.44053   86.90981  127.11152   97.39086  138.8306  328.1396   100   b
        PLYR   30.87235   49.32260   61.02968   53.66639   59.6925  278.6965   100   a
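Before comparing timings it is worth checking that the approaches return the same rows. The following is only a sketch on smaller, made-up sizes (n, n1, n2 and the seed are arbitrary choices of mine, and the helper row_keys is hypothetical); it covers the base R and the two data.table variants, and compares rows regardless of order.

```r
library(data.table)

set.seed(1)                   # arbitrary seed, only to make the sketch repeatable
n <- 5; n1 <- 12; n2 <- 8     # small stand-ins for 371, 1502 and 1207

# Same construction as in the benchmark: b's rows are a shuffled subset of a's
b <- matrix(sample.int(n * n2), ncol = n)
a <- rbind(matrix(n * n2 + sample.int(n * (n1 - n2)), ncol = n), b)
b <- b[sample.int(nrow(b)), ]

a.dt <- data.table(a)
b.dt <- data.table(b)

base_res <- a[!duplicated(rbind(b, a))[(nrow(b) + 1):(nrow(a) + nrow(b))], , drop = FALSE]
join_res <- a.dt[!b.dt, on = names(a.dt)]
diff_res <- fsetdiff(a.dt, b.dt)

# Hypothetical helper: one sorted string per row, so row order does not matter
row_keys <- function(m) sort(apply(as.matrix(m), 1, paste, collapse = "-"))

stopifnot(identical(row_keys(base_res), row_keys(join_res)),
          identical(row_keys(join_res), row_keys(diff_res)))
nrow(base_res)   # n1 - n2 rows are left
```

One difference to keep in mind: fsetdiff() drops duplicated rows by default (all = FALSE), whereas the anti-join keeps them, so the two can disagree when a contains repeated rows.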
Upvotes: 2
Reputation: 50718
Is this what you want?
Using dplyr::anti_join:
require(dplyr)
anti_join(as.data.frame(a), as.data.frame(b))
# V1 V2 V3
#1 4 5 6
Using data.table::fsetdiff:
require(data.table)
fsetdiff(as.data.table(a), as.data.table(b))
# V1 V2 V3
#1: 4 5 6
# Your sample data
a = t(matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3))
b = t(matrix(c(1,2,3,7,8,9), nrow = 3))
Upvotes: 1