user1631306

Reputation: 4470

values exist in data.table in multiple columns R

I have a data.table

library(data.table)
dt <- data.table(c(1,2,3,4), c("chr1","chr1","chr2","chr3"), c(12,12,13,14))

This is just an example, but in reality I have millions of rows in three columns. I need to find out whether a given combination of values exists in it. For example, I want to check whether "chr1" in column 2 and 2 in column 1 occur together in the same row.

I was using "chr1" %in% dt$V2, but I don't know how to combine two value criteria, or how to make it fast for millions of rows.

Upvotes: 1

Views: 2866

Answers (1)

Arun

Reputation: 118799

In data.table, we can use both a) the traditional vector-scan approach (which is what base R and dplyr's filter do), and b) binary search, which is much faster than (a).
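For reference, approach (a) on this data would be an ordinary vector-scan subset (a minimal sketch, using the default column names V1/V2 that data.table assigned above):

dt[V1 == 2 & V2 == "chr1"]   # vector scan: evaluates the condition on every row

The keyed, binary-search version (approach (b)):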

require(data.table)
setkey(dt, V1, V2)             # sort by V1, V2 and mark the table as keyed
dt[J(2, "chr1"), nomatch=0L]   # binary search; by default nomatch=NA

nomatch=NA returns a row of NAs when there is no match, while nomatch=0L returns only the rows that do match.
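To see the difference, here is a sketch on the example table above, using a key pair that doesn't exist in it:

dt[J(5, "chr9")]               # nomatch=NA (default): one row with NAs in the non-key columns
dt[J(5, "chr9"), nomatch=0L]   # nomatch=0L: an empty data.table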

Setting keys on a data.table sorts it by those columns (and sets the attribute sorted). This lets data.table use binary search, which finds your query in O(log n) time (on average), as opposed to the vector-scan approach, which is O(n) because it has to scan through all the rows, and which also has to allocate a logical vector as long as the number of rows in your data.
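As a rough illustration of that difference (a sketch only; the sizes and column contents here are made up, and exact timings will vary by machine):

require(data.table)
set.seed(1)
N   <- 1e7                                   # pretend this is your "millions of rows"
big <- data.table(V1 = sample(1:1000, N, replace = TRUE),
                  V2 = sample(paste0("chr", 1:22), N, replace = TRUE),
                  V3 = runif(N))

system.time(big[V1 == 2L & V2 == "chr1"])    # vector scan: touches all N rows

setkey(big, V1, V2)                          # one-time sort
system.time(big[J(2L, "chr1")])              # binary search on the keyed table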

What @akrun and @rrs have shown are examples of vector scan approaches using data.table and dplyr, respectively. And @eddi has shown you (repeated here in this answer) the binary search approach using data.table.

This is particularly useful if you want to subset multiple times, since the cost of sorting is paid only once.
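For example, once the key is set, every further lookup reuses the sorted order, and you can look up several value pairs in a single call (a sketch on the example table; the pairs are paired positionally):

setkey(dt, V1, V2)                                          # repeated subsets don't re-sort
dt[J(c(1, 2, 4), c("chr1", "chr1", "chr3")), nomatch=0L]    # three keyed lookups in one call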

In the next versions of data.table, we're trying to make this speed-up happen internally with the normal vector-scan syntax. Matt has implemented it and is testing it. If and when that comes out, you can keep using the usual vector-scan syntax and, internally, data.table will use the binary-search approach. That'd be a great feature, I think you'd agree ;).

Upvotes: 5
