Which() for the whole dataset

Question

I want to write a function in R that does the following: I have a table of cases and some data, and for each observation in the data I want to find the matching row in the table of cases. Example:

crit1 <- c(1,1,2)
crit2 <- c("yes","no","no")
Cases <- matrix(c(crit1,crit2),ncol=2,byrow=FALSE)

data1 <- c(1,2,1)
data2 <- c("no","no","yes")
data <- matrix(c(data1,data2),ncol=2,byrow=FALSE)

Now I want a function that returns, for each row of my data, the index of the matching row in Cases. Here the result would be the vector

c(2,3,1)

MichaelChirico · Accepted Answer

Are you sure you want to be using matrices for this?

Note that the numeric data in crit1 and data1 has been converted to string (matrices can only store one data type):

typeof(data[ , 1L])
# [1] "character"

In R, a data.frame is a much more natural choice for what you're after, since each column keeps its own type. data.table is (among many other things) a toolset for working with "enhanced" data.frames; see the Introduction vignette.
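
To make the type issue concrete, here is a minimal base-R sketch comparing the two containers. The names as_matrix and as_frame are only illustrative and do not appear in the question.

```r
crit1 <- c(1, 1, 2)
crit2 <- c("yes", "no", "no")

# A matrix holds a single type, so the numbers are coerced to strings
as_matrix <- matrix(c(crit1, crit2), ncol = 2)
# A data.frame stores each column separately, so crit1 stays numeric
as_frame <- data.frame(crit1, crit2)

typeof(as_matrix[, 1])
# [1] "character"
typeof(as_frame$crit1)
# [1] "double"
```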

I would create your data as:

library(data.table)
Cases = data.table(crit1, crit2)
data = data.table(data1, data2)

We can get the matching row indices as asked by doing a keyed join (See the vignette on keys):

setkey(Cases) # key by all columns
Cases
#    crit1 crit2
# 1:     1    no
# 2:     1   yes
# 3:     2    no
setkey(data)
data
#    data1 data2
# 1:     1    no
# 2:     1   yes
# 3:     2    no

Cases[data, which=TRUE]
# [1] 1 2 3

This differs from 2,3,1 because setkey() sorted both tables by reference, so the indices now refer to the re-sorted rows shown above. The matches themselves are still correct.

If you don't want to change the order of your data, it's slightly more complicated (but more readable if you're not used to data.table syntax):

Cases = data.table(crit1, crit2)
data = data.table(data1, data2)

Cases[data, on = setNames(names(data), names(Cases)), which=TRUE]
# [1] 2 3 1

The on= part creates the mapping between the columns of data and those of Cases.
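
As a small illustration of what that mapping looks like, the sketch below builds the same named character vector by hand. The variable names x_cols, i_cols and mapping are placeholders of mine. The names of the vector are the columns of Cases and the values are the corresponding columns of data.

```r
x_cols <- c("crit1", "crit2")  # column names of Cases (the table being indexed)
i_cols <- c("data1", "data2")  # column names of data (the table being looked up)

# Same shape as setNames(names(data), names(Cases)) in the join above
mapping <- setNames(i_cols, x_cols)

names(mapping)
# [1] "crit1" "crit2"
unname(mapping)
# [1] "data1" "data2"
```

Because the pairing is purely positional, this form only makes sense when the columns of the two tables are in the same order.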

We could write this in a bit more SQL-like fashion as:

Cases[data, on = .(crit1 == data1, crit2 == data2), which=TRUE]
# [1] 2 3 1

This is shorter and more readable for your sample data, but not as extensible if your data has many columns or if you don't know the column names in advance.
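
Since the question asks for a function, the unkeyed join can be wrapped up roughly as follows. This is only a sketch: match_rows is a hypothetical helper name, and it assumes both tables have the same number of columns, in the same order, with compatible types.

```r
library(data.table)

# Hypothetical helper: for each row of `obs`, return the index of the
# matching row in `cases`, pairing the columns by position.
match_rows <- function(cases, obs) {
  cases <- as.data.table(cases)
  obs <- as.data.table(obs)
  stopifnot(ncol(cases) == ncol(obs))
  # names = columns of `cases`, values = columns of `obs`
  cases[obs, on = setNames(names(obs), names(cases)), which = TRUE]
}

Cases <- data.table(crit1 = c(1, 1, 2), crit2 = c("yes", "no", "no"))
data <- data.table(data1 = c(1, 2, 1), data2 = c("no", "no", "yes"))

result <- match_rows(Cases, data)
result
# [1] 2 3 1
```

With data.table's default settings, a row of obs that has no match in cases yields NA. If cases contains duplicated rows, one index is returned per match, so the result can be longer than the number of rows in obs.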
