R data.frame matching across columns and returning the column name of closest match

Question

I have a dataset with 100,000 rows (people) and 500 columns (probabilities) and would like to scan across the columns with a test probability to find the title of the column (a,b or c) that is greater than and closest to the test value and record the title in a new column.

So for instance with the data.table:

data <- data.table(   a = seq(0.2, 0.55, length.out = 9),
                      b = seq(0.35, 0.7, length.out = 9),
                      c = seq(0.5, 0.85, length.out = 9),
                   test = seq(0.1,  0.9, length.out = 9)
                  )

the new column would record "a" (as 0.1 <0.2) for the first row and then a, b, b, b, c, c, c, NA for the next eight rows. Th NAs would be recorded as the test probability is greater than the value in column c

       a    b   c   test
 0.20000    0.35000 0.50000 0.1
 0.24375    0.39375 0.54375 0.2
 0.28750    0.43750 0.58750 0.3
 0.33125    0.48125 0.63125 0.4
 0.37500    0.52500 0.67500 0.5
 0.41875    0.56875 0.71875 0.6
 0.46250    0.61250 0.76250 0.7
 0.50625    0.65625 0.80625 0.8
 0.55000    0.70000 0.85000 0.9

I initially did it as a matrix rather than data.table. The code below won't work but gives an idea of how it operated

Switch <- pmax(as.matrix(data[,a:c])-matrix(rep(test,3), ncol=3, byrow=F),0)  
# subtracts test from columns a,b,c and replaces negative values with 0

FirstSwitch <- Switch[,b:c]>0 & MemSwitch[,a:b]==0
#  finds the first non-zero occurance

MonthSwitchMem <-  apply(FirstSwitch, 1, which.max)
# calculates the column where the test probability first exceeds

How do I do this matching across columns in data.table. I think I need to use a query from .SDcols but am not sure how to do this?

Prashant Vaze · Accepted Answer

I adapted Karolis' answer so I transfer my columns from data.table to the snippet provided

data <- data.frame(   a = seq(0.2, 0.55, length.out = 9),
                  b = seq(0.35, 0.7, length.out = 9),
                  c = seq(0.5, 0.85, length.out = 9),
                  test = seq(0.1,  0.9, length.out = 9)
)
data2 <- data.table(data)
id <- c("a","b","c")
f <- function(x, t) {colnames(data2)[apply(sign(x-t), 1,function(vec){ match(1, vec) })]}
data2[, f(.SD, data2[,test]),.SDcols=id ]  #  this line takes the columns with the probabilities and the test probability and transfer to function f

thanks for your help (and also reformatting my question. This is my first post so apologies for the bad formatting)

Prashant

R data.frame matching across columns and returning the column name of closest match

Answers (2)

Related Questions