Reputation: 31
I have a dataset with 100,000 rows (people) and 500 columns (probabilities) and would like to scan across the columns with a test probability to find the title of the column (a,b or c) that is greater than and closest to the test value and record the title in a new column.
So for instance with the data.table:
data <- data.table( a = seq(0.2, 0.55, length.out = 9),
b = seq(0.35, 0.7, length.out = 9),
c = seq(0.5, 0.85, length.out = 9),
test = seq(0.1, 0.9, length.out = 9)
)
the new column would record "a" (as 0.1 <0.2) for the first row and then a, b, b, b, c, c, c, NA for the next eight rows. Th NAs would be recorded as the test probability is greater than the value in column c
a b c test
0.20000 0.35000 0.50000 0.1
0.24375 0.39375 0.54375 0.2
0.28750 0.43750 0.58750 0.3
0.33125 0.48125 0.63125 0.4
0.37500 0.52500 0.67500 0.5
0.41875 0.56875 0.71875 0.6
0.46250 0.61250 0.76250 0.7
0.50625 0.65625 0.80625 0.8
0.55000 0.70000 0.85000 0.9
I initially did it as a matrix rather than data.table. The code below won't work but gives an idea of how it operated
Switch <- pmax(as.matrix(data[,a:c])-matrix(rep(test,3), ncol=3, byrow=F),0)
# subtracts test from columns a,b,c and replaces negative values with 0
FirstSwitch <- Switch[,b:c]>0 & MemSwitch[,a:b]==0
# finds the first non-zero occurance
MonthSwitchMem <- apply(FirstSwitch, 1, which.max)
# calculates the column where the test probability first exceeds
How do I do this matching across columns in data.table. I think I need to use a query from .SDcols but am not sure how to do this?
Upvotes: 1
Views: 559
Reputation: 31
I adapted Karolis' answer so I transfer my columns from data.table to the snippet provided
data <- data.frame( a = seq(0.2, 0.55, length.out = 9),
b = seq(0.35, 0.7, length.out = 9),
c = seq(0.5, 0.85, length.out = 9),
test = seq(0.1, 0.9, length.out = 9)
)
data2 <- data.table(data)
id <- c("a","b","c")
f <- function(x, t) {colnames(data2)[apply(sign(x-t), 1,function(vec){ match(1, vec) })]}
data2[, f(.SD, data2[,test]),.SDcols=id ] # this line takes the columns with the probabilities and the test probability and transfer to function f
thanks for your help (and also reformatting my question. This is my first post so apologies for the bad formatting)
Prashant
Upvotes: 1
Reputation: 9666
This works with the data being matrix (not data.table).
colnames(data)[apply(sign(data[,1:3] - data[,4]), 1, function(vec){ match(1, vec) })]
Upvotes: 0