Reputation: 393
I need to subset a data.table based on a condition across multiple columns and then perform an operation on the result.
A minimal example with a simple data.table:
x <- data.table(id=c(1, 2, 3, 4), colour1 = c('red', 'green', 'green', 'blue'),
colour2 = c('yellow', 'red', 'blue', 'black'),
colour3 = c('blue', 'black', 'red', 'yellow'),
score = c(0.7, 0.9, 0.2, 0.35))
I then want to find the maximum score for any row that contains the colour 'yellow':
max_score <- max(x[colour1 == 'yellow' | colour2 == 'yellow' | colour3 == 'yellow']$score)
In this example, the above code works fine but is there a better way of checking this condition across multiple columns? In practice, the table will be much larger and the number of columns can change so I'd like to do this dynamically.
Upvotes: 4
Views: 308
Reputation: 70246
It may feel difficult because your data is in a wide (i.e. "untidy") format. Just convert to long format using melt
and it's really simple.
melt(x, measure.vars = grep("^colo", names(x)))[,max(score[value == "yellow"])]
# [1] 0.7
or
melt(x, measure.vars = grep("^colo", names(x)))[value == "yellow", max(score)]
To see what's going on, run this first
melt(x, measure.vars = grep("^colo", names(x)))
and then the whole command.
Of course you can do it in two steps too, in case you want to keep the long format for further calculations:
x_tidy <- melt(x, measure.vars = grep("^colo", names(x)))
x_tidy[value == "yellow", max(score)]
#[1] 0.7
Upvotes: 7
Reputation: 16697
Using R computing on the language feature.
i.cols.equal <- function(cols, value) {
l <- lapply(cols, function(col) call("==", as.name(col), value))
Reduce(function(a, b) bquote(.(a) | .(b)), l)
}
ii <- i.cols.equal(cols=c("colour1", "colour2", "colour3"), value="yellow")
print(ii)
#colour1 == "yellow" | colour2 == "yellow" | colour3 == "yellow"
x[eval(ii), max(score)]
#[1] 0.7
Upvotes: 1
Reputation: 13122
Another idea, similar to jangorecki's (probably more straightforward), is to simply loop through the columns accumulating a "logical" vector of rows to check for max
in x[["score"]]
to save some memory usage:
i = logical(nrow(x))
for(j in which(startsWith(names(x), "colour"))) i = i | x[[j]] == "yellow"
max(x[["score"]][i])
#[1] 0.7
Upvotes: 1
Reputation: 5152
with native R
y=data.frame(x)
max(y$score[apply(y[,grep("^colo", colnames(y))]=="yellow",1,any)])
[1] 0.7
Upvotes: 1
Reputation: 56004
Using rowSums:
max(
x[ rowSums(x[, grepl("colour", colnames(x)), with = FALSE] == "yellow") > 0,
"score", with = FALSE]
)
Upvotes: 2