Danny Friar
Danny Friar

Reputation: 393

R search in multiple datatable columns

I need to subset a data.table based on a condition across multiple columns and then perform an operation on the result.

A minimal example with a simple data.table:

x <- data.table(id=c(1, 2, 3, 4), colour1 = c('red', 'green', 'green', 'blue'), 
                colour2 = c('yellow', 'red', 'blue', 'black'), 
                colour3 = c('blue', 'black', 'red', 'yellow'),
                score = c(0.7, 0.9, 0.2, 0.35))

I then want to find the maximum score for any row that contains the colour 'yellow':

max_score <- max(x[colour1 == 'yellow' | colour2 == 'yellow' | colour3 == 'yellow']$score)

In this example, the above code works fine but is there a better way of checking this condition across multiple columns? In practice, the table will be much larger and the number of columns can change so I'd like to do this dynamically.

Upvotes: 4

Views: 308

Answers (5)

talat
talat

Reputation: 70246

It may feel difficult because your data is in a wide (i.e. "untidy") format. Just convert to long format using melt and it's really simple.

melt(x, measure.vars = grep("^colo", names(x)))[,max(score[value == "yellow"])]
# [1] 0.7

or

melt(x, measure.vars = grep("^colo", names(x)))[value == "yellow", max(score)]

To see what's going on, run this first

melt(x, measure.vars = grep("^colo", names(x)))

and then the whole command.


Of course you can do it in two steps too, in case you want to keep the long format for further calculations:

x_tidy <- melt(x, measure.vars = grep("^colo", names(x)))
x_tidy[value == "yellow", max(score)]
#[1] 0.7

Upvotes: 7

jangorecki
jangorecki

Reputation: 16697

Using R computing on the language feature.

i.cols.equal <- function(cols, value) {
    l <- lapply(cols, function(col) call("==", as.name(col), value))
    Reduce(function(a, b) bquote(.(a) | .(b)), l)
}
ii <- i.cols.equal(cols=c("colour1", "colour2", "colour3"), value="yellow")
print(ii)
#colour1 == "yellow" | colour2 == "yellow" | colour3 == "yellow"
x[eval(ii), max(score)]
#[1] 0.7

Upvotes: 1

alexis_laz
alexis_laz

Reputation: 13122

Another idea, similar to jangorecki's (probably more straightforward), is to simply loop through the columns accumulating a "logical" vector of rows to check for max in x[["score"]] to save some memory usage:

i = logical(nrow(x))
for(j in which(startsWith(names(x), "colour"))) i = i | x[[j]] == "yellow" 
max(x[["score"]][i])
#[1] 0.7

Upvotes: 1

Robert
Robert

Reputation: 5152

with native R

y=data.frame(x)
max(y$score[apply(y[,grep("^colo", colnames(y))]=="yellow",1,any)])

[1] 0.7

Upvotes: 1

zx8754
zx8754

Reputation: 56004

Using rowSums:

max(
  x[ rowSums(x[, grepl("colour", colnames(x)), with = FALSE] == "yellow") > 0,
     "score", with = FALSE]
  )

Upvotes: 2

Related Questions