Bear
Bear

Reputation: 680

Check rows from all columns that matches specific value

If I have a data.table:

d <- data.table("ID" = c(1, 2, 2, 4, 6, 6), 
                "TYPE" = c(1, 1, 2, 2, 3, 3), 
                "CLASS" = c(1, 2, 3, 4, 5, 6))

I know I can remove values greater than a specific value like this:

r <- d[!(d$TYPE > 2), ]

However, if I want to apply this to all of the columns in the entire table instead of just TYPE (basically drop any rows that have a value > 2 in the entire table), how would I generalize the above statement (avoiding using a for loop if possible).

I know I can do d > 2 resulting in a boolean index table, but if I put that into the above line of code it give me an error:

d[!d>2, ]

Results in a invalid matrix type

Note

It was brought up that this question is similar to Return an entire row if the value in any specific set of columns meets a certain criteria. However, they are working with a data.frame and I am working with a data.table the notation is different. Not a duplicate question due to that.

Upvotes: 1

Views: 219

Answers (2)

Uwe
Uwe

Reputation: 42544

I was wondering what the fastest approach would be for a varying number of rows and columns.

So, here is a benchmark.

It excludes the ID column from being checked for which is not exactly in line with OP's question but is a sensible decision, IMHO.

library(data.table)
library(bench)
bm <- press(
  n_row = c(1E1, 1E3, 1E5),
  n_col = c(2, 10, 50),
  {  
    set.seed(1L)
    d <- data.table(
      ID = seq_len(n_row),
      matrix(sample(10, n_row*n_col, TRUE), ncol = n_col)
    )
    mark(
      m1 = d[d[, !apply(.SD > 2, 1, any), .SDcols = -"ID"]],
      m2 = d[!d[, apply(.SD > 2, 1, any), .SDcols = -"ID"]],
      m3 = d[!d[, which(apply(.SD > 2, 1, any)), .SDcols = -"ID"]],
      m4 = d[d[, rowSums(.SD > 2) == 0, .SDcols = -"ID"]],
      m5 = d[!d[, Reduce(any, lapply(.SD, `>`, y = 2)), by = 1:nrow(d), .SDcols = -"ID"]$V1]
    )
  })

ggplot2::autoplot(bm)

enter image description here

Apparently, the rowSums() approach is almost always the fastest method.

Upvotes: 2

BENY
BENY

Reputation: 323226

Using apply with any

d[!apply(d>2,1,any)]
   ID TYPE CLASS
1:  1    1     1
2:  2    1     2

Or rowSums

d[rowSums(d>2)==0,]
   ID TYPE CLASS
1:  1    1     1
2:  2    1     2

Upvotes: 2

Related Questions