Reputation: 1176

Finding rows containing a value (or values) in any column

Say we have a table 'data' containing Strings in several columns. We want to find the indices of all rows that contain a certain value, or better yet, one of several values. The column, however, is unknown.

What I do, at the moment, is:

apply(df, 2, function(x) which(x == "M017"))

where df =

1 04.10.2009 01:24:51   M017  <NA>  <NA>    NA
2 04.10.2009 01:24:53   M018  <NA>  <NA>    NA
3 04.10.2009 01:24:54   M051  <NA>  <NA>    NA
4 04.10.2009 01:25:06   <NA>  M016  <NA>    NA
5 04.10.2009 01:25:07   <NA>  M015  <NA>    NA
6 04.10.2009 01:26:07   <NA>  M017  <NA>    NA
7 04.10.2009 01:26:27   <NA>  M017  <NA>    NA
8 04.10.2009 01:27:23   <NA>  M017  <NA>    NA
9 04.10.2009 01:27:30   <NA>  M017  <NA>    NA
10 04.10.2009 01:27:32   M017  <NA>  <NA>    NA
11 04.10.2009 01:27:34   M051  <NA>  <NA>    NA

This also works if we try to find more than one value:

apply(df, 2, function(x) which(x %in% c("M017", "M018")))

The result being:

$`1`
integer(0)

$`2`
[1]  1  2 20

$`3`
[1] 16 17 18 19

$`4`
integer(0)

$`5`
integer(0)

However, processing the resulting list of lists is rather tedious.

Is there a more efficient way to find rows that contain a value (or more) in ANY column?

Upvotes: 37

Answers (4)

Ronak Shah

Reputation: 388982

Since _at, _all functions are superseded in dplyr now, we may use if_any to select rows which has any one of the values.

library(dplyr)

df %>% filter(if_any(starts_with("col"), ~.x %in% c("M017", "M018")))

#              datetime col1 col2 col3 col4
#1  04.10.2009 01:24:51 M017 <NA> <NA>   NA
#2  04.10.2009 01:24:53 M018 <NA> <NA>   NA
#6  04.10.2009 01:26:07 <NA> M017 <NA>   NA
#7  04.10.2009 01:26:27 <NA> M017 <NA>   NA
#8  04.10.2009 01:27:23 <NA> M017 <NA>   NA
#9  04.10.2009 01:27:30 <NA> M017 <NA>   NA
#10 04.10.2009 01:27:32 M017 <NA> <NA>   NA

Upvotes: 2

sbha

Reputation: 10422

Here's a dplyr option:

library(dplyr)

# across all columns:
df %>% filter_all(any_vars(. %in% c('M017', 'M018')))

# or in only select columns:
df %>% filter_at(vars(col1, col2), any_vars(. %in% c('M017', 'M018')))

Upvotes: 9

akrun

Reputation: 887108

If you want to find the rows that have any of the values in a vector, one option is to loop the vector (lapply(v1,..)), create a logical index of (TRUE/FALSE) with (==). Use Reduce and OR (|) to reduce the list to a single logical matrix by checking the corresponding elements. Sum the rows (rowSums), double negate (!!) to get the rows with any matches.

indx1 <- !!rowSums(Reduce(`|`, lapply(v1, `==`, df)), na.rm=TRUE)

Or vectorise and get the row indices with which with arr.ind=TRUE

indx2 <- unique(which(Vectorize(function(x) x %in% v1)(df),
                                     arr.ind=TRUE)[,1])

Benchmarks

I didn't use @kristang's solution as it is giving me errors. Based on a 1000x500 matrix, @konvas's solution is the most efficient (so far). But, this may vary if the number of rows are increased

val <- paste0('M0', 1:1000)
set.seed(24)
df1 <- as.data.frame(matrix(sample(c(val, NA), 1000*500, 
  replace=TRUE), ncol=500), stringsAsFactors=FALSE) 
set.seed(356)
v1 <- sample(val, 200, replace=FALSE)

 konvas <- function() {apply(df1, 1, function(r) any(r %in% v1))}
 akrun1 <- function() {!!rowSums(Reduce(`|`, lapply(v1, `==`, df1)),
               na.rm=TRUE)}
 akrun2 <- function() {unique(which(Vectorize(function(x) x %in% 
              v1)(df1),arr.ind=TRUE)[,1])}


 library(microbenchmark)
 microbenchmark(konvas(), akrun1(), akrun2(), unit='relative', times=20L)
 #Unit: relative
 #   expr       min         lq       mean     median         uq      max   neval
 # konvas()   1.00000   1.000000   1.000000   1.000000   1.000000  1.00000    20
 # akrun1() 160.08749 147.642721 125.085200 134.491722 151.454441 52.22737    20
 # akrun2()   5.85611   5.641451   4.676836   5.330067   5.269937  2.22255    20
 # cld
 #  a 
 #  b
 #  a

For ncol = 10, the results are slighjtly different:

expr       min        lq     mean    median        uq       max    neval
 konvas()  3.116722  3.081584  2.90660  2.983618  2.998343  2.394908    20
 akrun1() 27.587827 26.554422 22.91664 23.628950 21.892466 18.305376    20
 akrun2()  1.000000  1.000000  1.00000  1.000000  1.000000  1.000000    20

data

 v1 <- c('M017', 'M018')
 df <- structure(list(datetime = c("04.10.2009 01:24:51",
"04.10.2009 01:24:53", 
"04.10.2009 01:24:54", "04.10.2009 01:25:06", "04.10.2009 01:25:07", 
"04.10.2009 01:26:07", "04.10.2009 01:26:27", "04.10.2009 01:27:23", 
"04.10.2009 01:27:30", "04.10.2009 01:27:32", "04.10.2009 01:27:34"
), col1 = c("M017", "M018", "M051", "<NA>", "<NA>", "<NA>", "<NA>", 
"<NA>", "<NA>", "M017", "M051"), col2 = c("<NA>", "<NA>", "<NA>", 
"M016", "M015", "M017", "M017", "M017", "M017", "<NA>", "<NA>"
), col3 = c("<NA>", "<NA>", "<NA>", "<NA>", "<NA>", "<NA>", "<NA>", 
"<NA>", "<NA>", "<NA>", "<NA>"), col4 = c(NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA)), .Names = c("datetime", "col1", "col2", 
"col3", "col4"), class = "data.frame", row.names = c("1", "2", 
"3", "4", "5", "6", "7", "8", "9", "10", "11"))

Upvotes: 6

konvas

Reputation: 14346

How about

apply(df, 1, function(r) any(r %in% c("M017", "M018")))

The ith element will be TRUE if the ith row contains one of the values, and FALSE otherwise. Or, if you want just the row numbers, enclose the above statement in which(...).

Upvotes: 40

Finding rows containing a value (or values) in any column

Answers (4)

Benchmarks

data

Related Questions