Subset a Data Frame based on Multiple Conditions

Question

I have the following data frame (my real data frame has more rows and columns but is ostensibly structured as such):

Root_R1 = c(1,2,3,4,5)
Root_R2 = c(1,0,3,0,0)
Root_R3 = c(1,0,3,0,0)
Shoot_R1 = c(1,0,3,4,5)
Shoot_R2 = c(0,0,31,4,5)
Shoot_R3 = c(0,0,0,0,0)
data.frame(Root_R1, Root_R2, Root_R3, Shoot_R1, Shoot_R2, Shoot_R3)

Root_R1 Root_R2 Root_R3 Shoot_R1 Shoot_R2 Shoot_R3
1    Root    Root    Root    Shoot    Shoot    Shoot
2       1       1       1        1        0        0
3       2       0       0        0        0        0
4       3       3       3        3       31        0
5       4       0       0        4        4        0
6       5       0       0        5        5        0

What I would like to do is filter this data frame and find all of the rows where at least two of the columns associated with a tissue type (Root, Shoot, etc.) have values greater than 0. So, for the columns associated with "Roots" (Column 1,2,3), Rows 1 - 3 should be returned while the ones associated with "Shoots"will have Rows 4 - 6 returned. I thought a ifelse code would work but that seems inefficient. Would filter from dplyr be more appropriate?

Rui Barradas · Accepted Answer

Here is a base R solution. grep tells the "Root" columns from the "Shoot" ones. Then apply loops returning logical (row) indices and which take care of sub-setting the data.frame.

Root_R1 = c("Root",1,2,3,4,5)
Root_R2 = c("Root",1,0,3,0,0)
Root_R3 = c("Root",1,0,3,0,0)
Shoot_R1 = c("Shoot",1,0,3,4,5)
Shoot_R2 = c("Shoot",0,0,31,4,5)
Shoot_R3 = c("Shoot",0,0,0,0,0)
df1 <- data.frame(Root_R1, Root_R2, Root_R3, Shoot_R1, Shoot_R2, Shoot_R3)

df1 <- df1[-1,]
df1[] <- lapply(df1, as.integer)

root <- grep("Root", names(df1))
shoot <- grep("Shoot", names(df1))
ok_root <- which(apply(df1[root], 1, \(x) sum(x > 0L) >= 2L))
ok_shoot <- which(apply(df1[shoot], 1, \(x) sum(x > 0L) >= 2L))

df1[ok_root, ]
#>   Root_R1 Root_R2 Root_R3 Shoot_R1 Shoot_R2 Shoot_R3
#> 2       1       1       1        1        0        0
#> 4       3       3       3        3       31        0
df1[ok_shoot, ]
#>   Root_R1 Root_R2 Root_R3 Shoot_R1 Shoot_R2 Shoot_R3
#> 4       3       3       3        3       31        0
#> 5       4       0       0        4        4        0
#> 6       5       0       0        5        5        0

^{Created on 2022-06-09 by the reprex package (v2.0.1)}

Edit

Following a question in comments

And assuming I want to change the numerical value of the cut-off, I would change this part of the code (sum(x > 0L)) and if I wanted to change the number of rows that meet the cut-off, I would change this: >= 2L?

here is a function to solve the problem.

special_subset <- function(x, colpattern, cutoff = 0L, numrows = 2L) {
  i_cols <- grep(colpattern, names(x))
  ok <- which(apply(x[i_cols], 1, \(y) sum(y > cutoff) >= numrows))
  x[ok, ]
}

special_subset(df1, "Root")
#>   Root_R1 Root_R2 Root_R3 Shoot_R1 Shoot_R2 Shoot_R3
#> 2       1       1       1        1        0        0
#> 4       3       3       3        3       31        0

special_subset(df1, "Shoot", cutoff = 1)
#>   Root_R1 Root_R2 Root_R3 Shoot_R1 Shoot_R2 Shoot_R3
#> 4       3       3       3        3       31        0
#> 5       4       0       0        4        4        0
#> 6       5       0       0        5        5        0

^{Created on 2022-06-09 by the reprex package (v2.0.1)}

Edit 2

To pass more than one colpattern to the function above, use a lapply loop.

In the two exaples below, first I use the new pipe operator introduced in R 4.2.0 and in the second a standard lapply.

tissue_type <- c("Root", "Shoot")

tissue_type |>
  lapply(\(pat, data) special_subset(data, pat), data = df1)
#> [[1]]
#>   Root_R1 Root_R2 Root_R3 Shoot_R1 Shoot_R2 Shoot_R3
#> 2       1       1       1        1        0        0
#> 4       3       3       3        3       31        0
#> 
#> [[2]]
#>   Root_R1 Root_R2 Root_R3 Shoot_R1 Shoot_R2 Shoot_R3
#> 4       3       3       3        3       31        0
#> 5       4       0       0        4        4        0
#> 6       5       0       0        5        5        0

lapply(tissue_type, \(pat, data) special_subset(data, pat), data = df1)
#> [[1]]
#>   Root_R1 Root_R2 Root_R3 Shoot_R1 Shoot_R2 Shoot_R3
#> 2       1       1       1        1        0        0
#> 4       3       3       3        3       31        0
#> 
#> [[2]]
#>   Root_R1 Root_R2 Root_R3 Shoot_R1 Shoot_R2 Shoot_R3
#> 4       3       3       3        3       31        0
#> 5       4       0       0        4        4        0
#> 6       5       0       0        5        5        0

^{Created on 2022-06-17 by the reprex package (v2.0.1)}

Subset a Data Frame based on Multiple Conditions

Answers (1)

Edit

Edit 2

Related Questions