pdubois

Reputation: 7790

How to filter rows for every column independently using dplyr

I have the following tibble:


library(tidyverse)
df <- tibble::tribble(
  ~gene, ~colB, ~colC,
  "a",   1,  2,
  "b",   2,  3,
  "c",   3,  4,
  "d",   1,  1
)

df
#> # A tibble: 4 x 3
#>    gene  colB  colC
#>   <chr> <dbl> <dbl>
#> 1     a     1     2
#> 2     b     2     3
#> 3     c     3     4
#> 4     d     1     1

What I want is to filter every column after the gene column for values greater than or equal to 2 (>= 2), so that values below 2 become NA and rows with no remaining values are dropped. The result should be:

gene  colB  colC
a     NA    2
b     2     3
c     3     4

How can I achieve that?

In my real data there are more than two columns after gene.

Upvotes: 2

Views: 2500

Answers (4)

jkatam

Reputation: 3447

Alternatively, we could use rowwise() with c_across():

df %>%
  rowwise() %>%
  filter(any(c_across(starts_with('col')) >= 2)) %>%
  mutate(across(starts_with('col'), ~ ifelse(!(. >= 2), NA, .)))

Created on 2023-02-05 with reprex v2.0.2

# A tibble: 3 × 3
# Rowwise: 
  gene   colB  colC
  <chr> <dbl> <dbl>
1 a        NA     2
2 b         2     3
3 c         3     4

Upvotes: 0

akrun

Reputation: 886948

We can use data.table

library(data.table)
setDT(df)[df[, Reduce(`|`, lapply(.SD, `>=`, 2)), .SDcols = colB:colC]
   ][, (2:3) := lapply(.SD, function(x) replace(x, x < 2, NA)), .SDcols = colB:colC][]
#   gene colB colC
#1:    a   NA    2
#2:    b    2    3
#3:    c    3    4

Or with melt/dcast

dcast(melt(setDT(df), id.vars = 'gene')[value >= 2], gene ~ variable)
#   gene colB colC
#1:    a   NA    2
#2:    b    2    3
#3:    c    3    4
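
Since the question mentions more than two value columns, here is a minimal sketch of the first approach that does not hard-code colB:colC or the column positions 2:3. The names dt, cols, keep and res are introduced here for illustration, and the sketch assumes gene is the only identifier column.

```r
library(data.table)

dt <- data.table(
  gene = c("a", "b", "c", "d"),
  colB = c(1, 2, 3, 1),
  colC = c(2, 3, 4, 1)
)

# Every column except the identifier (assumption: "gene" is the only one)
cols <- setdiff(names(dt), "gene")

# Logical vector: TRUE where at least one value column is >= 2
keep <- dt[, Reduce(`|`, lapply(.SD, `>=`, 2)), .SDcols = cols]
res <- dt[keep]

# Set the remaining values below 2 to NA, by reference
res[, (cols) := lapply(.SD, function(x) replace(x, x < 2, NA)), .SDcols = cols]
```

Building cols with setdiff() means the same code should work unchanged when further value columns are added.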

Upvotes: 0

alistaire

Reputation: 43334

The forthcoming dplyr 0.6 (install from GitHub now, if you like) has filter_at, which can be used to keep any rows that have a value greater than or equal to 2. The values below 2 can then be set to NA with replace through mutate_at, so

df %>%
    filter_at(vars(-gene), any_vars(. >= 2)) %>%
    mutate_at(vars(-gene), funs(replace(., . < 2, NA)))
#> # A tibble: 3 x 3
#>    gene  colB  colC
#>   <chr> <dbl> <dbl>
#> 1     a    NA     2
#> 2     b     2     3
#> 3     c     3     4

or similarly,

df %>%
    mutate_at(vars(-gene), funs(replace(., . < 2, NA))) %>%
    filter_at(vars(-gene), any_vars(!is.na(.)))

which can be translated for use with dplyr 0.5:

df %>%
    mutate_at(vars(-gene), funs(replace(., . < 2, NA))) %>%
    filter(rowSums(is.na(.)) < (ncol(.) - 1))

All return the same thing.
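
On current dplyr releases the scoped verbs (filter_at, mutate_at) are superseded and funs() is deprecated. A minimal sketch of the same filter-then-replace idea, assuming dplyr 1.0.4 or later (needed for if_any()); res is a name introduced here for illustration:

```r
library(dplyr)

df <- tibble::tribble(
  ~gene, ~colB, ~colC,
  "a",   1,  2,
  "b",   2,  3,
  "c",   3,  4,
  "d",   1,  1
)

res <- df %>%
  # Keep rows where at least one non-gene column is >= 2
  filter(if_any(-gene, ~ .x >= 2)) %>%
  # Replace the remaining values below 2 with NA
  mutate(across(-gene, ~ replace(.x, .x < 2, NA)))
```

Selecting with -gene rather than naming the columns keeps this independent of how many value columns there are.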

Upvotes: 5

neilfws

Reputation: 33772

One solution: convert from wide to long format, so you can filter on just one column, then convert back to wide at the end if required. Note that this will drop genes where no values meet the condition.

library(tidyverse)
df %>%
  gather(name, value, -gene) %>%
  filter(value >= 2) %>%
  spread(name, value)

# A tibble: 3 x 3
   gene  colB  colC
* <chr> <dbl> <dbl>
1     a    NA     2
2     b     2     3
3     c     3     4
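
gather() and spread() still work but have been superseded in tidyr. A minimal sketch of the same wide-to-long-to-wide round trip with pivot_longer() and pivot_wider(), assuming tidyr 1.0.0 or later; res is a name introduced here for illustration:

```r
library(dplyr)
library(tidyr)

df <- tibble::tribble(
  ~gene, ~colB, ~colC,
  "a",   1,  2,
  "b",   2,  3,
  "c",   3,  4,
  "d",   1,  1
)

res <- df %>%
  # Long format: one row per gene/column pair
  pivot_longer(-gene, names_to = "name", values_to = "value") %>%
  filter(value >= 2) %>%
  # Back to wide; missing gene/column pairs become NA
  pivot_wider(names_from = name, values_from = value) %>%
  # pivot_wider() orders new columns by first appearance, so restore
  # the original order; any_of() tolerates columns that were filtered out entirely
  select(any_of(names(df)))
```

Unlike spread(), pivot_wider() creates the columns in the order the names first appear after filtering (here colC before colB), which is why the final select() is there.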

Upvotes: 5
