Remove columns from a dataframe based on number of rows with valid values

Question

I have a dataframe:

df = data.frame(gene = c("a", "b", "c", "d", "e"),
                value1 = c(NA, NA, NA, 2, 1),
                value2 = c(NA, 1, 2, 3, 4),
                value3 = c(NA, NA, NA, NA, 1))

I would like to keep all those columns (plus the first, gene) with more than or equal to atleast 2 valid values (i.e., not NA). How do I do this?

I am thinking something like this ...

df1 = df %>% select_if(function(.) ...)

Thanks

akrun · Accepted Answer

We can sum the non-NA elements and create a logical condition to select the columns of interest

library(dplyr)
df1 <- df %>%
          select_if(~ sum(!is.na(.)) > 2)
df1
#   gene value2
#1    a     NA
#2    b      1
#3    c      2
#4    d      3
#5    e      4

Or another option is keep

library(purrr)
keep(df, ~ sum(!is.na(.x)) > 2)

Or create the condition based on the number of rows

df %>%
   select_if(~ mean(!is.na(.)) > 0.5)

Or use Filter from base R

Filter(function(x) sum(!is.na(x)) > 2, df)

Remove columns from a dataframe based on number of rows with valid values

Answers (2)

Related Questions