Remove % of Items in Columns

Question

I'm trying to drop columns in which more than 90% of the values are NA. I followed the linked post below, but I only get values in return rather than a data frame, and I'm not sure what I'm doing wrong. I expected an actual data frame. I also tried wrapping the result in as.data.frame, but that did not work either.

Linked Post: Delete columns/rows with more than x% missing

Example DF

gene cell1 cell2 cell3 
A    0.4   0.1   NA
B    NA    NA    0.1
C    0.4   NA    0.5
D    NA    NA    0.5
E    0.5   NA    0.6
F    0.6   NA    NA

Desired DF

gene cell1  cell3 
A    0.4     NA
B    NA      0.1
C    0.4     0.5
D    NA      0.5
E    0.5     0.6
F    0.6     NA

Code

#Select Genes that have NA values for 90% of a given cell line
df_col <- df[,2:ncol(df)]
df_col <-df_col[, which(colMeans(!is.na(df_col)) > 0.9)]
df <- cbind(df[,1], df_col)

GuedesBF · Accepted Answer

I would use dplyr here.

If you want to use select() with logical conditions, you are probably looking for the where() selection helper in dplyr. It can be used like this: select(where(condition))

I used an 80% threshold because, with this example data, a 90% threshold would keep all columns (cell2 is about 83% NA) and would therefore not illustrate the solution as well.
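To make the threshold choice concrete, here is a small sketch that computes the fraction of NA values per column for the example data (the same data frame given in the Data section of this answer). The name na_frac is only an illustrative variable, not something from the question.

```r
# Example data, copied from the "Data" section of this answer
df <- data.frame(gene  = letters[1:6],
                 cell1 = c(0.4, NA, 0.4, NA, 0.5, 0.6),
                 cell2 = c(0.1, rep(NA, 5)),
                 cell3 = c(NA, 0.1, 0.5, 0.5, 0.6, NA))

# Fraction of NA values in each column except the first (gene)
na_frac <- colMeans(is.na(df[-1]))
round(na_frac, 3)
# cell1 cell2 cell3
# 0.333 0.833 0.333

na_frac < 0.9  # TRUE for every column, so nothing would be dropped
na_frac < 0.8  # FALSE only for cell2, so cell2 is dropped
```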

library(dplyr)

df %>% select(where(~mean(is.na(.))<0.8))

It can also be done with base R and colMeans:

df[, c(TRUE, colMeans(is.na(df[-1]))<0.8)]
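As for why the code in the question may return plain values, here are two likely causes (this is a guess, since the real data is not shown). First, colMeans(!is.na(df_col)) > 0.9 tests the fraction of non-NA values, so it keeps only columns that are more than 90% complete rather than dropping those that are more than 90% NA. Second, when a single column is selected with [, j], base R simplifies the result to a vector unless drop = FALSE is given. A minimal sketch of the second point, using the example data:

```r
# Example data, copied from the "Data" section of this answer
df <- data.frame(gene  = letters[1:6],
                 cell1 = c(0.4, NA, 0.4, NA, 0.5, 0.6),
                 cell2 = c(0.1, rep(NA, 5)),
                 cell3 = c(NA, 0.1, 0.5, 0.5, 0.6, NA))

# Selecting one column with [, j] simplifies to a plain vector by default
class(df[, 2])                # "numeric"
class(df[, 2, drop = FALSE])  # "data.frame"

# Same filter as the base R one-liner, but with drop = FALSE so the
# result stays a data frame even if only one column survives
keep_cols <- c(TRUE, colMeans(is.na(df[-1])) < 0.8)
result <- df[, keep_cols, drop = FALSE]
names(result)                 # "gene" "cell1" "cell3"
```

With drop = FALSE there is no need for the separate cbind() step from the question.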

or with purrr:

library(purrr)

df %>% keep(~mean(is.na(.))<0.8)

Output:

  gene cell1 cell3
1    a   0.4    NA
2    b    NA   0.1
3    c   0.4   0.5
4    d    NA   0.5
5    e   0.5   0.6
6    f   0.6    NA

Data

df <- data.frame(gene  = letters[1:6],
                 cell1 = c(0.4, NA, 0.4, NA, 0.5, 0.6),
                 cell2 = c(0.1, rep(NA, 5)),
                 cell3 = c(NA, 0.1, 0.5, 0.5, 0.6, NA))
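
If the threshold needs to vary (for example 90% on the real data), the base R approach can be wrapped in a small helper. This is only a sketch: drop_sparse_cols, max_na, and protect are hypothetical names introduced here, not part of any package.

```r
# Example data (same as above)
df <- data.frame(gene  = letters[1:6],
                 cell1 = c(0.4, NA, 0.4, NA, 0.5, 0.6),
                 cell2 = c(0.1, rep(NA, 5)),
                 cell3 = c(NA, 0.1, 0.5, 0.5, 0.6, NA))

# Hypothetical helper: drop columns whose NA fraction exceeds max_na.
# Columns named in `protect` are always kept (e.g. an ID column).
drop_sparse_cols <- function(data, max_na = 0.9, protect = character(0)) {
  na_frac <- colMeans(is.na(data))
  keep <- na_frac <= max_na | names(data) %in% protect
  data[, keep, drop = FALSE]
}

names(drop_sparse_cols(df, max_na = 0.8, protect = "gene"))
# "gene" "cell1" "cell3"
names(drop_sparse_cols(df, max_na = 0.9, protect = "gene"))
# "gene" "cell1" "cell2" "cell3"
```

Note that this helper keeps a column whose NA fraction is exactly equal to max_na, which matches "drop columns with more than 90% NA" in the question. The one-liners above use a strict < instead.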
